If you search for R or Python on the internet, you will find millions of articles and kilometers of discussion about which one is better, faster, and more convenient for working with data. Unfortunately, most of these articles and arguments are not particularly useful.
The purpose of this article is to compare the basic data-processing techniques in the most popular packages of both languages, and to help readers quickly pick up something they do not yet know. Those who write in Python can find out how to do the same thing in R, and vice versa.
In the course of the article we will look at the syntax of the most popular packages in R: the packages included in the tidyverse library, and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.
We will walk step by step through the whole data analysis path, from loading data to computing window functions, using Python and R.
This article can be used as a cheat sheet if you have forgotten how to perform some data-processing operation in one of the packages under consideration.
Contents
1. Main syntax differences between R and Python
1.1. Accessing package functions
1.2. Assignment
1.3. Indexing
1.4. Methods and OOP
1.5. Pipelines
1.6. Data structures
2. A few words about the packages we will use
2.1. tidyverse
2.2. data.table
2.3. pandas
3. Installing packages
4. Loading data
5. Creating dataframes
6. Selecting the columns you need
7. Filtering rows
8. Grouping and aggregation
9. Vertical union of tables (UNION)
10. Horizontal join of tables (JOIN)
11. Basic window functions and calculated columns
12. Correspondence table between data-processing methods in R and Python
13. Conclusion
14. A short poll on which package you use
Main syntax differences between R and Python
To make it easier for you to switch from Python to R, or vice versa, here are a few key points you need to pay attention to.
Accessing package functions
Once a package is loaded in R, you do not need to specify the package name to access its functions. Spelling out the package name is usually not customary in R, but it is allowed. You can also skip loading a package altogether if you need just one of its functions in your code: simply call it by specifying the package name and the function name. The separator between the package and function names in R is a double colon: package_name::function_name().
In Python, by contrast, it is considered standard to call a package's functions by explicitly specifying its name. When a package is imported, it is usually given a short alias; for example, pandas is usually imported under the alias pd. A package function is accessed with a dot: package_name.function_name().
Assignment
In R it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mainly used to pass values to function arguments.
In Python, assignment is done only with a single equals sign: obj_name = value.
Indexing
There are significant differences here as well. In R, indexing starts at one, and a range includes every element it names, including the last one.
In Python, indexing starts at zero, and a selected range does not include the last position named in the index. So the construct x[i:j] in Python will not include the element at position j.
Negative indexing also differs. In R, the notation x[-1] returns all elements of the vector except the first one. In Python, the same notation returns only the last element.
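A minimal sketch of the Python side of these indexing rules, using a plain list. The list `x` and the variable names are made up for illustration. In R, the same two middle elements would be selected with x[2:3].

```python
# Python indexing is zero-based, and a slice excludes its end position.
x = [10, 20, 30, 40, 50]

first = x[0]            # the first element lives at position 0
middle = x[1:3]         # positions 1 and 2; position 3 is NOT included
last = x[-1]            # a negative index counts from the end
all_but_last = x[:-1]   # everything except the last element

print(first, middle, last, all_but_last)
```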
Methods and OOP
R implements OOP in its own way; I wrote about this in a separate article. In my experience, moving to tidyverse is easier than moving to pandas, although this may be my subjective opinion.
In short, objects in R do not have methods (if we are talking about S3 classes; there are other, much less common OOP implementations). There are only generic functions that process objects differently depending on the class of the object.
Pipelines
Perhaps this name is not entirely accurate for pandas, but I will try to explain the meaning.
To avoid saving intermediate calculations and producing unnecessary objects in the working environment, you can use a kind of pipeline, i.e. pass the result of a calculation from one function to the next without saving intermediate results.
Take the following code example, in which we store intermediate calculations in separate objects:
temp_object <- func1()
temp_object2 <- func2(temp_object )
obj <- func3(temp_object2 )
We performed 3 operations in sequence, and each result was saved in a separate object. In fact, we do not need these intermediate objects.
Or even worse, but more familiar to Excel users:
obj <- func3(func2(func1()))
In this case we did not save intermediate results, but code with nested functions is very inconvenient to read.
We will look at several approaches to data processing in R, and they perform similar operations in different ways.
Pipelines in the tidyverse library are implemented by the %>% operator.
obj <- func1() %>%
  func2() %>%
  func3()
So we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). Finally, we write the result of all the calculations to the object obj.
In data.table, chains are used in a similar way.
newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]
In each pair of square brackets you can use the result of the previous operation.
In pandas, such operations are separated by a dot.
obj = df.fun1().fun2().fun3()
That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is stored in the object obj.
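As a rough illustration of dot chaining in pandas, the sketch below filters, sorts and re-indexes a small made-up table in one expression. The table `df` and its values are hypothetical; the methods used (`sort_values`, `reset_index`) are standard pandas methods.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 3, 7, 20]})

# Each step returns a new DataFrame, so the calls can be chained with dots.
# Wrapping the chain in parentheses allows one step per line.
obj = (
    df[df["sessions"] >= 5]                      # keep rows with 5+ sessions
    .sort_values("sessions", ascending=False)    # largest first
    .reset_index(drop=True)                      # renumber rows from 0
)
print(obj)
```

No intermediate objects are left behind: only `df` and `obj` exist after the chain runs.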
Data structures
Data structures in R and Python are similar, but they have different names.
Description | Name in R | Name in Python/pandas
--- | --- | ---
Table structure | data.frame, data.table, tibble | DataFrame
One-dimensional list of values | Vector | Series in pandas, or list in pure Python
Multi-level non-tabular structure | List | Dictionary (dict)
We will look at some other features and syntax differences below.
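To make the Python column of this comparison concrete, here is a small sketch showing a DataFrame, a Series, a plain list and a nested dictionary side by side. All names and values are invented for illustration.

```python
import pandas as pd

# A table structure: the pandas counterpart of data.frame / tibble / data.table.
df = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# A single column of a DataFrame is a Series (roughly, a vector in R).
column = df["gender"]

# A Series can be converted to a plain Python list.
plain_list = column.tolist()

# A multi-level, non-tabular structure: a dict (roughly, a list in R).
nested = {"name": "lookup", "shape": {"rows": 2, "cols": 2}}

print(type(df).__name__, type(column).__name__, plain_list, nested["shape"]["rows"])
```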
A few words about the packages we will use
First, a little about the packages you will get to know in this article.
tidyverse
Official website:
The tidyverse library was written by Hadley Wickham, Senior Research Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.
The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example, dplyr is designed for data manipulation, tidyr brings data into a tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.
The advantage of tidyverse is its simple, easy-to-read syntax, which in many ways resembles the SQL query language.
data.table
The author of data.table is Matt Dowle of H2O.ai.
The first release of the library took place in 2006.
The package syntax is not as convenient as in tidyverse and is more reminiscent of classic data frames in R, but with significantly extended functionality.
All table manipulations in this package are described inside square brackets, and if you translate the data.table syntax into SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]
The strength of this package is the speed with which it processes large amounts of data.
pandas
Official website:
The name of the library comes from the econometric term "panel data", used to describe multidimensional structured data sets.
The author of pandas is the American developer Wes McKinney.
When it comes to data analysis in Python, pandas has no equal. It is a multifunctional, high-level package that lets you perform any manipulation with data, from loading it from any source to visualizing it.
Installing additional packages
The packages discussed in this article are not included in the base distributions of R and Python. There is one small caveat: if you installed the Anaconda distribution, you do not need to install pandas separately.
Installing packages in R
If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard install.packages() command, running it directly in R.
# install packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")
After installation, the packages need to be attached, which is usually done with the library() command.
# attach (import) the packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)
Installing packages in Python
If you have pure Python installed, you need to install pandas manually. Open the command line or a terminal, depending on your operating system, and enter the following command.
pip install pandas
Then return to Python and import the installed package with the import command.
import pandas as pd
Loading data
Getting the data is one of the most important steps in data analysis. Both Python and R give you broad options for obtaining data from almost any source: local files, files from the internet, websites, all kinds of databases.
Throughout the article we will use several datasets:
- Two exports from Google Analytics.
- The Titanic passenger dataset.
All the data is hosted on GitHub; the full URLs appear in the loading examples below.
Loading data in R: tidyverse, vroom, readr
There are two packages for loading data in the tidyverse library: vroom and readr. vroom is the more modern one, but the packages may be merged in the future.
Quote from the vroom documentation:
vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.
In this article we will look at both packages for loading data:
Loading data in R: the vroom package
# install.packages("vroom")
library(vroom)
# read the data
## vroom
ga_nov <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in R: readr
# install.packages("readr")
library(readr)
# read the data
## readr
ga_nov <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
In the vroom package, regardless of whether the data is in csv or tsv format, loading is done by the function of the same name, vroom(). In the readr package we use a different function for each format: read_tsv() and read_csv().
Loading data in R: data.table
data.table has the fread() function for loading data.
Loading data in R: the data.table package
# install.packages("data.table")
library(data.table)
## data.table
ga_nov <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in Python: pandas
Compared with the R packages, the closest in syntax to pandas is readr, because pandas can read data from almost anywhere, and the package has a whole family of read_*() functions:
read_csv()
read_excel()
read_sql()
read_json()
read_html()
There are many other functions designed to read data from various formats. For our purposes, read_table() or read_csv() is enough, using the sep argument to specify the column separator.
Loading data in Python: pandas
import pandas as pd
ga_nov = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep = "\t")
ga_dec = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep = "\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
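To try the sep argument without a network connection, the sketch below reads a tab-separated sample from an in-memory string. The sample rows are invented and only stand in for the Google Analytics export; `io.StringIO` is used here purely so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample (NOT the real export from the article).
raw = (
    "date\tsource\tsessions\n"
    "2019-11-01\tgoogle\t12\n"
    "2019-11-01\tbing\t3\n"
)

# sep="\t" tells read_csv that columns are separated by tab characters.
sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample)
```

With the default separator (a comma) the same input would be read as a single column, which is why the tab has to be specified explicitly.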
Creating dataframes
The titanic table we loaded has a Sex field, which stores the passenger's gender identifier.
For a more convenient presentation of the data by passenger gender, it is better to use the gender name rather than the gender code.
To do this, we will create a small lookup table with only 2 columns (the code and the gender name) and, accordingly, 2 rows.
Creating a dataframe in R: tidyverse, dplyr
In the code example below, we create the dataframe we need using the tibble() function.
Creating a dataframe in R: dplyr
## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
gender = c("female", "male"))
Creating a dataframe in R: data.table
Creating a dataframe in R: the data.table package
## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
gender = c("female", "male"))
Creating a dataframe in Python: pandas
In pandas, a dataframe is created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.
Creating a dataframe in Python: pandas
# create the dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)
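As a side note, the `pd.DataFrame` constructor also accepts a dictionary of columns directly, so the two steps can be folded into one. This is an alternative sketch, not the approach used in the article.

```python
import pandas as pd

# Build the same lookup table in a single step by passing the dict
# of columns straight to the DataFrame constructor.
gender = pd.DataFrame({"id": [1, 2],
                       "gender": ["female", "male"]})
print(gender)
```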
Selecting columns
The tables you work with may contain dozens or even hundreds of columns. But to carry out an analysis you usually do not need every column available in the source table.
Therefore, one of the first operations you will perform on a source table is to strip it of unnecessary information and free the memory that this information occupies.
Selecting columns in R: tidyverse, dplyr
The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL, you will master this package quickly.
To select columns, use the select() function.
Below are code examples that select columns in the following ways:
- By listing the names of the required columns
- By matching column names with regular expressions
- By data type or any other property of the data contained in the column
Selecting columns in R: dplyr
# select the required columns
## dplyr
### select by column name
select(ga_nov, date, source, sessions)
### exclude by column name
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: integer columns only
select_if(ga_nov, is.integer)
Selecting columns in R: data.table
The same operations are done slightly differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping the data
Selecting columns in R: data.table
## data.table
### select by column name
ga_nov[ , .(date, source, sessions) ]
### exclude by column name
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]
The .SD variable lets you refer to all columns, and .SDcols filters the required columns using regular expressions or other functions that pick out the column names you need.
Selecting columns in Python: pandas
To select columns by name in pandas, it is enough to pass a list of their names. To select or exclude columns by name or by regular expression, use the drop() and filter() functions together with the argument axis=1, which indicates that columns rather than rows should be processed.
To select fields by data type, use the select_dtypes() function, passing to its include or exclude arguments a list of the data types of the fields you need to select.
Selecting columns in Python: pandas
# select fields by name
ga_nov[['date', 'source', 'sessions']]
# exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# select by regular expression
ga_nov.filter(regex="s$", axis=1)
# select numeric fields
ga_nov.select_dtypes(include=['number'])
# select text fields
ga_nov.select_dtypes(include=['object'])
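The same column-selection calls can be tried on a small in-memory table. The sketch below uses a made-up frame `df` whose column names mimic the Google Analytics export; the values are invented.

```python
import pandas as pd

# Hypothetical miniature of the ga_nov table.
df = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "source": ["google", "bing"],
    "medium": ["organic", "cpc"],
    "sessions": [12, 3],
    "bounces": [4, 1],
})

by_name = df[["date", "source", "sessions"]]      # pick columns by name
dropped = df.drop(columns=["medium", "bounces"])  # same effect as axis=1
ends_with_s = df.filter(regex="s$", axis=1)       # names ending in "s"
numeric = df.select_dtypes(include=["number"])    # numeric columns only

print(list(ends_with_s.columns), list(numeric.columns))
```

The `columns=` keyword of `drop()` is an equivalent, slightly more explicit spelling of passing the labels together with `axis=1`.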
Filtering rows
For example, the source table may contain several years of data, while you only need to analyze the last month. Again, extra rows slow down data processing and clog up the PC's memory.
Filtering rows in R: tidyverse, dplyr
In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, followed by the filter conditions.
When writing logical expressions to filter a table, specify the column names without quotes and without naming the table.
When filtering on several logical expressions, use the following operators:
- & or a comma: logical AND
- |: logical OR
Filtering rows in R: dplyr
# filter rows
## dplyr
### filter rows by one condition
filter(ga_nov, source == "google")
### filter by two conditions joined by logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined by logical OR
filter(ga_nov, source == "google" | sessions >= 10)
Filtering rows in R: data.table
As I wrote above, in data.table the data transformation syntax is enclosed in square brackets.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping the data
The i argument, which takes the first position in the square brackets, is used to filter rows.
Columns are referenced in logical expressions without quotes and without specifying the table name.
Logical expressions are combined in the same way as in dplyr, using the & and | operators.
Filtering rows in R: data.table
## data.table
### filter rows by one condition
ga_nov[source == "google"]
### filter by two conditions joined by logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined by logical OR
ga_nov[source == "google" | sessions >= 10]
Filtering rows in Python: pandas
Filtering by rows in pandas is similar to filtering in data.table and is done in square brackets.
In this case, columns must be referenced together with the dataframe name; the column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).
If you need to filter a dataframe by several conditions, each condition must be placed in parentheses. Logical conditions are joined with the & and | operators.
Filtering rows in Python: pandas
# filter table rows
### filter rows by one condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined by logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined by logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
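A small runnable sketch of these filters on invented data follows. It also shows `DataFrame.query()`, a standard pandas method not used in the article, which lets you write the condition as a string without repeating the dataframe name. The parentheses around each condition are needed because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Hypothetical sample standing in for ga_nov.
df = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 3, 7, 20]})

# Boolean masks: each condition sits in its own parentheses.
both = df[(df["source"] == "google") & (df["sessions"] >= 10)]
either = df[(df["source"] == "google") | (df["sessions"] >= 10)]

# The same AND filter written as a query string.
both_query = df.query('source == "google" and sessions >= 10')

print(both, either, sep="\n\n")
```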
Grouping and aggregating data
One of the most frequently used operations in data analysis is grouping and aggregation.
The syntax for these operations differs across all the packages we are reviewing.
In this case we will take the titanic dataframe as an example and calculate the number of tickets and the average ticket price by cabin class.
Grouping and aggregating data in R: tidyverse, dplyr
In dplyr, the group_by() function is used for grouping and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare the basic syntax, so we will not go that deep.
Basic aggregation functions:
- sum(): sum
- min() / max(): minimum and maximum value
- mean(): average
- median(): median
- length(): count
Grouping and aggregation in R: dplyr
## dplyr
### group and aggregate rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price = mean(Fare))
As the first argument to group_by() we passed the titanic table, and then specified the Pclass field by which the table is grouped. The result of this operation is passed via the %>% operator as the first argument to summarise(), which adds 2 more fields: passangers and avg_price. In the first, the length() function counts the number of tickets, and in the second, the mean() function gives the average ticket price.
Grouping and aggregating data in R: data.table
In data.table, the j argument, which takes the second position in the square brackets, is used for aggregation, and by or keyby, which take the third position, are used for grouping.
The list of aggregation functions is the same as the one described for dplyr, because these are functions from base R.
Grouping and aggregation in R: data.table
## data.table
### group and aggregate rows
titanic[, .(passangers = length(PassengerId),
            avg_price = mean(Fare)),
        by = Pclass]
Grouping and aggregating data in Python: pandas
Grouping in pandas is similar to dplyr, but aggregation is not like either dplyr or data.table.
To group, use the groupby() method, passing it a list of the columns by which the dataframe will be grouped.
For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions will be applied, and the values are the names of the aggregation functions.
Aggregation functions:
- sum(): sum
- min() / max(): minimum and maximum value
- mean(): average
- median(): median
- count(): count
The reset_index() function in the example below resets the nested indexes that pandas sets by default after grouping data.
The \ symbol lets you continue a statement on the next line.
Grouping and aggregation in Python: pandas
# group and aggregate the data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
    reset_index()
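For comparison with the summarise() call in dplyr, pandas also supports "named aggregation", where each output column is named explicitly as a (column, function) pair. The sketch below runs on an invented five-row sample rather than the real Titanic data, and the output column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature of the titanic table.
titanic_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
summary = (
    titanic_sample.groupby("Pclass")
    .agg(passengers=("PassengerId", "count"),
         avg_price=("Fare", "mean"))
    .reset_index()   # turn the Pclass index back into a regular column
)
print(summary)
```

Wrapping the chain in parentheses is an alternative to ending each line with a backslash.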
Vertical union of tables
An operation in which you combine two or more tables with the same structure. The data we loaded contains the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.
These are exports from Google Analytics for November and December; in this section we will combine them into a single table.
Vertical union of tables in R: tidyverse, dplyr
In dplyr you can combine 2 tables into one using the bind_rows() function, passing the tables as its arguments.
Vertical union of tables in R: dplyr
# vertical union of tables
## dplyr
bind_rows(ga_nov, ga_dec)
Vertical union of tables in R: data.table
Nothing complicated here either: we use rbind().
Vertical union of tables in R: data.table
## data.table
rbind(ga_nov, ga_dec)
Vertical union of tables in Python: pandas
In pandas, the concat() function is used to combine tables; you need to pass it a list of the dataframes to combine.
Vertical union of tables in Python: pandas
# vertical union of tables
pd.concat([ga_nov, ga_dec])
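One detail worth knowing: by default `pd.concat()` keeps the original row labels of each input, so the combined table can contain repeated index values. A minimal sketch on two invented tables, using the standard `ignore_index` parameter to renumber the rows:

```python
import pandas as pd

# Hypothetical stand-ins for the November and December exports.
nov = pd.DataFrame({"source": ["google", "bing"], "sessions": [12, 3]})
dec = pd.DataFrame({"source": ["google", "direct"], "sessions": [9, 20]})

kept_labels = pd.concat([nov, dec])                     # index: 0, 1, 0, 1
renumbered = pd.concat([nov, dec], ignore_index=True)   # index: 0, 1, 2, 3

print(list(kept_labels.index), list(renumbered.index))
```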
Horizontal join of tables
An operation in which columns from a second table are added to the first by key. It is often used to enrich a fact table (for example, a table with sales data) with some reference data (for example, product cost).
There are several types of joins.
In the titanic table we loaded earlier there is a Sex column, which holds the passenger's gender code:
1 - female
2 - male
We also created a lookup table, gender. For a more convenient presentation of the passengers' gender data, we need to add the gender name from the gender lookup table to the titanic table.
Horizontal join of tables in R: tidyverse, dplyr
dplyr has a whole family of functions for horizontal joins:
inner_join()
left_join()
right_join()
full_join()
semi_join()
nest_join()
anti_join()
The one I use most often in practice is left_join().
As their first two arguments, the functions listed above take the two tables to be joined, and in the third argument, by, you specify the columns to join on.
Horizontal join of tables in R: dplyr
# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))
Horizontal join of tables in R: data.table
In data.table, tables are joined by key using the merge() function.
Arguments of the merge() function in data.table
- x, y: the tables to join
- by: the key column for the join, if it has the same name in both tables
- by.x, by.y: the names of the columns to join on, if they have different names in the two tables
- all, all.x, all.y: the join type. all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), and all.y corresponds to a RIGHT JOIN (keeps all rows of the second table).
Horizontal join of tables in R: data.table
# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)
Horizontal join of tables in Python: pandas
Just as in data.table, the merge() function is used to join tables in pandas.
Arguments of the merge() function in pandas
- how: the join type: left, right, outer, inner
- on: the key column, if it has the same name in both tables
- left_on, right_on: the names of the key columns, if they have different names in the two tables
Horizontal join of tables in Python: pandas
# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
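The behavior of a left join is easiest to see when one key has no match. In the sketch below, the `passengers` table is invented, and code 3 is deliberately absent from the lookup table, so that row keeps its place but gets a missing value for gender.

```python
import pandas as pd

# Hypothetical fact table and lookup table.
passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 3]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# Left join: every row of `passengers` is kept, matched on Sex == id.
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined)
```

With how="inner" the unmatched third row would be dropped instead.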
Basic window functions and calculated columns
Window functions are similar in meaning to aggregation functions and are also often used in data analysis. But unlike aggregation functions, window functions do not change the number of rows in the output dataframe.
In essence, with a window function we split the incoming dataframe into parts by some criterion, i.e. by the value of one field or several fields, and perform arithmetic operations within each window. The result of these operations is returned for every row, i.e. without changing the total number of rows in the table.
For example, take the titanic table. We can calculate what percentage each ticket's cost made up within its cabin class.
To do this, for each row we need the total ticket cost for the cabin class that the ticket in that row belongs to, and then we divide the cost of each ticket by the total cost of all tickets in the same cabin class.
Window functions in R: tidyverse, dplyr
To add new columns without collapsing rows into groups, dplyr provides the mutate() function.
You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field into a new column. Then ungroup the table and divide the Fare values by the result of the previous step.
Window functions in R: dplyr
group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)
Window functions in R: data.table
The algorithm is the same as in dplyr: we split the table into windows by the Pclass field, write the group total to a new column for every row, and add a column in which we calculate each ticket's share of its group's total cost.
To add new columns, data.table has the := operator. Below is an example of solving the problem with the data.table package.
Window functions in R: data.table
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]
Window functions in Python: pandas
One way to add a new column in pandas is to use the assign() function. To sum the ticket costs by cabin class without collapsing the rows, we will use the transform() function.
Below is an example solution that adds the same 2 columns to the titanic table.
Window functions in Python: pandas
titanic.assign(Pclass_cost = titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
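The same window calculation can be checked by hand on a tiny invented sample. In the sketch below, both new columns are passed to `assign()` as lambdas; the second one can refer to `Pclass_cost` because `assign()` evaluates its keyword arguments in order.

```python
import pandas as pd

# Hypothetical miniature of the titanic table.
titanic_sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 20.0, 50.0, 10.0, 30.0],
})

result = titanic_sample.assign(
    # Group total repeated on every row of the group (a window function).
    Pclass_cost=lambda d: d.groupby("Pclass")["Fare"].transform("sum"),
    # Share of each ticket in the total cost of its class.
    ticket_fare_rate=lambda d: d["Fare"] / d["Pclass_cost"],
)
print(result)
```

The number of rows is unchanged, which is what distinguishes `transform()` from `agg()`.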
Correspondence table of functions and methods
Below is a table showing how the various data operations map onto each other in the packages we have covered.
Description | tidyverse | data.table | pandas
--- | --- | --- | ---
Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv()
Creating dataframes | tibble() | data.table() | dict() + from_dict()
Selecting columns | select() | argument j, second position in square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes()
Filtering rows | filter() | argument i, first position in square brackets | list the filter conditions in square brackets / filter()
Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg()
Vertical union of tables (UNION) | bind_rows() | rbind() | concat()
Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge()
Basic window functions and adding calculated columns | group_by() + mutate() | argument j using the := operator + argument by | transform() + assign()
Conclusion
Perhaps the implementations of data processing I described in this article are not the most optimal ones, so I will be glad if you correct my mistakes in the comments, or simply supplement the article with other techniques for working with data in R / Python.
As I wrote above, the purpose of this article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.
Poll
Which of the following packages do you use in your work?
In the comments you can explain the reason for your choice.
Which data-processing package do you use (multiple options allowed)?
- tidyverse: 45.2% (19 votes)
- data.table: 33.3% (14 votes)
- pandas: 54.8% (23 votes)
42 users voted. 9 users abstained.
Source: www.habr.com