Search the Internet for R or Python and you will find millions of articles and kilometers of discussion about which one is better, faster, and more convenient for working with data. Unfortunately, most of those articles and arguments are not particularly useful.
The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up whatever they do not yet know: those who write in Python can learn how to do the same thing in R, and vice versa.
Along the way we will go through the syntax of the most popular R packages, namely the packages of the tidyverse library and the data.table package, and compare their syntax with pandas, the most popular data analysis package in Python.
We will walk step by step through the whole data analysis workflow, from loading the data to applying analytical window functions, in both Python and R.
Contents
This article can be used as a cheat sheet if you have forgotten how to perform some data processing operation in one of the packages covered here.
- Main syntax differences between R and Python
  1.1. Accessing package functions
  1.2. Assignment
  1.3. Indexing
  1.4. Methods and OOP
  1.5. Pipelines
  1.6. Data structures
- A few words about the packages we will use
  2.1. tidyverse
  2.2. data.table
  2.3. pandas
- Installing packages
- Loading data
- Creating dataframes
- Selecting the columns you need
- Filtering rows
- Grouping and aggregation
- Vertical union of tables (UNION)
- Horizontal join of tables (JOIN)
- Basic window functions and calculated columns
- Correspondence table between data processing methods in R and Python
- Conclusion
- A short poll about which package you use
Main syntax differences between R and Python
To make it easier for you to switch from Python to R, or vice versa, here are a few key points to keep in mind.
Accessing package functions
Once a package is loaded in R, you do not need to specify the package name to call its functions. Prefixing calls with the package name is not customary in R, but it is allowed. You also do not have to load a package at all if you need only one of its functions: simply call it by giving both the package name and the function name. The separator between the package name and the function name in R is a double colon: package_name::function_name().
In Python, by contrast, the classic approach is to call package functions with the package name stated explicitly. When a package is imported it is usually given a short alias; for pandas, for example, the conventional alias is pd. A package function is accessed through a dot: package_name.function_name().
Assignment
In R it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mostly used to pass values to function arguments.
In Python, assignment is done only with a single equals sign: obj_name = value.
Indexing
There are significant differences here as well. In R, indexing starts at one and a range includes every element it names. In Python, indexing starts at zero and a selected range does not include the last element named in the index, so the construct x[i:j] in Python will not include the j-th element.
Negative indexing also differs. In R, x[-1] returns all elements of the vector except the first one. In Python, the same notation returns only the last element.
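To make the indexing rules concrete, here is a minimal Python sketch. The list x and its values are made up for illustration; the comments note what the R counterpart would be.

```python
# Hypothetical sample values, used only to illustrate the indexing rules.
x = [10, 20, 30, 40, 50]

first = x[0]           # Python indexing starts at zero (R would use x[1])
middle = x[1:3]        # the stop index is excluded: positions 1 and 2 only
last = x[-1]           # a negative index counts from the end: the last element
all_but_first = x[1:]  # what R's x[-1] returns: everything except the first element

print(first, middle, last, all_but_first)
```
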
Methods and OOP
R implements OOP in its own way; I have written about this in a separate article. For some users, switching to tidyverse will be easier than switching to pandas, although that may be just my subjective opinion.
In short, objects in R do not have methods (if we are talking about S3 classes; there are other, less common OOP implementations). There are only generic functions, which behave differently depending on the class of the object.
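For Python readers, the closest analogue to such a generic function is probably functools.singledispatch from the standard library. The sketch below is only an analogy and not how R implements S3; the function name describe is hypothetical.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    # Fallback used when no more specific implementation is registered.
    return "generic object"


@describe.register
def _(obj: int):
    # Implementation chosen when the argument is an int.
    return "an integer"


@describe.register
def _(obj: list):
    # Implementation chosen when the argument is a list.
    return f"a list of {len(obj)} elements"


print(describe(3), "|", describe([1, 2]), "|", describe(2.5))
```

As with an S3 generic, the caller always writes describe(obj), and the implementation is picked from the class of the argument rather than from a method attached to the object.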
Pipelines
The name may not be entirely accurate for pandas, but I will try to explain the idea.
To avoid storing intermediate results and creating unnecessary objects in the working environment, you can use a kind of pipeline, i.e. pass the result of one function straight on to the next one without saving anything in between.
Take the following code example, where intermediate results are stored in separate objects:
temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)
We performed 3 operations in sequence and saved the result of each in its own object, although we do not actually need those intermediate objects.
Or, even worse, but more familiar to Excel users:
obj <- func3(func2(func1()))
Here we did not store the intermediate results, but code with nested function calls is very hard to read.
We will look at several approaches to data processing in R, and they handle such sequences of operations in different ways.
Pipelines in the tidyverse library are implemented with the %>% operator.
obj <- func1() %>%
  func2() %>%
  func3()
So we take the result of func1() and pass it as the first argument to func2(), then pass the result of that call as the first argument to func3(). Finally, the result of the whole chain is written to the object obj.
In data.table, chains are used in a similar way.
newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]
Each pair of square brackets works on the result of the previous operation.
In pandas, such operations are separated by a dot.
obj = df.fun1().fun2().fun3()
That is, we take our table df and call its method fun1(), then call fun2() on the result, and then fun3(). The final result is stored in the object obj.
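As a concrete illustration of dot chaining, here is a minimal sketch with real pandas methods standing in for fun1(), fun2() and fun3(). The toy table is made up and is not one of the data sets used in this article.

```python
import pandas as pd

# Hypothetical toy data, used only to show method chaining.
df = pd.DataFrame({"source": ["google", "bing", "google"],
                   "sessions": [10, 3, 7]})

# Each step returns a new DataFrame, so the next method can be called on it directly.
obj = (
    df.loc[lambda d: d["sessions"] > 5]  # keep rows with more than 5 sessions
      .sort_values("sessions")           # order the remaining rows by sessions
      .reset_index(drop=True)            # renumber the rows from zero
)
print(obj)
```

Wrapping the chain in parentheses is one common way to split it over several lines.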
Data structures
Data structures in R and Python are similar, but have different names.

| Description | Name in R | Name in Python/pandas |
|---|---|---|
| Tabular structure | data.frame, data.table, tibble | DataFrame |
| One-dimensional list of values | Vector | Series in pandas, or list in pure Python |
| Multi-level non-tabular structure | List | Dictionary (dict) |

We will look at some other features and syntax differences below.
A few words about the packages we will use
First, a little about the packages you will get to know in this article.
tidyverse
The tidyverse library was written by Hadley Wickham, Chief Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.
The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of them is aimed at a specific task. For example, dplyr is designed for data manipulation, tidyr for bringing data into tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.
The advantage of tidyverse is its simple, readable syntax, which in many ways resembles the SQL query language.
data.table
The author of data.table is Matt Dowle of H2O.ai.
The first release of the library took place in 2006.
The package syntax is not as convenient as in tidyverse and is closer to classic R data frames, but with considerably extended functionality.
All table manipulations in this package are written inside square brackets, and if you translate data.table syntax into SQL you get something like this: data.table[ WHERE, SELECT, GROUP BY ]
The strength of this package is the speed with which it processes large amounts of data.
pandas
The name of the library comes from the econometric term "panel data", which describes multidimensional structured data sets.
The author of pandas is the American developer Wes McKinney.
When it comes to data analysis in Python, pandas has no equal. It is a multifunctional, high-level package that lets you perform any manipulation on data, from loading it from any source to visualizing it.
Installing additional packages
The packages discussed in this article are not part of the base R and Python distributions. There is one small caveat: if you installed the Anaconda distribution, you do not need to install pandas separately.
Installing packages in R
If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard install.packages() command, run directly in R.
# install packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")
After installation the packages need to be attached, which in most cases is done with the library() command.
# attach (import) packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)
Installing packages in Python
If you have plain Python installed, you need to install pandas manually. Open a command prompt or terminal, depending on your operating system, and enter the following command.
pip install pandas
Then go back to Python and import the installed package with the import statement.
import pandas as pd
Loading data
Extracting data is one of the most important steps in data analysis. Both Python and R give you ample means to get data from almost any source: local files, files on the Internet, websites, and all kinds of databases.
Throughout the article we will use several data sets:
- Two exports from Google Analytics.
- The Titanic passenger data set.
All the data is hosted on my GitHub, and we will load it from there.
Loading data in R: tidyverse, vroom, readr
The tidyverse library offers two packages for loading data: vroom and readr. vroom is the more modern of the two, but the packages may be merged in the future.
A quote from the vroom documentation:
vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.
In this article we will look at both packages for loading data:
Loading data in R: the vroom package
# install.packages("vroom")
library(vroom)
# Read the data
## vroom
ga_nov <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in R: readr
# install.packages("readr")
library(readr)
# Read the data
## readr
ga_nov <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
In the vroom package, loading is done by the function of the same name, vroom(), regardless of whether the data is in csv or tsv format. In the readr package we use a separate function for each format: read_tsv() and read_csv().
Loading data in R: data.table
In data.table, data is loaded with the fread() function.
Loading data in R: the data.table package
# install.packages("data.table")
library(data.table)
## data.table
ga_nov <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in Python: pandas
Compared with the R packages, the syntax closest to pandas here is that of readr, because pandas can read data from almost anywhere and the package has a whole family of read_*() functions.
read_csv()
read_excel()
read_sql()
read_json()
read_html()
There are many other functions for reading data in various formats. For our purposes read_table() or read_csv() is enough, with the sep argument specifying the column separator.
Loading data in Python: pandas
import pandas as pd
# the Google Analytics exports are tab-separated, hence sep="\t"
ga_nov = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep="\t")
ga_dec = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep="\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
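If you want to see the effect of the sep argument without downloading anything, here is a small offline sketch. The tab-separated text and the name ga_sample are made up for illustration; io.StringIO simply stands in for a file.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a .tsv file.
tsv_text = (
    "date\tsource\tsessions\n"
    "2019-11-01\tgoogle\t12\n"
    "2019-11-01\tbing\t3\n"
)

# sep="\t" tells read_csv that the columns are separated by tab characters.
ga_sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(ga_sample)
```
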
Creating dataframes
The titanic table we loaded has a field Sex, which stores the passenger's gender identifier.
For a more convenient presentation of the data by passenger gender, it is better to use the gender name than the gender code.
To do that we will create a small lookup table with only 2 columns (the code and the gender name) and, accordingly, 2 rows.
Creating a dataframe in R: tidyverse, dplyr
In the code example below we create the dataframe we need with the tibble() function.
Creating a dataframe in R: dplyr
## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))
Creating a dataframe in R: data.table
Creating a dataframe in R: data.table
## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                     gender = c("female", "male"))
Creating a dataframe in Python: pandas
In pandas, a frame is created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.
Creating a dataframe in Python: pandas
# create a dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)
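As a side note, the pd.DataFrame constructor also accepts a dictionary of columns directly. The sketch below compares the two forms on the same lookup table; the name gender_from_dict is only used for the comparison.

```python
import pandas as pd

gender_dict = {"id": [1, 2],
               "gender": ["female", "male"]}

# Passing the dictionary straight to the constructor...
gender = pd.DataFrame(gender_dict)
# ...builds the same table as the explicit from_dict() call.
gender_from_dict = pd.DataFrame.from_dict(gender_dict)

print(gender.equals(gender_from_dict))
```
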
Selecting columns
The tables you work with may contain dozens or even hundreds of columns. For analysis, as a rule, you do not need every column in the source table.
So one of the first operations you will perform on a source table is to clear it of unnecessary information and free the memory that information occupies.
Selecting columns in R: tidyverse, dplyr
The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL you will master this package quickly.
To select columns, use the select() function.
Below are code examples that select columns in the following ways:
- By listing the names of the required columns
- By matching column names against a regular expression
- By data type or any other property of the data stored in the column
Selecting columns in R: dplyr
# Select the required columns
## dplyr
### select by column names
select(ga_nov, date, source, sessions)
### exclude by column names
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: integer columns only
select_if(ga_nov, is.integer)
Selecting columns in R: data.table
The same operations are done a little differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping of the data
Selecting columns in R: data.table
## data.table
### select by column names
ga_nov[ , .(date, source, sessions) ]
### exclude by column names
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]
The .SD variable gives you access to all columns, and .SDcols narrows them down to the ones you need, using regular expressions or other functions that filter column names.
Selecting columns in Python: pandas
To select columns by name in pandas, it is enough to pass a list of their names. To select or exclude columns by name or by regular expression, use the drop() and filter() functions with the argument axis=1, which says that columns rather than rows should be processed.
To select fields by data type, use the select_dtypes() function and pass the include or exclude argument a list of the data types of the fields you want to select.
Selecting columns in Python: pandas
# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])
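The same selections can be tried offline on a small made-up table. The sketch below assumes a toy frame ga with the column names used above; its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Google Analytics export.
ga = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "source": ["google", "bing"],
    "medium": ["organic", "cpc"],
    "sessions": [12, 3],
    "bounces": [4, 1],
})

by_name = ga[["date", "source", "sessions"]]    # list the columns to keep
without = ga.drop(["medium", "bounces"], axis=1)  # list the columns to drop
ends_with_s = ga.filter(regex="s$", axis=1)     # column names ending in "s"
numeric = ga.select_dtypes(include=["number"])  # numeric columns only

print(list(ends_with_s.columns), list(numeric.columns))
```

None of these calls modifies ga itself; each returns a new dataframe.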
Filtering rows
For example, the source table may contain several years of data while you need to analyze only the last month. Again, the extra rows slow down processing and clog up the PC's memory.
Filtering rows in R: tidyverse, dplyr
In dplyr, rows are filtered with the filter() function. It takes a dataframe as its first argument, followed by the filtering conditions.
When writing logical expressions to filter a table, give the column names without quotes and without the table name.
When filtering on several logical expressions, use the following operators:
- & or a comma - logical AND
- | - logical OR
Filtering rows in R: dplyr
# filtering rows
## dplyr
### filter rows by one condition
filter(ga_nov, source == "google")
### filter by two conditions joined by logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined by logical OR
filter(ga_nov, source == "google" | sessions >= 10)
Filtering rows in R: data.table
As I wrote above, in data.table the data transformation syntax lives inside square brackets.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping of the data
Rows are filtered with the argument i, which takes the first position inside the square brackets.
Columns are referenced in logical expressions without quotes and without the table name.
Logical expressions are combined in the same way as in dplyr, with the & and | operators.
Filtering rows in R: data.table
## data.table
### filter rows by one condition
ga_nov[source == "google"]
### filter by two conditions joined by logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined by logical OR
ga_nov[source == "google" | sessions >= 10]
Filtering rows in Python: pandas
Filtering rows in pandas is similar to filtering in data.table and is also done inside square brackets.
Here, columns must be referenced through the name of the dataframe. The column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).
If you need to filter a dataframe on several conditions, each condition must be wrapped in parentheses. The conditions are joined with the & and | operators.
Filtering rows in Python: pandas
# Filtering table rows
### filter rows by one condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined by logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined by logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
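Here is a self-contained sketch of the same filters on made-up data. It also shows DataFrame.query(), which is not used elsewhere in this article but lets you write the condition as a string, somewhat closer to the dplyr style.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the export.
ga = pd.DataFrame({"source": ["google", "bing", "google", "yandex"],
                   "sessions": [12, 30, 4, 2]})

# Each condition is wrapped in parentheses and combined with & or |.
both = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]
either = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

# The same AND filter written as a query string.
same_with_query = ga.query('source == "google" and sessions >= 10')

print(len(both), len(either))
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.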
Grouping and aggregating data
Grouping and aggregation are among the most frequently used operations in data analysis.
The syntax for these operations differs across all the packages we are looking at.
As an example we will take the titanic dataframe and calculate the number of tickets and their average price by cabin class.
Grouping and aggregating data in R: tidyverse, dplyr
In dplyr, grouping is done with the group_by() function and aggregation with summarise(). In fact dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go that deep.
Basic aggregation functions:
- sum() - sum
- min() / max() - minimum and maximum value
- mean() - mean
- median() - median
- length() - count
Grouping and aggregation in R: dplyr
## dplyr
### group and aggregate rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price = mean(Fare))
We passed the titanic table to group_by() as its first argument and named the field Pclass, by which the table is grouped. The result of that operation was passed by the %>% operator as the first argument to summarise(), where we added 2 more fields: passangers and avg_price. In the first, the length() function counts the number of tickets; in the second, the mean() function gives the average ticket price.
Grouping and aggregating data in R: data.table
In data.table, aggregation is done with the argument j, which takes the second position inside the square brackets, and grouping with by or keyby, which take the third position.
The list of aggregation functions is the same as the one given for dplyr, because these are base R functions.
Grouping and aggregation in R: data.table
## data.table
### group and aggregate rows
titanic[, .(passangers = length(PassengerId),
            avg_price = mean(Fare)),
        by = Pclass]
Grouping and aggregating data in Python: pandas
Grouping in pandas is similar to dplyr, but aggregation resembles neither dplyr nor data.table.
To group, use the groupby() method and pass it the list of columns by which the dataframe should be grouped.
To aggregate, you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions are applied, and the values are the names of the aggregation functions.
Aggregation functions:
- sum() - sum
- min() / max() - minimum and maximum value
- mean() - mean
- median() - median
- count() - count
The reset_index() function in the example below resets the group index that pandas creates by default after aggregation, turning it back into an ordinary column.
The \ symbol at the end of a line lets the statement continue on the next line.
Grouping and aggregation in Python: pandas
# group and aggregate the data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
    reset_index()
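To get output columns with the same names as in the R examples, pandas also supports so-called named aggregation in agg(). The sketch below uses a made-up miniature of the Titanic table; the name tickets and its values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table.
tickets = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                        "Pclass": [1, 1, 3, 3],
                        "Fare": [80.0, 60.0, 8.0, 12.0]})

# Named aggregation: new_column=(source_column, aggregation_function).
summary = (
    tickets.groupby("Pclass")
    .agg(passengers=("PassengerId", "count"),
         avg_price=("Fare", "mean"))
    .reset_index()  # turn the Pclass group index back into a regular column
)
print(summary)
```

Here the chain is wrapped in parentheses instead of ending each line with a backslash; both forms are valid Python.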
Vertical union of tables
An operation that combines two or more tables of the same structure. The data we loaded includes the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.
They are exports from Google Analytics for November and December; in this section we will combine this data into one table.
Vertical union of tables in R: tidyverse, dplyr
In dplyr, you can combine two tables into one with the bind_rows() function, passing the tables as its arguments.
Vertical union of tables in R: dplyr
# Vertical union of tables
## dplyr
bind_rows(ga_nov, ga_dec)
Vertical union of tables in R: data.table
Nothing complicated here either: we use rbind().
Vertical union of tables in R: data.table
## data.table
rbind(ga_nov, ga_dec)
Vertical union of tables in Python: pandas
In pandas, tables are combined with the concat() function, which takes a list of the frames to combine.
Vertical union of tables in Python: pandas
# vertical union of tables
pd.concat([ga_nov, ga_dec])
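One detail worth knowing: by default concat() keeps the original row labels of each frame, so labels can repeat in the result. The sketch below, on made-up data, shows the ignore_index=True option that renumbers the rows.

```python
import pandas as pd

# Hypothetical toy exports for two months.
nov = pd.DataFrame({"date": ["2019-11-01"], "sessions": [12]})
dec = pd.DataFrame({"date": ["2019-12-01", "2019-12-02"], "sessions": [7, 9]})

stacked = pd.concat([nov, dec])                        # keeps labels 0, 0, 1
renumbered = pd.concat([nov, dec], ignore_index=True)  # relabels rows 0, 1, 2

print(list(stacked.index), list(renumbered.index))
```
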
Horizontal join of tables
An operation that adds columns from a second table to the first by key. It is often used to enrich a fact table (for example, a table with sales data) with reference data (for example, the cost of a product).
There are several types of join: inner, left, right, and full (outer).
In the titanic table we loaded earlier there is a column Sex, which holds the passenger's gender code:
1 - female
2 - male
We also created a lookup table for gender. For a more convenient presentation of the passengers' gender, we need to add the gender name from the gender lookup table to the titanic table.
Horizontal join of tables in R: tidyverse, dplyr
dplyr has a whole family of functions for horizontal joins:
inner_join()
left_join()
right_join()
full_join()
semi_join()
nest_join()
anti_join()
The one I use most often in practice is left_join().
The functions listed above take the two tables to join as their first two arguments, and the third argument, by, names the columns to join on.
Horizontal join of tables in R: dplyr
# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))
Horizontal join of tables in R: data.table
In data.table, tables are joined by key with the merge() function.
Arguments of the merge() function in data.table
- x, y - the tables to join
- by - the key column, if it has the same name in both tables
- by.x, by.y - the names of the key columns, if they differ between the tables
- all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), and all.y corresponds to a RIGHT JOIN (keeps all rows of the second table)
Horizontal join of tables in R: data.table
# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)
Horizontal join of tables in Python: pandas
Just as in data.table, tables in pandas are joined with the merge() function.
Arguments of the merge() function in pandas
- how - the join type: left, right, outer, inner
- on - the key column, if it has the same name in both tables
- left_on, right_on - the names of the key columns, if they differ between the tables
Horizontal join of tables in Python: pandas
# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
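A self-contained sketch of the same left join on made-up data. The table passengers and its Name values are hypothetical; the gender lookup table is the one built earlier.

```python
import pandas as pd

# Hypothetical miniature of the passenger table, with Sex stored as a code.
passengers = pd.DataFrame({"Name": ["Anna", "Boris", "Clara"],
                           "Sex": [1, 2, 1]})
gender = pd.DataFrame({"id": [1, 2],
                       "gender": ["female", "male"]})

# A left join keeps every passenger row and adds the matching lookup columns.
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined)
```

Because the key columns have different names, both Sex and id appear in the result; one of them can be removed afterwards with drop() if it is not needed.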
Basic window functions and calculated columns
Window functions are similar in meaning to aggregation functions and are also used often in data analysis. Unlike aggregation functions, however, window functions do not change the number of rows in the resulting dataframe.
In essence, a window function splits the incoming dataframe into parts by some criterion, i.e. by the value of one or several fields, and performs arithmetic on each window. The result is returned on every row, i.e. without changing the total number of rows in the table.
As an example, take the titanic table. We can calculate what share of the total ticket cost within its cabin class each ticket accounts for.
To do that, for each row we need the total cost of all tickets in the cabin class that the ticket on that row belongs to, and then we divide the cost of each ticket by that class total.
Window functions in R: tidyverse, dplyr
To add new columns without collapsing the rows into groups, dplyr provides the mutate() function.
You can solve the task described above by grouping the data by the field Pclass and summing the field Fare into a new column. Then ungroup the table and divide the values of the field Fare by the total obtained in the previous step.
Window functions in R: dplyr
group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)
Window functions in R: data.table
The algorithm stays the same as in dplyr: we need to split the table into windows by the field Pclass, put the group total for each row into a new column, and add a column with the share of each ticket's cost within its group.
To add new columns, data.table has the := operator. Below is an example that solves the task with the data.table package.
Window functions in R: data.table
# Pclass_cost does not exist yet while j is evaluated, so the share is computed from sum(Fare) directly
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]
Window functions in Python: pandas
One way to add a new column in pandas is the assign() function. To sum the ticket cost by cabin class without collapsing the rows into groups, we will use the transform() function.
Below is an example solution that adds the same 2 columns to the titanic table.
Window functions in Python: pandas
titanic.assign(Pclass_cost = titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
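To check that the row count really stays the same, here is the same calculation on a made-up miniature of the table. The name tickets and the fares are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table.
tickets = pd.DataFrame({"Pclass": [1, 1, 3, 3],
                        "Fare": [80.0, 20.0, 6.0, 2.0]})

result = tickets.assign(
    # transform("sum") returns the class total on every row of that class
    Pclass_cost=tickets.groupby("Pclass")["Fare"].transform("sum"),
    # the lambda receives the frame that already contains Pclass_cost
    ticket_fare_rate=lambda d: d["Fare"] / d["Pclass_cost"],
)
print(result)
```

The result has the same 4 rows as the input, with the class total repeated on each row of the class.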
Correspondence table of functions and methods
Below is a table showing how the various data operations map onto each other in the packages we have looked at.

| Description | tidyverse | data.table | pandas |
|---|---|---|---|
| Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv() |
| Creating dataframes | tibble() | data.table() | dict() + from_dict() |
| Selecting columns | select() | argument j, second position in the square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes() |
| Filtering rows | filter() | argument i, first position in the square brackets | list the filter conditions in square brackets / filter() |
| Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg() |
| Vertical union of tables (UNION) | bind_rows() | rbind() | concat() |
| Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge() |
| Basic window functions and adding calculated columns | group_by() + mutate() | argument j with the := operator + argument by | transform() + assign() |
Conclusion
The implementations I have described may not be the most optimal ones, so I will be glad if you correct my mistakes in the comments, or simply add other techniques for working with data in R / Python to what is covered here.
As I wrote above, the purpose of the article was not to push an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.
Poll
Which of the following packages do you use in your work?
You can explain your choice in the comments.
Which data processing package do you use (multiple answers allowed)?
- tidyverse: 45.2% (19 votes)
- data.table: 33.3% (14 votes)
- pandas: 54.8% (23 votes)
42 users voted. 9 users abstained.
Source: www.habr.com