Which language to choose for working with data: R or Python? Both! Migrating from pandas to tidyverse and data.table and back

If you search for R or Python on the Internet, you will find millions of articles and kilometers of discussion about which one is better, faster and more convenient for working with data. Unfortunately, all these articles and disputes are not particularly useful.

The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up what they do not know yet: for those who write in Python, to find out how to do the same thing in R, and vice versa.

In the course of the article we will look at the syntax of the most popular packages in R: the packages included in the tidyverse library and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.

We will go step by step through the whole data analysis path, from loading data to performing analytical window functions, using Python and R.

Contents

This article can be used as a cheat sheet if you have forgotten how to perform some data processing operation in one of the packages under consideration.

  1. Main syntax differences between R and Python
    1.1. Accessing package functions
    1.2. Assignment
    1.3. Indexing
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Installing packages
  4. Loading data
  5. Creating dataframes
  6. Selecting the columns you need
  7. Filtering rows
  8. Grouping and aggregation
  9. Vertical joining of tables (UNION)
  10. Horizontal joining of tables (JOIN)
  11. Basic window functions and calculated columns
  12. Correspondence table of data processing methods in R and Python
  13. Conclusion
  14. A short survey on which package you use

If you are interested in data analysis, you may find my telegram and youtube channels useful. Most of the content is devoted to the R language.

Main syntax differences between R and Python

To make it easier for you to switch from Python to R, or vice versa, I will give a few main points that you need to pay attention to.

Accessing package functions

Once a package is loaded in R, you do not need to specify the package name to access its functions. In most cases specifying it is not customary in R, but it is acceptable. You do not have to import a package at all if you need only one of its functions in your code: simply call the function by giving the package name and the function name. The separator between the package name and the function name in R is a double colon: package_name::function_name().

In Python, on the contrary, it is considered classic to call a package's functions by explicitly specifying its name. When a package is imported, it is usually given a shortened name; for example, pandas usually gets the alias pd. A package function is accessed through a dot: package_name.function_name().

Assignment

In R, it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is used mainly to pass values to function arguments.

In Python, assignment is done only with a single equals sign: obj_name = value.

Indexing

There are also quite significant differences here. In R, indexing starts at one, and the resulting range includes all the elements specified.

In Python, indexing starts at zero, and the selected range does not include the last element specified in the index. So the construct x[i:j] in Python will not include the j-th element.

There are also differences in negative indexing. In R, the notation x[-1] returns all elements of the vector except the first one. In Python, the same notation returns only the last element.
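As a minimal sketch of the Python side of these indexing rules (the list x and its values are invented for the illustration):

```python
# Illustrative list; the values are arbitrary.
x = [10, 20, 30, 40, 50]

first = x[0]           # indexing starts at zero
middle = x[1:3]        # positions 1 and 2; position 3 is not included
last = x[-1]           # a negative index counts from the end
all_but_first = x[1:]  # drops the first element, similar to x[-1] on an R vector

print(first, middle, last, all_but_first)
```

The slice x[1:] is shown only as a rough Python counterpart of R's "exclude this position" notation; Python has no direct equivalent of R's negative indexing.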

Methods and OOP

R implements OOP in its own way; I wrote about this in the article "OOP in the R language (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. Therefore, for Excel users for example, switching to tidyverse will be easier than switching to pandas. Although this may be my subjective opinion.

In short, objects in R do not have methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions that behave differently depending on the class of the object.

Pipelines

Perhaps this name is not entirely correct for pandas, but I will try to explain the meaning.

In order not to save intermediate calculations and not to produce unnecessary objects in the working environment, you can use a kind of pipeline. That is, you pass the result of a calculation from one function to the next without saving intermediate results.

Let's take the following code example, in which we save intermediate calculations in separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)

We performed 3 operations in sequence, and the result of each was saved in a separate object. But in fact we do not need these intermediate objects.

Or even worse, but more familiar to Excel users:

obj <- func3(func2(func1()))

In this case we did not save the intermediate results, but code with nested functions is extremely inconvenient to read.

We will look at several approaches to data processing in R, and they perform similar operations in different ways.

Pipelines in the tidyverse library are implemented by the %>% operator.

obj <- func1() %>% 
            func2() %>%
            func3()

Thus, we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). At the end, we write the result of all the calculations into the object obj with obj <-.

All of the above is illustrated better than words by this meme:
In data.table, chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each pair of square brackets you can use the result of the previous operation.

In pandas, such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is saved in the object obj.
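To make the placeholder chain above more concrete, here is a small sketch with real pandas methods. The dataframe and its values are invented for the illustration and are not taken from the article's datasets.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 3, 7, 20]})

# Each step returns a new object, so the calls can be chained with dots.
obj = (
    df[df["sessions"] >= 5]                    # keep rows with at least 5 sessions
    .sort_values("sessions", ascending=False)  # largest values first
    .reset_index(drop=True)                    # renumber the rows from zero
)
print(obj)
```

Wrapping the chain in parentheses is one common way to split it over several lines without continuation characters.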

Data structures

Data structures in R and Python are similar, but have different names.

Description                         Name in R                        Name in Python/pandas
Tabular structure                   data.frame, data.table, tibble   DataFrame
One-dimensional list of values      Vector                           Series in pandas or list in pure Python
Multi-level non-tabular structure   List                             Dictionary (dict)

We will look at some other features and syntax differences below.

A few words about the packages we will use

First, I will tell you a little about the packages you will get acquainted with in this article.

tidyverse

Official website: tidyverse.org
The tidyverse library was written by Hadley Wickham, Chief Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are in the top 10 downloads from the CRAN repository.

The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific task. For example, dplyr is designed for data manipulation, tidyr brings data into tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage of tidyverse is its simple and easy-to-read syntax, which in many ways resembles the SQL query language.

data.table

Official website: r-datatable.com

The author of data.table is Matt Dowle of H2O.ai.

The first release of the library took place in 2006.

The package syntax is not as convenient as in tidyverse and is more reminiscent of classic data frames in R, but at the same time it is significantly extended in functionality.

All table manipulations in this package are described inside square brackets, and if you translate the data.table syntax into SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed of processing large amounts of data.

pandas

Official website: pandas.pydata.org

The name of the library comes from the econometric term "panel data", used to describe multidimensional structured data sets.

The author of pandas is the American Wes McKinney.

When it comes to data analysis in Python, pandas has no equal. It is a very multifunctional, high-level package that lets you perform any manipulation on data, from loading it from any source to visualizing it.

Installing additional packages

The packages discussed in this article are not included in the base R and Python distributions. There is a small caveat, though: if you installed the Anaconda distribution, you do not need to install pandas separately.

Installing packages in R

If you have opened the RStudio development environment at least once, you probably already know how to install the required package in R. To install packages, use the standard install.packages() command, running it directly in R itself.

# installing packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, the packages need to be attached, for which the library() command is used in most cases.

# attaching (importing) packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

So, if you have pure Python installed, pandas needs to be installed manually. Open a command line or a terminal, depending on your operating system, and enter the following command.

pip install pandas

Then we go back to Python and import the installed package with the import command.

import pandas as pd

Loading data

Data extraction is one of the most important steps in data analysis. Both Python and R, if desired, give you ample opportunities to obtain data from any source: local files, files from the Internet, websites, all kinds of databases.

Throughout the article we will use several datasets:

  1. Two exports from Google Analytics.
  2. The Titanic passenger dataset.

All the data is on my GitHub in the form of csv and tsv files. That is where we will request it from.

Loading data in R: tidyverse, vroom, readr

There are two packages for loading data in the tidyverse library: vroom and readr. vroom is the more modern of the two, but the packages may be merged in the future.

A quote from the official vroom documentation:

vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.

In this article we will look at both packages for loading data:

Loading data in R: the vroom package

# install.packages("vroom")
library(vroom)

# Reading data
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in R: readr

# install.packages("readr")
library(readr)

# Reading data
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the vroom package, regardless of whether the data is in csv or tsv format, loading is done by the function of the same name, vroom(). In the readr package we use a separate function for each format: read_tsv() and read_csv().

Loading data in R: data.table

In data.table, the fread() function is used to load data.

Loading data in R: the data.table package

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in Python: pandas

Compared with the R packages, the one closest in syntax to pandas in this case is readr, because pandas can request data from anywhere and has a whole family of read_*() functions.

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

And there are many other functions designed to read data in various formats. But for our purposes read_table(), or read_csv() with the sep argument to specify the column delimiter, is enough.

Loading data in Python: pandas

import pandas as pd

# the Google Analytics exports are tab-separated, so the delimiter is "\t"
ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep = "\t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep = "\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
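If you want to try the sep argument without network access, the following sketch reads tab-separated text from memory. The content of tsv_text is invented and only mimics the shape of a Google Analytics export; the name ga_sample is hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a downloaded file.
tsv_text = (
    "date\tsource\tsessions\n"
    "2019-11-01\tgoogle\t12\n"
    "2019-11-01\tbing\t3\n"
)

# sep="\t" tells read_csv that the columns are separated by tabs.
ga_sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(ga_sample)
```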

Creating dataframes

The titanic table that we loaded has a Sex field, which stores the passenger's gender identifier.

But for a more convenient presentation of the data by passenger gender, you should use the gender name rather than the gender code.

To do this, we will create a small reference table with only 2 columns (code and gender name) and, accordingly, 2 rows.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the required dataframe using the tibble() function.

Creating a dataframe in R: dplyr

## dplyr
### create the reference table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

## data.table
### create the reference table
gender <- data.table(id = c(1, 2),
                    gender = c("female", "male"))

Creating a dataframe in Python: pandas

In pandas, creating a dataframe is done in several stages: first we create a dictionary, and then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# create a dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)
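As a side note to the two-step approach above, the plain DataFrame constructor also accepts a dictionary of columns. The sketch below compares the two results; the names gender_a and gender_b are used only for this comparison.

```python
import pandas as pd

gender_dict = {"id": [1, 2], "gender": ["female", "male"]}

# The constructor accepts a dict of columns directly.
gender_a = pd.DataFrame(gender_dict)
# from_dict is the variant used in the example above.
gender_b = pd.DataFrame.from_dict(gender_dict)

# equals() compares shape, column names and values.
print(gender_a.equals(gender_b))
```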

Selecting columns

The tables you work with may contain dozens or even hundreds of columns of data. But for analysis, as a rule, you do not need all the columns available in the source table.

Therefore, one of the first operations you will perform on the source table is to clean it of unnecessary information and free up the memory that this information occupies.

Selecting columns in R: tidyverse, dplyr

The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL, you will quickly master this package.

To select columns, use the select() function.

Below are code examples that select columns in the following ways:

  • By listing the names of the required columns
  • By matching column names against regular expressions
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Selecting the required columns
## dplyr
### select by column names
select(ga_nov, date, source, sessions)
### exclude by column names
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: only integer columns
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations are performed slightly differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting and transforming columns
by - grouping the data

Selecting columns in R: data.table

## data.table
### select by column names
ga_nov[ , .(date, source, sessions) ]
### exclude by column names
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]

The .SD variable lets you refer to all columns, and .SDcols filters the required columns using regular expressions or other functions that filter the names of the columns you need.

Selecting columns in Python: pandas

To select columns by name in pandas, it is enough to pass a list of their names. To select or exclude columns by name using regular expressions, use the drop() and filter() functions with the argument axis=1, which indicates that columns rather than rows should be processed.

To select fields by data type, use the select_dtypes() function, and pass to its include or exclude argument a list of the data types corresponding to the fields you need to select.

Selecting columns in Python: pandas

# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])
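The calls above can be tried on a small in-memory table. The dataframe ga below is a made-up miniature of the Google Analytics export: its column names follow the article, but the values are invented.

```python
import pandas as pd

# Hypothetical miniature of the Google Analytics export.
ga = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "source": ["google", "bing"],
    "medium": ["organic", "cpc"],
    "sessions": [12, 3],
    "bounces": [4, 1],
})

by_name = ga[["date", "source", "sessions"]]      # list the columns to keep
dropped = ga.drop(["medium", "bounces"], axis=1)  # list the columns to remove
ends_in_s = ga.filter(regex="s$", axis=1)         # names that end in "s"
numeric = ga.select_dtypes(include=["number"])    # numeric columns only

print(list(ends_in_s.columns), list(numeric.columns))
```

None of these calls modifies ga itself; each returns a new dataframe, so the result has to be assigned to a name if you want to keep it.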

Filtering rows

For example, the source table may contain several years of data, but you only need to analyze the last month. Again, extra rows will slow down data processing and clog up the PC's memory.

Filtering rows in R: tidyverse, dplyr

In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, followed by the filtering conditions.

When writing logical expressions to filter a table, in this case specify the column names without quotes and without the table name.

When using several logical expressions for filtering, use the following operators:

  • & or a comma - logical AND
  • | - logical OR

Filtering rows in R: dplyr

# filtering rows
## dplyr
### filter rows by a single condition
filter(ga_nov, source == "google")
### filter by two conditions joined by logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined by logical OR
filter(ga_nov, source == "google" | sessions >= 10)

Filtering rows in R: data.table

As I already wrote above, in data.table the data transformation syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting and transforming columns
by - grouping the data

The i argument, which comes first inside the square brackets, is used to filter rows.

Columns are referenced in logical expressions without quotes and without specifying the table name.

Logical expressions are combined in the same way as in dplyr, with the & and | operators.

Filtering rows in R: data.table

## data.table
### filter rows by a single condition
ga_nov[source == "google"]
### filter by two conditions joined by logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined by logical OR
ga_nov[source == "google" | sessions >= 10]

Filtering rows in Python: pandas

Filtering by rows in pandas is similar to filtering in data.table, and is done inside square brackets.

In this case, columns must be referenced together with the dataframe name; the column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).

If you need to filter a dataframe by several conditions, each condition must be enclosed in parentheses. Logical conditions are combined with the & and | operators.

Filtering rows in Python: pandas

# Filtering table rows
### filter rows by a single condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined by logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined by logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
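The same filters can be checked on a small invented table. The sketch also shows the query() method, which is not used elsewhere in this article; it is included only as an alternative way to write conditions without repeating the dataframe name.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
ga = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 30, 7, 2]})

# Each condition is wrapped in parentheses before combining with & or |.
both = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]
either = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

# query() takes the condition as a string and resolves column names itself.
via_query = ga.query('source == "google" and sessions >= 10')

print(len(both), len(either), len(via_query))
```

The parentheses matter because & and | bind more tightly than comparison operators in Python, so an unparenthesized condition is parsed differently from what is intended.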

Grouping and aggregating data

One of the most frequently used operations in data analysis is grouping and aggregation.

The syntax for these operations differs across all the packages we are reviewing.

In this case we will take the titanic dataframe as an example and calculate the number and the average cost of tickets depending on the cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

In dplyr, the group_by() function is used for grouping and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go into that jungle.

Basic aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • length() - count

Grouping and aggregation in R: dplyr

## dplyr
### grouping and aggregating rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

We passed the titanic table as the first argument to group_by() and specified the Pclass field, by which we group the table. The result of this operation is passed by the %>% operator as the first argument to summarise(), where we add 2 more fields: passangers and avg_price. In the first, using the length() function, we counted the number of tickets, and in the second, using the mean() function, we obtained the average ticket price.

Grouping and aggregating data in R: data.table

In data.table, the j argument, which comes second inside the square brackets, is used for aggregation, and by or keyby, which come third, are used for grouping.

The list of aggregation functions in this case is identical to the one described for dplyr, since these are functions from base R.

Grouping and aggregation in R: data.table

## data.table
### grouping and aggregating rows
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and aggregating data in Python: pandas

Grouping in pandas is similar to dplyr, but aggregation is not like either dplyr or data.table.

To group, use the groupby() method, to which you pass a list of the columns by which the dataframe will be grouped.

For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions will be applied, and the values are the names of the aggregation functions.

Aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • count() - count

The reset_index() function in the example below is used to reset the nested indexes that pandas sets by default after aggregating data.

The \ symbol lets you continue the expression on the next line.

Grouping and aggregation in Python: pandas

# grouping and aggregating data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
        reset_index()
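The sketch below repeats the aggregation on a few invented rows, so the result is easy to check by hand. It uses pandas named aggregation (available since pandas 0.25) instead of the dictionary form, which is an assumption about your pandas version and not the approach used in the article; titanic_sample and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows that mimic the Titanic columns used above.
titanic_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 12.0],
})

# Named aggregation: new_column=(source_column, function_name).
summary = (
    titanic_sample.groupby("Pclass")
    .agg(passengers=("PassengerId", "count"), avg_price=("Fare", "mean"))
    .reset_index()
)
print(summary)
```

Compared with the dictionary form, named aggregation lets you choose the output column names in the same call, which brings the result closer to the dplyr and data.table versions.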

Vertical joining of tables

An operation in which two or more tables of the same structure are combined. The data we loaded contains the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.

These are exports from Google Analytics for November and December; in this section we will combine this data into one table.

Vertical joining of tables in R: tidyverse, dplyr

In dplyr you can combine 2 tables into one using the bind_rows() function, passing the tables as its arguments.

Vertical joining of tables in R: dplyr

# Vertical joining of tables
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical joining of tables in R: data.table

Again nothing complicated: we use rbind().

Vertical joining of tables in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical joining of tables in Python: pandas

In pandas, the concat() function is used to join tables; you pass it a list of the dataframes to combine.

Vertical joining of tables in Python: pandas

# vertical joining of tables
pd.concat([ga_nov, ga_dec])
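One detail worth knowing about concat() is that it keeps the original row labels of each table by default. The sketch below uses two invented tables (nov and dec are hypothetical stand-ins for ga_nov and ga_dec) and the ignore_index argument to renumber the rows.

```python
import pandas as pd

# Hypothetical stand-ins for the November and December exports.
nov = pd.DataFrame({"source": ["google", "bing"], "sessions": [12, 3]})
dec = pd.DataFrame({"source": ["google", "direct"], "sessions": [15, 4]})

# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, 1.
combined = pd.concat([nov, dec], ignore_index=True)
print(combined)
```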

Horizontal joining of tables

An operation in which columns from a second table are added to the first table by key. It is often used to enrich a fact table (for example, a table with sales data) with some reference data (for example, product cost).

There are several types of joins: inner, left, right and full (outer).

In the previously loaded titanic table we have a Sex column, which corresponds to the passenger's gender code:

1 - female
2 - male

We also created a reference table of genders. For a more convenient presentation of the data on passengers' gender, we need to add the gender name from the gender reference table to the titanic table.

Horizontal joining of tables in R: tidyverse, dplyr

In dplyr there is a whole family of functions for horizontal joining:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The one I use most often in my practice is left_join().

As their first two arguments, the functions listed above take the two tables to join, and in the third argument, by, you specify the columns to join on.

Horizontal joining of tables in R: dplyr

# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontal joining of tables in R: data.table

In data.table, you join tables by key using the merge() function.

Arguments of the merge() function in data.table

  • x, y - the tables to join
  • by - the key column to join on, if it has the same name in both tables
  • by.x, by.y - the names of the columns to join on, if they have different names in the two tables
  • all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), all.y corresponds to a RIGHT JOIN (keeps all rows of the second table)

Horizontal joining of tables in R: data.table

# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)

Horizontal joining of tables in Python: pandas

Just as in data.table, in pandas the merge() function is used to join tables.

Arguments of the merge() function in pandas

  • how - the join type: left, right, outer, inner
  • on - the key column, if it has the same name in both tables
  • left_on, right_on - the names of the key columns, if they have different names in the two tables

Horizontal joining of tables in Python: pandas

# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
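To see what a left join does with keys that have no match, here is a sketch on invented data. The passengers table is hypothetical, and the code 3 is deliberately absent from the reference table.

```python
import pandas as pd

# Hypothetical fact table; Sex code 3 has no entry in the reference table.
passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 3]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# how="left" keeps every row of the left table, matched or not.
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined)
```

All three passengers stay in the result; the unmatched row gets missing values in the columns that come from the reference table.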

Basic window functions and calculated columns

Window functions are similar in meaning to aggregation functions and are also often used in data analysis. But unlike aggregation functions, window functions do not change the number of rows in the resulting dataframe.

Essentially, when we use a window function we split the incoming dataframe into partitions by some criterion, i.e. by the value of one field or of several fields, and perform arithmetic operations on each window. The result of these operations is returned for every row, i.e. without changing the total number of rows in the table.

For example, take the titanic table. We can calculate what percentage each ticket's fare makes up within its cabin class.

To do this, for every row we need the total fare of the cabin class that the row's ticket belongs to, and then we divide each ticket's fare by the total fare of all tickets in the same cabin class.
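
Before turning to the packages, the two steps just described can be sketched in plain Python without any library. The rows below are hypothetical, and the column names Pclass_cost and ticket_fare_rate are simply the ones used in the package examples in this section.

```python
from collections import defaultdict

# Hypothetical rows standing in for the titanic table
rows = [
    {"Pclass": 1, "Fare": 30.0},
    {"Pclass": 1, "Fare": 70.0},
    {"Pclass": 3, "Fare": 10.0},
    {"Pclass": 3, "Fare": 10.0},
]

# Step 1: total fare per window (one window per cabin class)
class_total = defaultdict(float)
for row in rows:
    class_total[row["Pclass"]] += row["Fare"]

# Step 2: write the window total and the ticket's share back to every row
windowed = [
    {**row,
     "Pclass_cost": class_total[row["Pclass"]],
     "ticket_fare_rate": row["Fare"] / class_total[row["Pclass"]]}
    for row in rows
]

for row in windowed:
    print(row)
```

The output has exactly as many rows as the input, which is what distinguishes a window function from an aggregation that would return one row per class.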

Window functions in R: tidyverse, dplyr

To add new columns without collapsing rows into groups, dplyr provides the mutate() function.

You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field into a new column. Then ungroup the table and divide the Fare values by the result of the previous step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm is the same as in dplyr: we need to split the table into windows by the Pclass field. In a new column we output the total for the group that each row belongs to, and we add a column in which we calculate each ticket's share of the fare within its group.

To add new columns, data.table provides the := operator. Below is an example of solving the problem with the data.table package.

Window functions in R: data.table

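# Pclass_cost is not yet available inside the same j expression,
# so the class total sum(Fare) is computed again for the share
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]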

Window functions in Python: pandas

One way to add a new column in pandas is the assign() function. To sum the ticket fares by cabin class without collapsing the rows into groups, we will use the transform() function.

Below is an example solution in which we add the same 2 columns to the titanic table.

Window functions in Python: pandas

# transform('sum') returns the class total aligned to every original row
titanic.assign(Pclass_cost      = titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
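
The same call can be tried on a small self-contained example. The fares below are made up for illustration and are not taken from the real titanic dataset; the sketch also contrasts agg(), which collapses rows, with transform(), which does not.

```python
import pandas as pd

# Hypothetical toy data in place of the real titanic table
titanic = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3],
                        "Fare": [30.0, 70.0, 10.0, 10.0, 20.0]})

# Aggregation: one row per cabin class
per_class = titanic.groupby("Pclass")["Fare"].agg("sum")

# Window function: the class total is repeated on every original row
result = titanic.assign(
    Pclass_cost=titanic.groupby("Pclass")["Fare"].transform("sum"),
    ticket_fare_rate=lambda df: df["Fare"] / df["Pclass_cost"],
)

print(len(per_class), len(result))
print(result)
```

The lambda is needed for ticket_fare_rate because Pclass_cost does not exist on titanic itself: assign() passes the callable a frame that already contains the columns created by the earlier keyword arguments. assign() returns a new DataFrame and leaves titanic unchanged.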

Correspondence table of functions and methods

Below is a table that maps the ways of performing the various data operations across the packages we have covered.

| Description | tidyverse | data.table | pandas |
|---|---|---|---|
| Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv() |
| Creating dataframes | tibble() | data.table() | dict() + from_dict() |
| Selecting columns | select() | argument j, second position in the square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes() |
| Filtering rows | filter() | argument i, first position in the square brackets | list the filter conditions in square brackets / filter() |
| Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg() |
| Vertical union of tables (UNION) | bind_rows() | rbind() | concat() |
| Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge() |
| Basic window functions and adding calculated columns | group_by() + mutate() | argument j using the := operator + argument by | transform() + assign() |

Conclusion

The article may not describe the most optimal ways of implementing data processing, so I'd be glad if you correct my mistakes in the comments, or simply supplement the information given here with other techniques for working with data in R / Python.

As I wrote above, the goal of the article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.

If you liked the article, I'd be glad to have new subscribers on my youtube and telegram channels.

Poll

Which of the following packages do you use in your work?

In the comments you can write the reason for your choice.

Which data-processing package do you use (you can choose several options)

  • tidyverse: 45.2% (19 votes)
  • data.table: 33.3% (14 votes)
  • pandas: 54.8% (23 votes)

42 users voted. 9 users abstained.

Source: www.habr.com
