Which language should you choose for working with data: R or Python? Both! Migrating from pandas to tidyverse and data.table and back

If you search for R or Python online, you will find millions of articles and kilometers of discussion about which one is better, faster, and more convenient for working with data. Unfortunately, all of these articles and arguments are of little practical use.

The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up something they do not know yet. Those who write in Python will learn how to do the same thing in R, and vice versa.

Throughout the article we will look at the syntax of the most popular R packages: the packages included in the tidyverse library, and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.

We will go step by step through the whole path of data analysis, from loading the data to applying analytic window functions, using both Python and R.

Contents

This article can be used as a cheat sheet if you have forgotten how to perform some data processing operation in one of the packages covered here.

  1. Main syntax differences between R and Python
    1.1. Accessing package functions
    1.2. Assignment
    1.3. Indexing
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Installing packages
  4. Loading data
  5. Creating dataframes
  6. Selecting the columns you need
  7. Filtering rows
  8. Grouping and aggregation
  9. Vertical union of tables (UNION)
  10. Horizontal join of tables (JOIN)
  11. Basic window functions and calculated columns
  12. Correspondence table between data processing methods in R and Python
  13. Conclusion
  14. A short survey about which package you use

If you are interested in data analysis, you may find my telegram and youtube channels useful. Most of the content is devoted to the R language.

Main syntax differences between R and Python

To make it easier for you to switch from Python to R, or vice versa, here are a few key points to keep in mind.

Accessing package functions

Once a package has been loaded in R, you do not need to specify the package name to access its functions. In most cases it is not customary in R to specify the package name, but it is allowed. You can also skip loading the package altogether if you only need one of its functions, and simply call that function by giving both the package name and the function name. The separator between the package and function names in R is a double colon: package_name::function_name().

In Python, by contrast, it is considered standard practice to call package functions by explicitly specifying the package name. When a package is imported it is usually given a shortened name; for example, pandas is usually imported under the alias pd. A package function is accessed through a dot: package_name.function_name().

Assignment

In R, it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is used mainly to pass values to function arguments.

In Python, assignment is done only with a single equals sign: obj_name = value.

Indexing

There are also significant differences here. In R, indexing starts at one, and a range includes all of the elements specified, including the last one.

In Python, indexing starts at zero, and a selected range does not include the last element specified in the index. So the construct x[i:j] in Python will not include element j.

Negative indexing differs as well: in R, the notation x[-1] returns all elements of the vector except the first one. In Python, the same notation returns only the last element.
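The Python side of these indexing rules can be checked with a short sketch. The list x holds made-up values used only for illustration.

```python
# Illustrative data (not from the article)
x = [10, 20, 30, 40, 50]

first = x[0]      # Python indexing starts at 0 (in R the first element is x[1])
middle = x[1:3]   # the end index is excluded: elements at positions 1 and 2 only
last = x[-1]      # a negative index counts from the end: the last element

print(first, middle, last)
```

Note that the slice x[1:3] has two elements, not three, because position 3 is left out.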

Methods and OOP

R implements OOP in its own way; I wrote about this in the article "OOP in the R language (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. That is why, for Excel users for example, moving to tidyverse will be easier than moving to pandas. Although this may just be my subjective opinion.

In short, objects in R do not have methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions that process objects differently depending on the object's class.
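For readers coming from Python, a rough analogue of a generic function that dispatches on the class of its argument is functools.singledispatch from the standard library. This is only a sketch of the concept, not a description of how R implements S3; the function name describe is hypothetical.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    # Fallback, similar in spirit to a default method of a generic function
    return "some object"


@describe.register
def _(obj: int):
    # Chosen when the argument is an int
    return "an integer"


@describe.register
def _(obj: list):
    # Chosen when the argument is a list
    return f"a list of {len(obj)} elements"


print(describe(3), "|", describe([1, 2]), "|", describe(2.5))
```

The caller always writes describe(obj), and the implementation is picked by the type of obj, which is close to how a generic function such as print() or summary() feels in R.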

Pipelines

Perhaps this name is not entirely accurate for pandas, but I will try to explain the idea.

To avoid saving intermediate calculations and cluttering the working environment with unnecessary objects, you can use a kind of pipeline, i.e. pass the result of one function straight to the next one without saving intermediate results.

Consider the following code example, in which we store intermediate calculations in separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)

We performed 3 operations in sequence, and the result of each one was saved in a separate object. But in fact we do not need these intermediate objects.

Or, even worse, but more familiar to Excel users:

obj  <- func3(func2(func1()))

In this case we did not save the intermediate results, but code with nested function calls is extremely hard to read.

We will look at several approaches to data processing in R, and they perform similar operations in different ways.

Pipelines in the tidyverse library are implemented with the %>% operator.

obj <- func1() %>% 
            func2() %>%
            func3()

So we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). Finally, the result of the whole chain is assigned to the object obj with <-.

In data.table, chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each pair of square brackets you can use the result of the previous operation.

In pandas, such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is saved in the object obj.
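A minimal runnable sketch of such a chain, using real pandas methods in place of the placeholder fun1(), fun2(), fun3(). The table contents are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({"source": ["google", "bing", "google"],
                   "sessions": [10, 5, 7]})

# Each step returns a new object, so the calls can be chained with dots.
# Wrapping the chain in parentheses lets it span several lines.
obj = (
    df[df["sessions"] > 5]      # keep rows with more than 5 sessions
    .sort_values("sessions")    # sort the remaining rows
    .reset_index(drop=True)     # renumber the rows from 0
)
print(obj)
```

No intermediate objects are created: only df and the final obj exist in the working environment.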

Data structures

Data structures in R and Python are similar, but they have different names.

Description | Name in R | Name in Python/pandas
Tabular structure | data.frame, data.table, tibble | DataFrame
One-dimensional list of values | Vector | Series in pandas, or list in pure Python
Multi-level non-tabular structure | List | Dictionary (dict)

We will look at some other features and syntax differences below.

A few words about the packages we will use

First, a little about the packages you will get to know in this article.

tidyverse

Official website: tidyverse.org
The tidyverse library was written by Hadley Wickham, Senior Research Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.

The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at a specific task. For example, dplyr is designed for data manipulation, tidyr for bringing data into tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage of tidyverse is its simple, easy-to-read syntax, which in many ways resembles the SQL query language.

data.table

Official website: r-datatable.com

The author of data.table is Matt Dowle of H2O.ai.

The first release of the library took place in 2006.

The package syntax is not as convenient as in tidyverse and is more reminiscent of classic dataframes in R, but with significantly extended functionality.

All manipulations with a table in this package are described inside square brackets, and if you translate the data.table syntax into SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed with which it processes large amounts of data.

pandas

Official website: pandas.pydata.org

The name of the library comes from the econometric term "panel data", used to describe multidimensional structured data sets.

The author of pandas is the American Wes McKinney.

When it comes to data analysis in Python, pandas has no equal. It is a highly functional, high-level package that lets you perform any manipulation with data, from loading it from any source to visualizing it.

Installing additional packages

The packages discussed in this article are not included in the base distributions of R and Python. There is one small caveat: if you installed the Anaconda distribution, you do not need to install pandas separately.

Installing packages in R

If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard command install.packages(), running it directly in R.

# install packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, packages need to be attached, which is usually done with the library() command.

# attach (import) packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

So, if you have pure Python installed, you need to install pandas manually. Open a command line or terminal, depending on your operating system, and enter the following command.

pip install pandas

Then we return to Python and import the installed package with the import command.

import pandas as pd

Loading data

Data extraction is one of the most important steps in data analysis. Both Python and R give you extensive capabilities for obtaining data from any source: local files, files on the Internet, websites, and all kinds of databases.

Throughout the article we will use several datasets:

  1. Two exports from Google Analytics.
  2. The Titanic passenger dataset.

All of the data is in my GitHub repository as csv and tsv files, which is where we will request it from.

Loading data in R: tidyverse, vroom, readr

For loading data, the tidyverse library has two packages: vroom and readr. vroom is the more modern one, but the two packages may be merged in the future.

A quote from the official vroom documentation.

vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but we will likely unite them in the future. One disadvantage of vroom's lazy reading is that certain data problems can't be reported up front, so how best to unify them requires some thought.

In this article we will look at both packages for loading data:

Loading data in R: the vroom package

# install.packages("vroom")
library(vroom)

# Read the data
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in R: readr

# install.packages("readr")
library(readr)

# Read the data
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the vroom package, regardless of whether the data is in csv or tsv format, loading is done with the function of the same name, vroom(). In the readr package we use a separate function for each format: read_tsv() and read_csv().

Loading data in R: data.table

In data.table, the function for loading data is fread().

Loading data in R: the data.table package

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in Python: pandas

Compared with the R packages, the syntax closest to pandas here is readr, because pandas can request data from almost anywhere and has a whole family of read_*() functions.

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

And many other functions designed to read data from various formats. For our purposes, read_table() or read_csv() is enough, using the sep argument to specify the column separator.

Loading data in Python: pandas

import pandas as pd

# sep="\t" marks the Google Analytics exports as tab-separated
ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep="\t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep="\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
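The effect of the sep argument can be tried without a network connection. The sketch below reads a small made-up tab-separated text from memory; the column names only mimic a Google Analytics export and the values are invented.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a downloaded file
tsv_text = (
    "date\tsource\tsessions\n"
    "2019-11-01\tgoogle\t10\n"
    "2019-11-02\tbing\t5\n"
)

# sep="\t" tells read_csv that the columns are separated by tabs
ga_sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(ga_sample)
```

With the default comma separator the same text would be read as a single column, so specifying the separator matters for tsv files.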

Creating dataframes

The titanic table that we loaded has a Sex field, which stores the passenger's gender identifier.

For a more convenient presentation of the data by passenger gender, it is better to use the gender name rather than the code.

To do this, we will create a small lookup table with only 2 columns (the code and the gender name) and, accordingly, 2 rows.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the required dataframe using the tibble() function.

Creating a dataframe in R: dplyr

## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                     gender = c("female", "male"))

Creating a dataframe in Python: pandas

In pandas, a dataframe is created in several steps: first we create a dictionary, then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# create a dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)

Selecting columns

The tables you work with may contain dozens or even hundreds of columns. But as a rule, you do not need every column of the source table for your analysis.

Therefore, one of the first operations you perform on the source table is to strip out unneeded information and free the memory it occupies.

Selecting columns in R: tidyverse, dplyr

The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL, you will master this package quickly.

To select columns, use the select() function.

Below are code examples that select columns in the following ways:

  • By listing the names of the required columns
  • By matching column names with regular expressions
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Selecting the required columns
## dplyr
### select by column name
select(ga_nov, date, source, sessions)
### exclude by column name
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: integer columns only
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations are performed a little differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping the data

Selecting columns in R: data.table

## data.table
### select by column name
ga_nov[ , .(date, source, sessions) ]
### exclude by column name
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]

The .SD variable gives you access to all columns, and .SDcols filters the required columns using regular expressions or other functions applied to the column names.

Selecting columns in Python: pandas

To select columns by name in pandas, it is enough to pass a list of their names. To exclude columns by name use drop(), and to select columns by regular expression use filter(); in both cases pass the argument axis=1 to indicate that columns rather than rows should be processed.

To select fields by data type, use the select_dtypes() function, passing to its include or exclude arguments a list of the data types of the fields you want to select.

Selecting columns in Python: pandas

# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])
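The same selections can be tried on a small in-memory table. This is a sketch with made-up values; the column names only mimic the Google Analytics export used in the article.

```python
import pandas as pd

# Toy table with invented values
ga = pd.DataFrame({"date": ["2019-11-01", "2019-11-02"],
                   "source": ["google", "bing"],
                   "medium": ["organic", "cpc"],
                   "sessions": [10, 5],
                   "bounces": [3, 1]})

by_name = ga[["date", "source", "sessions"]]      # explicit list of names
dropped = ga.drop(columns=["medium", "bounces"])  # same effect as drop([...], axis=1)
ends_with_s = ga.filter(regex="s$", axis=1)       # names ending in "s"
numeric = ga.select_dtypes(include=["number"])    # numeric columns only

print(list(dropped.columns))
print(list(ends_with_s.columns))
print(list(numeric.columns))
```

All four calls return new dataframes and leave ga itself unchanged.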

Filtering rows

For example, the source table may contain several years of data, while you only need to analyze the last month. Here too, the extra rows slow down processing and clog up the PC's memory.

Filtering rows in R: tidyverse, dplyr

In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, followed by the filtering conditions.

When writing logical expressions to filter a table, specify column names without quotes and without the table name.

When using several logical expressions for filtering, combine them with the following operators:

  • & or a comma - logical AND
  • | - logical OR

Filtering rows in R: dplyr

# filtering rows
## dplyr
### filter rows by a single condition
filter(ga_nov, source == "google")
### filter by two conditions joined with logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined with logical OR
filter(ga_nov, source == "google" | sessions >= 10)

Filtering rows in R: data.table

As I wrote above, in data.table the data transformation syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - grouping the data

The argument i, which has the first position in the square brackets, is used to filter rows.

Columns are referenced in logical expressions without quotes and without the table name.

Logical expressions are combined in the same way as in dplyr, with the & and | operators.

Filtering rows in R: data.table

## data.table
### filter rows by a single condition
ga_nov[source == "google"]
### filter by two conditions joined with logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined with logical OR
ga_nov[source == "google" | sessions >= 10]

Filtering rows in Python: pandas

Filtering by rows in pandas is similar to filtering in data.table, and is done in square brackets.

In this case, columns must be accessed through the name of the dataframe; the column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).

If you need to filter a dataframe by several conditions, each condition must be placed in parentheses. The conditions are connected to each other with the & and | operators.

Filtering rows in Python: pandas

# Filtering table rows
### filter rows by a single condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined with logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined with logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
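The reason each condition needs its own parentheses is operator precedence: in Python, & and | bind more tightly than == and >=. The sketch below shows the pattern on made-up data, together with DataFrame.query(), which accepts the conditions as a string and may feel closer to the dplyr style.

```python
import pandas as pd

# Toy data, invented for illustration
ga = pd.DataFrame({"source": ["google", "bing", "google", "yandex"],
                   "sessions": [12, 30, 4, 8]})

# Each comparison is wrapped in parentheses before being combined
both = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]
either = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

# The same AND filter written as a query string
both_q = ga.query('source == "google" and sessions >= 10')

print(len(both), len(either), len(both_q))
```

Inside query() the column names are written without quotes and without the dataframe name, so no extra parentheses are needed.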

Grouping and aggregating data

One of the most frequently used operations in data analysis is grouping and aggregation.

The syntax for these operations differs across all of the packages we are reviewing.

In this case we will take the titanic dataframe as an example, and calculate the number and the average price of tickets depending on the cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

In dplyr, the group_by() function is used for grouping, and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go into that jungle.

Basic aggregation functions:

  • sum() - summation
  • min() / max() - minimum and maximum value
  • mean() - average
  • median() - median
  • length() - count

Grouping and aggregation in R: dplyr

## dplyr
### grouping and aggregating rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

We passed the titanic table to group_by() as the first argument, and then specified the Pclass field by which we group the table. The result of this operation is passed with the %>% operator as the first argument to summarise(), which adds 2 more fields: passangers and avg_price. In the first, the length() function counts the number of tickets, and in the second the mean() function returns the average ticket price.

Grouping and aggregating data in R: data.table

In data.table, aggregation is done in the j argument, which has the second position in the square brackets, and grouping is done with by or keyby, which have the third position.

The list of aggregation functions in this case is identical to the one described for dplyr, since these are functions from base R.

Grouping and aggregation in R: data.table

## data.table
### grouping and aggregating rows
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and aggregating data in Python: pandas

Grouping in pandas is similar to dplyr, but aggregation resembles neither dplyr nor data.table.

To group, use the groupby() method, passing it a list of the columns by which the dataframe will be grouped.

For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions are applied, and the values are the names of the aggregation functions.

Aggregation functions:

  • sum() - summation
  • min() / max() - minimum and maximum value
  • mean() - average
  • median() - median
  • count() - count

The reset_index() function in the example below resets the nested index that pandas sets by default after data aggregation.

The \ symbol lets you continue the statement on the next line.

Grouping and aggregation in Python: pandas

# grouping and aggregating data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
        reset_index()
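If you want the output columns to get their own names, as in the dplyr and data.table examples, pandas also supports named aggregation in agg(). A sketch on made-up rows that reuse the Titanic column names; the output names passengers and avg_price are chosen here for illustration.

```python
import pandas as pd

# Invented rows using the Titanic column names from the article
titanic_sample = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                               "Pclass": [1, 1, 3, 3],
                               "Fare": [80.0, 60.0, 8.0, 12.0]})

summary = (
    titanic_sample
    .groupby("Pclass")
    .agg(passengers=("PassengerId", "count"),  # new_name=(column, function)
         avg_price=("Fare", "mean"))
    .reset_index()                             # turn the Pclass index back into a column
)
print(summary)
```

Wrapping the chain in parentheses is an alternative to ending each line with a backslash.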

Vertical join of tables

An operation in which two or more tables with the same structure are combined. The data we loaded contains the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.

These are exports from Google Analytics for November and December; in this section we will combine this data into a single table.

Vertical join of tables in R: tidyverse, dplyr

In dplyr, you can combine 2 tables into one using the bind_rows() function, passing the tables as its arguments.

Vertical join of tables in R: dplyr

# Vertical join of tables
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical join of tables in R: data.table

Nothing complicated here either: we use rbind().

Vertical join of tables in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical join of tables in Python: pandas

In pandas, the concat() function is used to combine tables; you pass it a list of the dataframes to combine.

Vertical join of tables in Python: pandas

# vertical join of tables
pd.concat([ga_nov, ga_dec])
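One detail worth knowing: by default concat() keeps the original row labels of each table, so the combined index contains duplicates. A sketch on made-up data showing the ignore_index=True option, which renumbers the rows.

```python
import pandas as pd

# Two toy tables with identical structure (invented values)
nov = pd.DataFrame({"source": ["google", "bing"], "sessions": [10, 5]})
dec = pd.DataFrame({"source": ["google", "bing"], "sessions": [12, 7]})

kept_labels = pd.concat([nov, dec])                   # index is 0, 1, 0, 1
renumbered = pd.concat([nov, dec], ignore_index=True) # index is 0, 1, 2, 3

print(kept_labels.index.tolist(), renumbered.index.tolist())
```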

Horizontal join of tables

An operation in which columns from a second table are added to the first table by key. It is often used to enrich a fact table (for example, a table with sales data) with some reference data (for example, product cost).

There are several types of joins, for example inner, left, right and full (outer) joins.

In the previously loaded titanic table we have a Sex column, which corresponds to the passenger's gender code:

1 - female
2 - male

We also created a lookup table, gender. For a more convenient presentation of the passengers' gender, we need to add the gender name from the gender lookup to the titanic table.

Horizontal join of tables in R: tidyverse, dplyr

dplyr has a whole family of functions for horizontal joins:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The one I use most often in practice is left_join().

As their first two arguments, the functions listed above take the two tables to be joined, and in the third argument, by, you specify the columns to join on.

Horizontal join of tables in R: dplyr

# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontal join of tables in R: data.table

In data.table, you join tables by key using the merge() function.

Arguments of the merge() function in data.table

  • x, y - the tables to join
  • by - the key column, if it has the same name in both tables
  • by.x, by.y - the names of the columns to join on, if they have different names in the two tables
  • all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to LEFT JOIN (keeps all rows of the first table), all.y corresponds to RIGHT JOIN (keeps all rows of the second table).

Horizontal join of tables in R: data.table

# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)

Horizontal join of tables in Python: pandas

As in data.table, the merge() function is used to join tables in pandas.

Arguments of the merge() function in pandas

  • how - the join type: left, right, outer, inner
  • on - the key column, if it has the same name in both tables
  • left_on, right_on - the names of the key columns, if they have different names in the two tables

Horizontal join of tables in Python: pandas

# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
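The same left join can be run on a small in-memory example. The passenger rows below are made up; only the column names and the gender lookup follow the article.

```python
import pandas as pd

# Invented passenger rows with a numeric gender code
passengers = pd.DataFrame({"PassengerId": [1, 2, 3],
                           "Sex": [2, 1, 1]})

# The lookup table from the "Creating dataframes" section
gender = pd.DataFrame({"id": [1, 2],
                       "gender": ["female", "male"]})

# Left join: every passenger row is kept, the gender name is added by key
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined[["PassengerId", "gender"]])
```

Because the key columns have different names, both Sex and id appear in the result; one of them can be removed afterwards with drop(columns=...).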

Basic window functions and calculated columns

Window functions are similar in meaning to aggregation functions and are also often used in data analysis. But unlike aggregation functions, window functions do not change the number of rows in the output dataframe.

Empeleni, sisebenzisa umsebenzi wewindi, sihlukanisa uhlaka lwedatha olungenayo lube izingxenye ngokuya ngemibandela ethile, i.e. ngevelu yenkambu, noma izinkambu ezimbalwa. Futhi senza imisebenzi ye-arithmetic efasiteleni ngalinye. Umphumela wale misebenzi uzobuyiselwa kulayini ngamunye, i.e. ngaphandle kokushintsha ingqikithi yenani lemigqa kuthebula.

For example, take the titanic table. We can calculate what share of its cabin class's total ticket cost each ticket accounts for.

To do this, we need to obtain, in every row, the total ticket cost for the cabin class of the ticket in that row, and then divide the cost of each ticket by the total cost of all tickets in the same cabin class.

Window functions in R: tidyverse, dplyr

To add new columns without collapsing rows into groups, dplyr provides the mutate() function.

You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field into a new column. Then ungroup the table and divide the values of the Fare field by the result of the previous step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm is the same as in dplyr: we split the table into windows by the Pclass field, write the group total into a new column for every row, and then add a column with the share of each ticket's cost within its group.

To add new columns, data.table provides the := operator. Below is an example of solving the problem with the data.table package.

Window functions in R: data.table

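# the right-hand side is evaluated before the new columns exist,
# so repeat sum(Fare) instead of referring to Pclass_cost
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]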

Window functions in Python: pandas

One way to add a new column in pandas is the assign() function. To sum ticket costs by cabin class without collapsing rows, we will use the transform() function.

Below is an example solution that adds the same 2 columns to the titanic table.

Window functions in Python: pandas

# pass 'sum' as a string: recent pandas versions warn when the builtin sum is passed
titanic.assign(Pclass_cost      = titanic.groupby('Pclass')['Fare'].transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
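To try the same pattern without the real dataset, here is a self-contained sketch on a toy frame. The fares below are made up for illustration; only the column names Pclass and Fare follow the article. The final check is an addition for illustration: within each class the shares should add up to 1.

```python
import pandas as pd

# Toy stand-in for the titanic table (values are made up for illustration)
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Fare":   [80.0, 20.0, 10.0, 10.0, 20.0],
})

result = titanic.assign(
    # total fare of the row's cabin class, repeated on every row
    Pclass_cost=titanic.groupby("Pclass")["Fare"].transform("sum"),
    # the lambda is evaluated after Pclass_cost exists, so it can refer to it
    ticket_fare_rate=lambda x: x["Fare"] / x["Pclass_cost"],
)
print(result)

# Sanity check: within each class the shares add up to 1
share_sums = result.groupby("Pclass")["ticket_fare_rate"].sum()
print(share_sums)
```

Note that assign() returns a new dataframe and leaves titanic itself unchanged, whereas the data.table := operator modifies the table in place.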

Correspondence table of functions and methods

Below is a table showing how the various data operations map onto each other in the packages we have covered.

Description | tidyverse | data.table | pandas
--- | --- | --- | ---
Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv()
Creating dataframes | tibble() | data.table() | dict() + from_dict()
Selecting columns | select() | the j argument, second position in square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes()
Filtering rows | filter() | the i argument, first position in square brackets | list the filter conditions in square brackets / filter()
Grouping and aggregation | group_by() + summarise() | the j + by arguments | groupby() + agg()
Vertical union of tables (UNION) | bind_rows() | rbind() | concat()
Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge()
Basic window functions and adding calculated columns | group_by() + mutate() | the j argument with the := operator + the by argument | transform() + assign()

Conclusion

The implementations of data processing I have described in this article may not be the most optimal ones, so I will be glad if you correct my mistakes in the comments, or simply supplement the article with other techniques for working with data in R / Python.

As I wrote above, the purpose of this article was not to impose an opinion on which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.

If you liked the article, I will be glad to have new subscribers on my YouTube and Telegram channels.

Poll

Which of the following packages do you use in your work?

In the comments you can write the reason for your choice.

Which data processing package do you use (you can select several options)

  • tidyverse: 45.2% (19 votes)
  • data.table: 33.3% (14 votes)
  • pandas: 54.8% (23 votes)

42 users voted. 9 users abstained.

Source: www.habr.com
