Search the Internet for R or Python and you will find millions of articles and kilometers of discussion about which one is better, faster, and more convenient for working with data. Unfortunately, most of those articles and arguments are not particularly useful.
The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up whichever one they do not yet know: those who write in Python can see how to do the same thing in R, and vice versa.
Throughout the article we will look at the syntax of the most popular R packages: the packages that make up the tidyverse library, and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.
We will go step by step through the whole data analysis path, from loading data to applying analytical window functions, using Python and R.
Contents
This article can be used as a cheat sheet if you have forgotten how to perform a particular data processing operation in one of the packages covered.
1. Main syntax differences between R and Python
1.1. Accessing package functions
1.2. Assignment
1.3. Indexing
1.4. Methods and OOP
1.5. Pipelines
1.6. Data structures
2. A few words about the packages we will use
2.1. tidyverse
2.2. data.table
2.3. pandas
3. Installing packages
4. Loading data
5. Creating dataframes
6. Selecting the columns you need
7. Filtering rows
8. Grouping and aggregation
9. Vertical union of tables (UNION)
10. Horizontal join of tables (JOIN)
11. Basic window functions and calculated columns
12. Correspondence table between data processing methods in R and Python
13. Conclusion
14. A short poll on which package you use
Main syntax differences between R and Python
To make it easier for you to switch from Python to R, or vice versa, here are a few key points to pay attention to.
Accessing package functions
Once a package has been loaded in R, you do not need to specify the package name to call its functions. Qualifying every call with the package name is uncommon in R, but it is allowed. You also do not have to load a package at all if you only need one of its functions: just call it by giving the package name and the function name. The separator between the package name and the function name in R is a double colon: package_name::function_name().
In Python, by contrast, the conventional style is to call a package's functions by explicitly stating its name. When a package is imported it is usually given a short alias; for pandas, for example, the alias pd is normally used. A package function is accessed through a dot: package_name.function_name().
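A minimal sketch of the Python convention, assuming pandas is installed; the sample values are made up purely for illustration.

```python
import pandas as pd  # conventional short alias for the package

# Call a package function as alias.function_name()
s = pd.Series([1, 2, 3])

# Alternatively, import a single name and call it without the prefix
from pandas import Series
s2 = Series([1, 2, 3])

print(s.equals(s2))
```

The aliased form is the one used in the rest of the Python examples here.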
Assignment
In R, it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mainly used to pass values to function arguments.
In Python, assignment is done only with a single equals sign: obj_name = value.
Indexing
There are important differences here too. In R, indexing starts at one and the resulting range includes every element specified; in Python, indexing starts at zero and the selected range does not include the last element specified in the index. So the expression x[i:j] in Python will not include element j.
Negative indexing differs as well. In R, x[-1] returns all elements of the vector except the first, because a negative index excludes the element at that position. In Python, the same expression returns only the last element.
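The Python side of these indexing rules can be checked with a plain list. This is only an illustrative sketch; the list x and its values are made up.

```python
x = [10, 20, 30, 40, 50]

first = x[0]           # indexing starts at zero
middle = x[1:3]        # positions 1 and 2; position 3 is excluded
last = x[-1]           # a negative index counts from the end
all_but_last = x[:-1]  # every element except the last

print(first, middle, last, all_but_last)
```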
Methods and OOP
R implements OOP in its own way, which I have written about in a separate article. In my view, switching to tidyverse will be easier than switching to pandas, although this may just be my subjective opinion.
In short, objects in R do not have methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions, which process objects differently depending on their class.
Pipelines
Perhaps this name is not entirely correct for pandas, but I will try to explain what I mean.
To avoid storing intermediate calculations and cluttering the working environment with unnecessary objects, you can use a kind of pipeline, i.e. pass the result of one function on to the next one without saving the intermediate results.
Take the following code example, in which we store intermediate calculations in separate objects:
temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)
We performed 3 operations sequentially, and the result of each was saved in a separate object. But in fact we do not need these intermediate objects.
Or, even worse, though more familiar to Excel users:
obj <- func3(func2(func1()))
In this case we did not save the intermediate results, but code with nested functions is very hard to read.
We will look at several approaches to data processing in R, and they perform similar operations in different ways.
Pipelines in the tidyverse library are implemented with the %>% operator.
obj <- func1() %>%
  func2() %>%
  func3()
Here we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). Finally, the result of all the calculations is written to the object obj.
In data.table, chains are used in a similar way.
newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]
In each pair of square brackets you can use the result of the previous operation.
In pandas, such operations are separated by a dot.
obj = df.fun1().fun2().fun3()
That is, we take our table df and call its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is stored in the object obj.
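To make the chaining idea concrete, here is a small sketch with real pandas methods in place of the placeholder names fun1(), fun2(), fun3(). The dataframe and its values are made up for illustration; wrapping the chain in parentheses is just one way to split it over several lines.

```python
import pandas as pd

df = pd.DataFrame({"source": ["google", "bing", "google"],
                   "sessions": [10, 5, 7]})

# Each method returns a new object, so calls can be chained with dots
obj = (
    df.query("source == 'google'")                  # keep only google rows
      .assign(double=lambda d: d["sessions"] * 2)   # add a calculated column
      .sort_values("sessions")                      # order the result
      .reset_index(drop=True)                       # renumber the rows
)
print(obj)
```

No intermediate objects are left behind: only df and obj exist after the chain runs.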
Data structures
The data structures in R and Python are similar, but they have different names.
| Description | Name in R | Name in Python/pandas |
| --- | --- | --- |
| Tabular structure | data.frame, data.table, tibble | DataFrame |
| One-dimensional list of values | Vector | Series in pandas, or a list in pure Python |
| Multi-level non-tabular structure | List | Dictionary (dict) |
We will look at other features and syntax differences below.
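As a quick illustration of the Python column of that comparison, the sketch below builds one object of each kind. All names and values are made up for the example.

```python
import pandas as pd

values = [1, 2, 3]                       # plain Python list
series = pd.Series(values, name="x")     # one-dimensional pandas Series
nested = {"id": 1, "tags": ["a", "b"]}   # dict: multi-level, non-tabular
table = pd.DataFrame({"x": values, "y": ["a", "b", "c"]})  # tabular structure

print(type(series).__name__, type(table).__name__, table.shape)
```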
A few words about the packages we will use
First, I will tell you a little about the packages you will get to know in this article.
tidyverse
Official website:
The tidyverse library was written by Hadley Wickham, Chief Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.
The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example, dplyr is designed for data manipulation, tidyr for bringing data into a tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.
The advantage of tidyverse is its simple, easy-to-read syntax, which in many ways resembles the SQL query language.
data.table
The author of data.table is Matt Dowle of H2O.ai.
The first release of the library took place in 2006.
The package syntax is not as convenient as that of tidyverse and is more reminiscent of classic dataframes in R, but it is significantly extended in functionality.
All table manipulations in this package are described in square brackets, and if you translate the data.table syntax into SQL you get something like this: data.table[ WHERE, SELECT, GROUP BY ]
The strength of this package is the speed with which it processes large volumes of data.
pandas
Official website:
The name of the library comes from the econometric term "panel data", which is used to describe multidimensional structured data sets.
The author of pandas is the American Wes McKinney.
When it comes to data analysis in Python, pandas has no equal. It is a highly versatile, high-level package that lets you perform any manipulation of data, from loading data from any source to visualizing it.
Installing packages
The packages discussed in this article are not included in the base R and Python distributions. There is one small caveat, though: if you installed the Anaconda distribution, you do not need to install pandas separately.
Installing packages in R
If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard install.packages() command, running it directly in R.
# install the packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")
After installation the packages need to be attached, which in most cases is done with the library() command.
# attach (import) the packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)
Installing packages in Python
If you have plain Python installed, you need to install pandas manually. Open a command prompt or terminal, depending on your operating system, and enter the following command.
pip install pandas
Then we go back to Python and import the installed package with the import command.
import pandas as pd
Loading data
Getting the data is one of the most important steps in data analysis. Python and R both give you extensive options for obtaining data from any source: local files, files from the Internet, websites, all kinds of databases.
Throughout the article we will use several datasets:
- Two exports from Google Analytics.
- The Titanic passenger dataset.
All the data is available on my GitHub.
Loading data in R: tidyverse, vroom, readr
For loading data, the tidyverse library has two packages: vroom and readr. vroom is the more modern one, but the packages may be merged in the future.
A quote from the vroom documentation:
vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.
In this section we will look at both data loading packages:
Loading data in R: vroom package
# install.packages("vroom")
library(vroom)
# Read the data
## vroom
ga_nov <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in R: readr
# install.packages("readr")
library(readr)
# Read the data
## readr
ga_nov <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
In the vroom package, regardless of whether the data is in csv or tsv format, loading is done with the function of the same name, vroom(). In the readr package we use a different function for each format: read_tsv() and read_csv().
Loading data in R: data.table
In data.table, the function for loading data is fread().
Loading data in R: data.table package
# install.packages("data.table")
library(data.table)
## data.table
ga_nov <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")
Loading data in Python: pandas
Compared with the R packages, the closest in syntax to pandas is readr, because pandas can request data from almost anywhere, and the package has a whole family of read_*() functions:
read_csv()
read_excel()
read_sql()
read_json()
read_html()
And many other functions designed for reading data from various formats. For our purposes, read_table() or read_csv() is enough, with the sep argument specifying the column delimiter.
Loading data in Python: pandas
import pandas as pd
# the Google Analytics exports are tab-separated, hence sep="\t"
ga_nov = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep="\t")
ga_dec = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep="\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
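The effect of the sep argument can be tried without network access. The sketch below reads a tab-separated sample from memory; the sample text is hypothetical and only imitates the layout of such an export.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for a tab-separated export
raw = "date\tsource\tsessions\n2019-11-01\tgoogle\t10\n2019-11-02\tbing\t5\n"

# sep="\t" tells read_csv that columns are separated by tabs
sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample.dtypes)
```

With the default comma separator the same text would be read as a single column, which is an easy mistake to spot by checking the column names after loading.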
Creating dataframes
The titanic table that we loaded has a field Sex, which stores the passenger's gender identifier.
But for a more convenient presentation of the data by passenger gender, it is better to use the name rather than the gender code.
To do this we will create a small lookup table with only 2 columns (code and gender name) and, accordingly, 2 rows.
Creating a dataframe in R: tidyverse, dplyr
In the code example below we create the dataframe we need with the tibble() function.
Creating a dataframe in R: dplyr
## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))
Creating a dataframe in R: data.table
## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                     gender = c("female", "male"))
Creating a dataframe in Python: pandas
In pandas, dataframes are created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.
Creating a dataframe in Python: pandas
# create the dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)
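As a side note, the same dictionary can also be passed straight to the DataFrame constructor. The sketch below compares the two routes; the variable names gender_direct and gender_from_dict are introduced here only for the comparison.

```python
import pandas as pd

gender_dict = {"id": [1, 2], "gender": ["female", "male"]}

# Passing the dictionary straight to the constructor
gender_direct = pd.DataFrame(gender_dict)

# The two-step from_dict() route
gender_from_dict = pd.DataFrame.from_dict(gender_dict)

print(gender_direct.equals(gender_from_dict))
```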
Selecting the columns you need
The tables you work with may contain dozens or even hundreds of columns. But for analysis, as a rule, you do not need all the columns available in the source table.
So one of the first operations you will perform on the source table is to strip out unnecessary information and free the memory it occupies.
Selecting columns in R: tidyverse, dplyr
The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL you will pick up this package quickly.
To select columns, use the select() function.
Below are code examples that select columns in the following ways:
- By listing the names of the columns you need
- By matching column names with regular expressions
- By data type or any other property of the data in the column
Selecting columns in R: dplyr
# Selecting the columns you need
## dplyr
### select by column names
select(ga_nov, date, source, sessions)
### exclude by column names
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: only integer columns
select_if(ga_nov, is.integer)
Selecting columns in R: data.table
The same operations are done slightly differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping
Selecting columns in R: data.table
## data.table
### select by column names
ga_nov[ , .(date, source, sessions) ]
### exclude by column names
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]
The .SD variable gives you access to all columns, and .SDcols filters the columns you need using regular expressions or other functions that filter column names.
Selecting columns in Python: pandas
To select columns by name in pandas, it is enough to pass a list of their names. To exclude columns by name use drop(), and to select columns by a regular expression use filter(); in both cases pass the argument axis=1, which indicates that columns rather than rows should be processed.
To select fields by data type, use the select_dtypes() function, passing to its include or exclude argument a list of the data types of the fields you want to select.
Selecting columns in Python: pandas
# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])
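The same selections can be tried on a small self-contained table. The dataframe below is made up and only borrows the column names used in the examples above; drop(columns=...) is shown as an equivalent spelling of drop(..., axis=1).

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-11-01", "2019-11-02"],
                   "source": ["google", "bing"],
                   "medium": ["organic", "cpc"],
                   "sessions": [10, 5],
                   "bounces": [3, 1]})

by_name = df[["date", "source", "sessions"]]       # list of column names
dropped = df.drop(columns=["medium", "bounces"])   # same as drop(..., axis=1)
by_regex = df.filter(regex="s$", axis=1)           # names ending in "s"
numeric = df.select_dtypes(include=["number"])     # numeric columns only

print(list(by_regex.columns), list(numeric.columns))
```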
Filtering rows
For example, the source table may contain several years of data, while you only need to analyze the last month. Again, extra rows slow down data processing and clog up the PC's memory.
Filtering rows in R: tidyverse, dplyr
In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, followed by the filter conditions.
When writing logical expressions to filter a table, specify column names without quotes and without the table name.
When filtering on several logical expressions, use the following operators:
- & or a comma - logical AND
- | - logical OR
Filtering rows in R: dplyr
# filtering rows
## dplyr
### filter rows by a single condition
filter(ga_nov, source == "google")
### filter by two conditions joined with logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined with logical OR
filter(ga_nov, source == "google" | sessions >= 10)
Filtering rows in R: data.table
As I wrote above, in data.table the data transformation syntax is enclosed in square brackets.
DT[i,j,by]
Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping
Rows are filtered with the i argument, which occupies the first position in the square brackets.
Columns are referred to in logical expressions without quotes and without the table name.
Logical expressions are combined in the same way as in dplyr, with the & and | operators.
Filtering rows in R: data.table
## data.table
### filter rows by a single condition
ga_nov[source == "google"]
### filter by two conditions joined with logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined with logical OR
ga_nov[source == "google" | sessions >= 10]
Filtering rows in Python: pandas
Row filtering in pandas is similar to filtering in data.table and is done in square brackets.
In this case columns are accessed by giving the dataframe name, after which the column name can be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).
If you need to filter a dataframe on several conditions, each condition must be wrapped in parentheses. Conditions are combined with the & and | operators.
Filtering rows in Python: pandas
# Filtering table rows
### filter rows by a single condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined with logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined with logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
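For comparison, here is a small sketch of the same filters on made-up data, together with the query() method, which is not used in the article but expresses a condition as a string and avoids repeating the dataframe name.

```python
import pandas as pd

df = pd.DataFrame({"source": ["google", "bing", "google", "yandex"],
                   "sessions": [12, 15, 4, 2]})

# Each condition in parentheses, combined with & (AND) or | (OR)
both = df[(df["source"] == "google") & (df["sessions"] >= 10)]
either = df[(df["source"] == "google") | (df["sessions"] >= 10)]

# query() expresses the same AND condition as a string
both_q = df.query("source == 'google' and sessions >= 10")

print(len(both), len(either), both.equals(both_q))
```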
Grouping and aggregating data
One of the most frequently used operations in data analysis is grouping and aggregation.
The syntax for these operations differs across all the packages we are looking at.
In this case we will take the titanic dataframe as an example and calculate the number of tickets and the average ticket price by cabin class.
Grouping and aggregating data in R: tidyverse, dplyr
In dplyr, the group_by() function is used for grouping and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go into that jungle.
Basic aggregation functions:
- sum() - summation
- min() / max() - minimum and maximum value
- mean() - mean
- median() - median
- length() - count
Grouping and aggregation in R: dplyr
## dplyr
### grouping and aggregating rows
group_by(titanic, Pclass) %>%
  summarise(passengers = length(PassengerId),
            avg_price = mean(Fare))
We passed the titanic table to group_by() as the first argument and then specified the field Pclass, by which the table is grouped. The result of this operation was passed with the %>% operator as the first argument to summarise(), where we added 2 fields: passengers and avg_price. In the first, the length() function counted the number of tickets; in the second, the mean() function gave the average ticket price.
Grouping and aggregating data in R: data.table
In data.table, the j argument, which occupies the second position in the square brackets, is used for aggregation, and by or keyby, which occupy the third position, are used for grouping.
The list of aggregation functions here is the same as described for dplyr, because these are functions from base R.
Grouping and aggregation in R: data.table
## data.table
### grouping and aggregating rows
titanic[, .(passengers = length(PassengerId),
            avg_price = mean(Fare)),
        by = Pclass]
Grouping and aggregating data in Python: pandas
Grouping in pandas is similar to dplyr, but aggregation is unlike either dplyr or data.table.
To group, use the groupby() method, passing it the list of columns by which the dataframe will be grouped.
For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions are applied, and the values are the names of the aggregation functions.
Aggregation functions:
- sum() - summation
- min() / max() - minimum and maximum value
- mean() - mean
- median() - median
- count() - count
The reset_index() function in the example below is used to reset the nested indexes that pandas applies by default after aggregating data.
The \ symbol lets you continue an expression on the next line.
Grouping and aggregation in Python: pandas
# grouping and aggregating data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
    reset_index()
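A variation worth knowing is named aggregation, where agg() receives keyword arguments and the result columns get explicit names, much like in summarise(). This is a sketch on made-up data rather than the article's own approach; it assumes pandas 0.25 or later, and the output column names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                   "Pclass": [1, 1, 2, 3],
                   "Fare": [80.0, 60.0, 20.0, 8.0]})

# Parentheses around the chain replace the backslash line continuation
summary = (
    df.groupby("Pclass")
      .agg(passengers=("PassengerId", "count"),   # new name = (column, function)
           avg_price=("Fare", "mean"))
      .reset_index()
)
print(summary)
```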
Vertical union of tables
An operation in which you combine two or more tables with the same structure. The data we loaded includes the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns with the same data types.
These are exports from Google Analytics for November and December; in this section we will combine this data into one table.
Vertical union of tables in R: tidyverse, dplyr
In dplyr, you can combine 2 tables into one with the bind_rows() function, passing the tables as its arguments.
Vertical union of tables in R: dplyr
# Vertical union of tables
## dplyr
bind_rows(ga_nov, ga_dec)
Vertical union of tables in R: data.table
Nothing complicated here either: we use rbind().
Vertical union of tables in R: data.table
## data.table
rbind(ga_nov, ga_dec)
Vertical union of tables in Python: pandas
In pandas, the concat() function is used to combine tables; you pass it a list of dataframes to combine.
Vertical union of tables in Python: pandas
# vertical union of tables
pd.concat([ga_nov, ga_dec])
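One detail to keep in mind is that concat() keeps the original row labels of each table by default. The sketch below, on made-up data, uses the ignore_index argument to renumber the rows of the combined table.

```python
import pandas as pd

nov = pd.DataFrame({"source": ["google", "bing"], "sessions": [10, 5]})
dec = pd.DataFrame({"source": ["google", "bing"], "sessions": [12, 7]})

# Without ignore_index the row labels 0, 1 would repeat in the result
combined = pd.concat([nov, dec], ignore_index=True)
print(combined)
```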
Horizontal join of tables
An operation in which columns from a second table are added to the first by key. It is often used to enrich a fact table (for example, a table of sales data) with reference data (for example, the price of a product).
There are several types of join, such as inner, left, right, and full joins.
The previously loaded titanic table has a column Sex, which holds the passenger's gender code:
1 - female
2 - male
We also created a lookup table, gender. For a more convenient presentation of the data on passenger gender, we need to add the gender name from the gender lookup to the titanic table.
Horizontal join of tables in R: tidyverse, dplyr
In dplyr there is a whole family of functions for horizontal joins:
inner_join()
left_join()
right_join()
full_join()
semi_join()
nest_join()
anti_join()
The one I use most in practice is left_join().
As their first two arguments, the functions listed above take the two tables to join, and in the third argument, by, you specify the columns to join on.
Horizontal join of tables in R: dplyr
# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))
Horizontal join of tables in R: data.table
In data.table, tables are joined by key with the merge() function.
Arguments of the merge() function in data.table
- x, y - the tables to join
- by - the key column for the join, if it has the same name in both tables
- by.x, by.y - the names of the columns to join on, if they have different names in the tables
- all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), all.y corresponds to a RIGHT JOIN (keeps all rows of the second table).
Horizontal join of tables in R: data.table
# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)
Horizontal join of tables in Python: pandas
Just as in data.table, in pandas the merge() function is used to join tables.
Arguments of the merge() function in pandas
- how - the join type: left, right, outer, inner
- on - the key column, if it has the same name in both tables
- left_on, right_on - the names of the key columns, if they have different names in the tables
Horizontal join of tables in Python: pandas
# join by key
titanic.merge(gender, how="left", left_on="Sex", right_on="id")
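The join can be checked on a tiny stand-in for the titanic table. The passengers dataframe below is made up; dropping the id column afterwards is an optional tidy-up, since the key is already present as Sex.

```python
import pandas as pd

passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 1]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# Left join: every passenger row is kept and gains the gender name
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
joined = joined.drop(columns="id")   # the key is already present as Sex
print(joined)
```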
Basic window functions and calculated columns
Window functions are similar in meaning to aggregation functions and are also often used in data analysis. But unlike aggregation functions, window functions do not change the number of rows in the resulting dataframe.
In essence, with a window function we split the incoming dataframe into parts according to some criterion, i.e. by the value of one field or several fields, and perform arithmetic operations on each window. The result of these operations is returned in every row, i.e. without changing the total number of rows in the table.
For example, take the titanic table. We can calculate what percentage each ticket's price made up within its cabin class.
To do this, we need to get, in every row, the total ticket cost for the cabin class to which the ticket in that row belongs, and then divide the price of each ticket by the total cost of all tickets of the same class.
Window functions in R: tidyverse, dplyr
To add new columns without grouping rows, dplyr has the mutate() function.
You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field in a new column. Then ungroup the table and divide the values of the Fare field by the result of the previous step.
Window functions in R: dplyr
group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)
Window functions in R: data.table
The algorithm stays the same as in dplyr: we need to split the table into windows by the Pclass field, output in a new column the group total for each row, and add a column in which we calculate each ticket's share of the total for its group.
To add new columns, data.table has the := operator. Below is an example of solving the problem with the data.table package.
Window functions in R: data.table
# Pclass_cost does not exist yet while j is evaluated,
# so the share is computed from sum(Fare) directly
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]
Window functions in Python: pandas
One way to add a new column in pandas is the assign() function. To sum ticket prices by cabin class without grouping rows, we will use the transform() function.
Below is an example of a solution in which we add the same 2 columns to the titanic table.
Window functions in Python: pandas
titanic.assign(Pclass_cost=titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate=lambda x: x['Fare'] / x['Pclass_cost'])
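A quick way to convince yourself that this behaves like a window function is to run it on a few made-up rows and check that the row count is unchanged and that the shares within each class add up to 1. The data below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2, 2],
                   "Fare": [80.0, 20.0, 10.0, 10.0, 20.0]})

result = df.assign(
    # class total repeated on every row of that class
    Pclass_cost=df.groupby("Pclass")["Fare"].transform("sum"),
    # share of each ticket in its class total
    ticket_fare_rate=lambda x: x["Fare"] / x["Pclass_cost"],
)

# The row count is unchanged and shares within each class add up to 1
print(result.groupby("Pclass")["ticket_fare_rate"].sum())
```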
Correspondence table of functions and methods
Below is a table of correspondence between the ways of performing various data operations in the packages we have covered.
| Description | tidyverse | data.table | pandas |
| --- | --- | --- | --- |
| Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv() |
| Creating dataframes | tibble() | data.table() | dict() + from_dict() |
| Selecting columns | select() | argument j, second position in the square brackets | pass a list of the needed columns in square brackets / drop() / filter() / select_dtypes() |
| Filtering rows | filter() | argument i, first position in the square brackets | write the filter conditions in square brackets / filter() |
| Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg() |
| Vertical union of tables (UNION) | bind_rows() | rbind() | concat() |
| Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge() |
| Basic window functions and adding calculated columns | group_by() + mutate() | argument j with the := operator + argument by | transform() + assign() |
Conclusion
Perhaps the implementations I described in this article are not the most optimal ways of processing data, so I will be glad if you correct my mistakes in the comments, or simply supplement the information given here with other techniques for working with data in R / Python.
As I wrote above, the purpose of the article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.
Poll
Which of the following packages do you use in your work?
In the comments you can write the reason for your choice.
Which data processing package do you use (you can choose several options)
- tidyverse: 45.2% (19 votes)
- data.table: 33.3% (14 votes)
- pandas: 54.8% (23 votes)
42 users voted. 9 users abstained.
Source: www.habr.com