Which language should you choose for working with data: R or Python? Both! Migrating from pandas to tidyverse and data.table and back

Search the Internet for "R or Python" and you will find millions of articles and kilometers of discussion about which one is better, faster, and more convenient for working with data. Unfortunately, all these articles and arguments are not particularly useful.

The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up something they do not yet know: for those who write in Python, how to do the same thing in R, and vice versa.

Along the way we will look at the syntax of the most popular R packages: the packages that make up the tidyverse library, and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.

We will go step by step through the whole data analysis path, from loading data to computing analytical window functions, in both Python and R.

Contents

This article can be used as a cheat sheet if you have forgotten how to perform some data processing operation in one of the packages under consideration.

  1. Main syntax differences between R and Python
    1.1. Accessing package functions
    1.2. Assignment
    1.3. Indexing
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Installing packages
  4. Loading data
  5. Creating dataframes
  6. Selecting the columns you need
  7. Filtering rows
  8. Grouping and aggregation
  9. Vertical union of tables (UNION)
  10. Horizontal join of tables (JOIN)
  11. Basic window functions and calculated columns
  12. Correspondence table of data processing methods in R and Python
  13. Conclusion
  14. A short poll on which package you use

If you are interested in data analysis, you may find my Telegram and YouTube channels useful. Most of the content is devoted to the R language.

Main syntax differences between R and Python

To make it easier for you to switch from Python to R, or vice versa, here are a few key points to pay attention to.

Accessing package functions

Once a package has been loaded in R, you do not need to specify the package name to access its functions. Qualifying calls with the package name is uncommon in R, but it is allowed. You do not even have to attach a package if you only need one of its functions in your code: just call it by giving the package name and the function name. The separator between package and function names in R is a double colon: package_name::function_name().

In Python, by contrast, it is customary to call a package's functions by explicitly specifying its name. When a package is imported it is usually given a short alias; for example, pandas is usually imported under the alias pd. A package function is accessed through a dot: package_name.function_name().
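
As a small illustration of the Python convention, the sketch below uses the standard-library statistics module instead of pandas so that it runs without installing anything. The alias st and the sample values are arbitrary choices made for this example, not something from the packages discussed here.

```python
# Import a whole module under a short alias, then reach its functions with a dot.
import statistics as st

values = [3, 5, 7, 9]          # sample values invented for this example
avg = st.mean(values)          # alias.function_name()

# Import a single function by name when only one is needed.
from statistics import median

mid = median(values)
print(avg, mid)
```

The first form mirrors import pandas as pd; the second is the closest Python counterpart to calling a single function without attaching the whole package.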

Assignment

In R, it is customary to use the arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mainly used to pass values to function arguments.

In Python, assignment is done only with a single equals sign: obj_name = value.

Indexing

There are significant differences here too. In R, indexing starts at one, and a range includes every element specified, including both endpoints.

In Python, indexing starts at zero, and the selected range does not include the last element specified in the index. So the construct x[i:j] in Python does not include element j.

Negative indexing differs as well. In R, x[-1] returns all elements of the vector except the first. In Python, the same notation returns only the last element.
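
The Python side of these rules can be checked with a plain list. This is a minimal sketch; the list x and its values are invented for the illustration.

```python
x = [10, 20, 30, 40, 50]

first = x[0]       # indexing starts at zero
middle = x[1:3]    # positions 1 and 2; position 3 is excluded
last = x[-1]       # a negative index counts from the end

print(first, middle, last)
```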

Methods and OOP

R implements OOP in its own way; I wrote about this in the article "OOP in the R language (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. That is why, for Excel users for example, moving to tidyverse will be easier than moving to pandas. Though this may be my subjective opinion.

In short, objects in R have no methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions that process objects differently depending on their class.

Pipelines

This name may not be entirely accurate for pandas, but I will try to explain what I mean.

To avoid saving intermediate calculations and cluttering the working environment with unnecessary objects, you can use a kind of pipeline: pass the result of one function to the next without storing intermediate results.

Take the following code example, in which we store intermediate calculations in separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)

We performed 3 operations sequentially, and the result of each was saved in a separate object. But in fact we do not need these intermediate objects.

Or even worse, but more familiar to Excel users:

obj  <- func3(func2(func1()))

In this case we did not save the intermediate results, but code with nested functions is very hard to read.

We will look at several approaches to data processing in R, and they implement such chains of operations in different ways.

Pipelines in the tidyverse library are implemented with the %>% operator.

obj <- func1() %>% 
            func2() %>%
            func3()

Thus we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). Finally, we write the result of all the calculations into the object with obj <-.

All of the above is illustrated better than words by this meme:
[Image: meme]

In data.table, chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each pair of square brackets you can use the result of the previous operation.

In pandas, such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is stored in the object obj.
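
To make the placeholder chain concrete, here is a minimal sketch with real pandas methods. The dataframe and its values are invented for the example, and query(), sort_values() and reset_index() merely stand in for fun1(), fun2() and fun3().

```python
import pandas as pd

# Toy data invented for this example.
df = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 3, 8, 5]})

# Each method returns a new DataFrame, so the calls can be chained with dots.
result = (
    df.query("sessions >= 5")
      .sort_values("sessions", ascending=False)
      .reset_index(drop=True)
)
print(result)
```

Wrapping the chain in parentheses is one way to spread it over several lines without continuation characters.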

Data structures

Data structures in R and Python are similar, but they have different names.

| Description | Name in R | Name in Python/pandas |
|---|---|---|
| Tabular structure | data.frame, data.table, tibble | DataFrame |
| One-dimensional list of values | Vector | Series in pandas, or a list in pure Python |
| Multi-level non-tabular structure | List | Dictionary (dict) |

We will look at some other features and syntax differences below.

A few words about the packages we will use

First, I will tell you a little about the packages you will get to know in this article.

tidyverse

Official website: tidyverse.org

The tidyverse library was written by Hadley Wickham, Chief Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.

The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example, dplyr is designed for data manipulation, tidyr for bringing data into a tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage of tidyverse is its simple, readable syntax, which in many ways resembles the SQL query language.

data.table

Official website: r-datatable.com

The author of data.table is Matt Dowle of H2O.ai.

The first release of the library took place in 2006.

The package syntax is not as convenient as in tidyverse and is more reminiscent of classic dataframes in R, but it is significantly extended in functionality.

All table manipulations in this package are described in square brackets, and if you translate the data.table syntax into SQL you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed with which it processes large amounts of data.

pandas

Official website: pandas.pydata.org

The name of the library comes from the econometric term "panel data", used to describe multidimensional structured sets of information.

The author of pandas is the American Wes McKinney.

When it comes to data analysis in Python, pandas has no equal. It is a highly multifunctional, high-level package that lets you perform any manipulation of data, from loading data from any source to visualizing it.

Installing packages

The packages discussed in this article are not included in the base R and Python distributions. There is one small caveat, though: if you installed the Anaconda distribution, you do not need to install pandas separately.

Installing packages in R

If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard install.packages() command, running it directly in R.

# install packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, the packages need to be attached, which in most cases is done with the library() command.

# attach (import) the packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

If you have plain Python installed, you need to install pandas manually. Open a command prompt or terminal, depending on your operating system, and enter the following command.

pip install pandas

Then go back to Python and import the installed package with the import command.

import pandas as pd

Loading data

Loading data is one of the most important steps in data analysis. Both Python and R give you extensive options for getting data from any source: local files, files from the Internet, websites, and all kinds of databases.

Throughout the article we will use several datasets:

  1. Two exports from Google Analytics.
  2. The Titanic passenger dataset.

All the data is on my GitHub as csv and tsv files, and that is where we will request it from.

Loading data in R: tidyverse, vroom, readr

For loading data, the tidyverse library has two packages: vroom and readr. vroom is the more modern one, but the packages may be merged in the future.

A quote from the official vroom documentation:

vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.

In this section we will look at both data loading packages:

Loading data in R: the vroom package

# install.packages("vroom")
library(vroom)

# Read the data
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in R: readr

# install.packages("readr")
library(readr)

# Read the data
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the vroom package, regardless of whether the data is in csv or tsv format, loading is done by the function of the same name, vroom(); in the readr package we use a different function for each format: read_tsv() and read_csv().

Loading data in R: data.table

data.table has the fread() function for loading data.

Loading data in R: the data.table package

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in Python: pandas

Compared with the R packages, the closest in syntax to pandas here is readr, because pandas can request data from almost anywhere and has a whole family of read_*() functions.

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

And many other functions designed for reading data from various formats. For our purposes, read_table() or read_csv() with the sep argument to specify the column separator is enough.

Loading data in Python: pandas

import pandas as pd

ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep="\t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep="\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
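
The effect of the sep argument can be tried offline. The sketch below is only an illustration: the in-memory text and its three columns are invented stand-ins for a real tab-separated file, and io.StringIO is used so that no download is needed.

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for a real file (invented data).
raw = "date\tsource\tsessions\n2019-11-01\tgoogle\t12\n2019-11-01\tbing\t3\n"

# sep="\t" tells read_csv that the columns are separated by tabs.
sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample)
```

Without sep="\t", read_csv would assume commas and load each line into a single column.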

Creating dataframes

The titanic table that we loaded has a Sex field, which stores the passenger's gender identifier.

But for a more convenient presentation of the data by passenger gender, it is better to use the name rather than the gender code.

To do this, we will create a small lookup table with just 2 columns (code and gender name) and, accordingly, 2 rows.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the required dataframe with the tibble() function.

Creating a dataframe in R: dplyr

## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                     gender = c("female", "male"))

Creating a dataframe in Python: pandas

In pandas, dataframes are created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# create a dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)

Selecting columns

The tables you work with may contain dozens or even hundreds of columns of data. But for an analysis you do not, as a rule, need all the columns available in the source table.

Therefore, one of the first operations you will perform on the source table is to strip out unneeded information and free up the memory it occupies.

Selecting columns in R: tidyverse, dplyr

The syntax of dplyr is very similar to the SQL query language; if you are familiar with SQL, you will master this package quickly.

To select columns, use the select() function.

Below are code examples that select columns in the following ways:

  • By listing the names of the required columns
  • By matching column names with regular expressions
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Selecting the required columns
## dplyr
### select by column names
select(ga_nov, date, source, sessions)
### exclude by column names
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: only integer columns
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations are performed a little differently in data.table. At the beginning of the article I described which arguments go inside the square brackets in data.table.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting and transforming columns
by - data grouping

Selecting columns in R: data.table

## data.table
### select by column names
ga_nov[ , .(date, source, sessions) ]
### exclude by column names
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]

The .SD variable gives you access to all columns, and .SDcols filters the required columns using regular expressions or other functions that filter the column names you need.

Selecting columns in Python: pandas

To select columns by name in pandas, it is enough to pass a list of their names. To exclude columns by name use drop(), and to select them by regular expression use filter(); both take the argument axis=1, which indicates that columns rather than rows should be processed.

To select fields by data type, use the select_dtypes() function, passing to its include or exclude arguments a list of the data types of the fields you want to select.

Selecting columns in Python: pandas

# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])
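
The same selections can be tried without downloading anything. The sketch below uses a small invented dataframe named ga whose column names mimic the Google Analytics export; the values are made up for the illustration.

```python
import pandas as pd

# Invented stand-in for the ga_nov table.
ga = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02"],
    "source": ["google", "bing"],
    "medium": ["organic", "cpc"],
    "sessions": [12, 3],
    "bounces": [4, 1],
})

by_name = ga[["date", "source", "sessions"]]        # list of names
dropped = ga.drop(["medium", "bounces"], axis=1)    # exclude by name
ends_with_s = ga.filter(regex="s$", axis=1)         # names ending in "s"
numeric = ga.select_dtypes(include=["number"])      # numeric columns only

print(list(ends_with_s.columns), list(numeric.columns))
```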

Filtering rows

For example, the source table may contain several years of data, while you only need to analyze the last month. Again, extra rows slow down processing and clog up the PC's memory.

Filtering rows in R: tidyverse, dplyr

In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, followed by the filter conditions.

When writing logical expressions to filter a table, specify column names without quotes and without the table name.

When filtering with several logical expressions, use the following operators:

  • & or a comma - logical AND
  • | - logical OR

Filtering rows in R: dplyr

# filtering rows
## dplyr
### filter rows by a single condition
filter(ga_nov, source == "google")
### filter by two conditions joined by logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined by logical OR
filter(ga_nov, source == "google" | sessions >= 10)

Filtering rows in R: data.table

As I wrote above, in data.table the data transformation syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting and transforming columns
by - data grouping

Rows are filtered with the i argument, which has the first position in the square brackets.

Columns are referenced in logical expressions without quotes and without the table name.

Logical expressions are combined in the same way as in dplyr, with the & and | operators.

Filtering rows in R: data.table

## data.table
### filter rows by a single condition
ga_nov[source == "google"]
### filter by two conditions joined by logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined by logical OR
ga_nov[source == "google" | sessions >= 10]

Filtering rows in Python: pandas

Row filtering in pandas is similar to filtering in data.table and is done in square brackets.

Here columns must be referenced through the dataframe name; the column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).

If you need to filter a dataframe by several conditions, each condition must be wrapped in parentheses. The conditions are combined with the & and | operators.

Filtering rows in Python: pandas

# Filtering table rows
### filter rows by a single condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined by logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined by logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
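
A self-contained version of these filters, on invented data, shows how many rows each condition keeps. The dataframe ga and its values are assumptions made for the example.

```python
import pandas as pd

# Invented stand-in for the ga_nov table.
ga = pd.DataFrame({"source": ["google", "google", "bing", "direct"],
                   "sessions": [12, 4, 15, 2]})

only_google = ga[ga["source"] == "google"]
google_and_busy = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]
google_or_busy = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

print(len(only_google), len(google_and_busy), len(google_or_busy))
```

The parentheses around each condition matter because & and | bind more tightly than == and >= in Python.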

Grouping and aggregating data

Grouping and aggregation are among the most frequently used operations in data analysis.

The syntax for these operations differs across all the packages we are looking at.

In this case we will take the titanic dataframe as an example and count the number of tickets and their average price by cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

In dplyr, the group_by() function is used for grouping and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go into that jungle.

Basic aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • length() - count

Grouping and aggregation in R: dplyr

## dplyr
### grouping and aggregating rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

We passed the titanic table as the first argument to group_by() and then specified the Pclass field by which to group the table. The result of this operation was passed with the %>% operator as the first argument to summarise(), where we added 2 more fields: passangers and avg_price. In the first, the length() function counted the number of tickets; in the second, the mean() function computed the average ticket price.

Grouping and aggregating data in R: data.table

In data.table, aggregation uses the j argument, which has the second position in the square brackets, and grouping uses by or keyby, which have the third position.

The list of aggregation functions here is identical to the one described for dplyr, because these are functions from base R.

Grouping and aggregation in R: data.table

## data.table
### grouping and aggregating rows
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and aggregating data in Python: pandas

Grouping in pandas is similar to dplyr, but aggregation resembles neither dplyr nor data.table.

To group, use the groupby() method, passing it the list of columns by which the dataframe will be grouped.

For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions are applied, and the values are the names of the aggregation functions.

Aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • count() - count

The reset_index() function in the example below is used to reset the nested indexes that pandas produces by default after aggregating data.

The \ symbol lets you continue a statement on the next line.

Grouping and aggregation in Python: pandas

# grouping and aggregating data
titanic.groupby(["Pclass"]) \
    .agg({'PassengerId': 'count', 'Fare': 'mean'}) \
    .reset_index()
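
As a variation, the sketch below uses pandas named aggregation instead of the dictionary form, so that the output columns get their own names (passengers and avg_price), much like in the R examples. This is a different agg() calling style from the one used in the article, and the toy dataframe is invented for the illustration.

```python
import pandas as pd

# Invented miniature of the titanic table.
titanic_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Named aggregation: new_column=(source_column, function_name).
summary = (
    titanic_toy.groupby("Pclass")
               .agg(passengers=("PassengerId", "count"),
                    avg_price=("Fare", "mean"))
               .reset_index()
)
print(summary)
```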

Vertical union of tables

An operation in which you combine two or more tables with the same structure. The data we loaded includes the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.

These are exports from Google Analytics for November and December; in this section we will combine this data into one table.

Vertical union of tables in R: tidyverse, dplyr

In dplyr you can combine 2 tables into one with the bind_rows() function, passing the tables as its arguments.

Vertical union of tables in R: dplyr

# Vertical union of tables
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical union of tables in R: data.table

Nothing complicated here either: we use rbind().

Vertical union of tables in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical union of tables in Python: pandas

In pandas, the concat() function is used to combine tables; you pass it a list of the dataframes to combine.

Vertical union of tables in Python: pandas

# vertical union of tables
pd.concat([ga_nov, ga_dec])
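
One detail worth seeing on a small example is what happens to the row index. In the sketch below the two frames are invented stand-ins for the monthly exports, and ignore_index=True is an optional argument (not used in the article) that renumbers the rows of the combined table.

```python
import pandas as pd

# Invented stand-ins for the November and December exports.
nov = pd.DataFrame({"date": ["2019-11-01"], "sessions": [12]})
dec = pd.DataFrame({"date": ["2019-12-01", "2019-12-02"], "sessions": [7, 9]})

# Without ignore_index=True each frame keeps its own 0-based index.
combined = pd.concat([nov, dec], ignore_index=True)
print(combined)
```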

Horizontal join of tables

An operation in which columns from a second table are added to the first by key. It is often used to enrich a fact table (for example, a table of sales data) with reference data (for example, product prices).

There are several types of joins:
[Image: diagram of join types]

In the previously loaded titanic table we have a Sex column, which holds the passenger's gender code:

1 - female
2 - male

We also created a lookup table, gender. For a more convenient presentation of the data by passenger gender, we need to add the gender name from the gender lookup table to the titanic table.

Horizontal table join in R: tidyverse, dplyr

dplyr has a whole family of functions for horizontal joins:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The one I use most in practice is left_join().

As their first two arguments, the functions listed above take the two tables to join, and as the third argument, by, you specify the columns to join on.

Horizontal table join in R: dplyr

# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontal table join in R: data.table

In data.table, tables are joined by key with the merge() function.

Arguments of the merge() function in data.table

  • x, y - the tables to join
  • by - the key column for the join, if it has the same name in both tables
  • by.x, by.y - the names of the columns to join on, if they have different names in the tables
  • all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), and all.y corresponds to a RIGHT JOIN (keeps all rows of the second table).

Horizontal table join in R: data.table

# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)

Horizontal table join in Python: pandas

As in data.table, pandas uses the merge() function to join tables.

Arguments of the merge() function in pandas

  • how - the join type: left, right, outer, inner
  • on - the key column, if it has the same name in both tables
  • left_on, right_on - the names of the key columns, if they have different names in the tables

Horizontal table join in Python: pandas

# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
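
What a left join does with keys that have no match is easiest to see on a tiny example. In this sketch the passengers frame is invented, and the code 3 was added on purpose so that one row finds no match in the lookup table.

```python
import pandas as pd

# Invented miniature of the titanic table; code 3 has no entry in the lookup.
passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 3]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# how="left" keeps every row of the left table; unmatched rows get NaN.
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined)
```

With how="inner" the unmatched passenger would be dropped instead of being kept with a missing gender.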

Basic window functions and calculated columns

Window functions are similar in meaning to aggregate functions and are also often used in data analysis. But unlike aggregate functions, window functions do not change the number of rows of the resulting dataframe.

Essentially, with a window function we split the incoming dataframe into parts by some criterion, i.e. by the value of a field or of several fields, and perform arithmetic operations on each window. The result of these operations is returned in every row, i.e. without changing the total number of rows in the table.

For example, take the titanic table. We can calculate what percentage the cost of each ticket was within its cabin class.

To do this, we need to get, in each row, the total cost of tickets for the cabin class that the ticket in that row belongs to, and then divide the cost of each ticket by the total cost of all tickets of the same class.

Window functions in R: tidyverse, dplyr

To add new columns without collapsing rows into groups, dplyr uses the mutate() function.

You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field in a new column. Then ungroup the table and divide the values of the Fare field by the result of the previous step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm stays the same as in dplyr: we need to split the table into windows by the Pclass field, output in a new column the group total for each row, and add a column in which we calculate each ticket's share of the total cost in its group.

To add new columns, data.table provides the := operator. Below is an example of solving the problem with the data.table package.

Window functions in R: data.table

titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]

Window functions in Python: pandas

One way to add a new column in pandas is the assign() function. To sum the cost of tickets by cabin class without collapsing rows, we will use the transform() function.

Below is an example of a solution in which we add the same 2 columns to the titanic table.

Window functions in Python: pandas

titanic.assign(Pclass_cost      = titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
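
The same calculation on a three-row toy table makes the result easy to verify by hand. The dataframe toy and its fares are invented for this sketch.

```python
import pandas as pd

# Invented miniature of the titanic table.
toy = pd.DataFrame({"Pclass": [1, 1, 2], "Fare": [30.0, 10.0, 5.0]})

result = toy.assign(
    # transform("sum") returns the group total aligned to every original row
    Pclass_cost=toy.groupby("Pclass")["Fare"].transform("sum"),
    # the lambda sees the frame with Pclass_cost already added
    ticket_fare_rate=lambda x: x["Fare"] / x["Pclass_cost"],
)
print(result)
```

The number of rows stays at three, which is exactly what distinguishes a window calculation from an aggregation.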

Correspondence table of functions and methods

Below is a table of correspondence between the methods for performing various data operations in the packages we have covered.

| Description | tidyverse | data.table | pandas |
|---|---|---|---|
| Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv() |
| Creating dataframes | tibble() | data.table() | dict() + from_dict() |
| Selecting columns | select() | the j argument, second position in the square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes() |
| Filtering rows | filter() | the i argument, first position in the square brackets | write the filter conditions in square brackets / filter() |
| Grouping and aggregation | group_by() + summarise() | the j + by arguments | groupby() + agg() |
| Vertical union of tables (UNION) | bind_rows() | rbind() | concat() |
| Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge() |
| Basic window functions and adding calculated columns | group_by() + mutate() | the j argument with the := operator + the by argument | transform() + assign() |

Conclusion

I may not have described the most optimal implementations of data processing in this article, so I will be glad if you correct my mistakes in the comments or simply supplement the information given here with other techniques for working with data in R / Python.

As I wrote above, the purpose of the article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.

If you liked the article, I will be glad to have new subscribers on my YouTube and Telegram channels.

Poll

Which of the following packages do you use in your work?

In the comments you can write the reason for your choice.

Which data processing package do you use (you can select several options)

  • tidyverse: 45.2% (19 votes)
  • data.table: 33.3% (14 votes)
  • pandas: 54.8% (23 votes)

42 users voted. 9 users abstained.

Source: www.habr.com
