Which language should you choose for working with data: R or Python? Both! Migrating from pandas to tidyverse and data.table and back

If you search the Internet for R or Python, you will find millions of articles and kilometers of discussion about which one is better, faster and more convenient for working with data. Unfortunately, all of these articles and debates are of little practical use.

The purpose of this article is to compare the basic data-processing techniques in the most popular packages of both languages, and to help readers quickly master whichever one they do not know yet: those who write in Python will see how to do the same thing in R, and vice versa.

In the course of the article we will look at the syntax of the most popular packages in R: the packages included in the tidyverse library, and the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.

We will walk step by step through the whole data analysis path, from loading data to applying window functions, in both Python and R.

Contents

This article can be used as a cheat sheet if you have forgotten how to perform some data-processing operation in one of the packages under consideration.

  1. Main syntax differences between R and Python
    1.1. Accessing package functions
    1.2. Assignment
    1.3. Indexing
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Installing packages
  4. Loading data
  5. Creating dataframes
  6. Selecting the columns you need
  7. Filtering rows
  8. Grouping and aggregation
  9. Vertical union of tables (UNION)
  10. Horizontal join of tables (JOIN)
  11. Basic window functions and calculated columns
  12. Correspondence table between data-processing methods in R and Python
  13. Conclusion
  14. A short poll about which package you use

If you are interested in data analysis, you may find my Telegram and YouTube channels useful. Most of the content is devoted to the R language.

Main syntax differences between R and Python

To make it easier for you to switch from Python to R, or vice versa, I will point out a few key things you need to pay attention to.

Accessing package functions

Once a package is loaded into R, you do not need to specify the package name to access its functions. Prefixing every call with the package name is not customary in R, but it is allowed. You do not even have to load the package if you need just one of its functions in your code: simply call it by specifying both the package name and the function name. The separator between the package and function names in R is a double colon: package_name::function_name().

In Python, by contrast, it is standard practice to call a package's functions by explicitly specifying the package name. When a package is imported, it is usually given a short alias; for example, pandas is usually imported under the alias pd. A package function is accessed through a dot: package_name.function_name().
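
The two Python import styles described above can be sketched as follows. This is only an illustration: standard-library modules stand in for third-party packages so that the snippet runs without installing anything, and the variable names are made up for the example.

```python
# Standard-library modules are used here as stand-ins for packages
# such as pandas (an assumption made to keep the sketch dependency-free).
import statistics as st   # import a module under a short alias
from math import sqrt     # import a single name from a module

values = [2, 4, 4, 4, 5, 5, 7, 9]

# alias.function(): the usual Python style, similar to pd.read_csv()
mean_value = st.mean(values)

# bare function name: closest to calling a function in R after library()
root = sqrt(16)

print(mean_value, root)
```

The alias form keeps it obvious where each function comes from, which is why it is the more common convention in Python code.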

Assignment

In R, it is customary to use the arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mainly used to pass values to function arguments.

In Python, assignment is done only with a single equals sign: obj_name = value.

Indexing

There are significant differences here too. In R, indexing starts at one, and a range includes all of the specified elements, both endpoints included.

In Python, indexing starts at zero, and the selected range does not include the last element specified in the index. So the expression x[i:j] in Python will not include element j.

Negative indexing also differs. In R, x[-1] returns all elements of the vector except the first one (a negative index excludes the element at that position). In Python, the same expression returns only the last element.
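
A short sketch of the Python side of these rules, using a plain list with invented values; the comments note what the corresponding R expression would do.

```python
x = [10, 20, 30, 40, 50]

first = x[0]            # Python counts from zero; in R the first element is x[1]
middle = x[1:3]         # elements at positions 1 and 2; position 3 is excluded
last = x[-1]            # a negative index counts from the end
all_but_first = x[1:]   # what x[-1] would return in R

print(first, middle, last, all_but_first)
```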

Methods and OOP

R implements OOP in its own way; I wrote about this in the article "OOP in the R language (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. So, for example, for Excel users, moving to tidyverse will be easier than moving to pandas. This may be my subjective opinion, though.

In short, objects in R do not have methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions, which process objects differently depending on the object's class.
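
For readers coming from Python, a rough analogue of such a generic function is functools.singledispatch from the standard library: one function name, with the implementation chosen by the class of the first argument. The sketch below only illustrates the idea and is not a description of how S3 works internally; the function name describe is hypothetical.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    # Default implementation, used when no class-specific one is registered
    return f"object of type {type(obj).__name__}"


@describe.register
def _(obj: list):
    # Implementation chosen when the argument is a list
    return f"list with {len(obj)} elements"


@describe.register
def _(obj: dict):
    # Implementation chosen when the argument is a dict
    return f"dict with keys {sorted(obj)}"


print(describe([1, 2, 3]))
print(describe({"b": 1, "a": 2}))
print(describe(3.5))
```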

Pipelines

Perhaps this name is not quite accurate for pandas, but I will try to explain the idea.

To avoid saving intermediate calculations and producing unnecessary objects in the working environment, you can use a kind of pipeline, i.e. pass the result of one function directly to the next without saving the intermediate results.

Let's take the following code example, in which we store intermediate calculations in separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)

We performed 3 operations in sequence, and the result of each was stored in a separate object. But in fact we do not need these intermediate objects.

Or even worse, but more familiar to Excel users:

obj <- func3(func2(func1()))

In this case we did not store the intermediate results, but code with nested functions is very hard to read.

We will look at several approaches to data processing in R, and they perform such operations in different ways.

Pipelines in the tidyverse library are implemented by the %>% operator.

obj <- func1() %>% 
            func2() %>%
            func3()

So we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). Finally, we write the result of all the calculations into the object obj.

All of the above is illustrated better than words by this meme:
[Image: meme]

In data.table, chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each pair of square brackets you can use the result of the previous operation.

In pandas, such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is saved in the object obj.
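
As a concrete illustration of such a chain, here is a minimal sketch on a small made-up dataframe (the data and column values are invented for the example and are not one of the article's datasets). Wrapping the chain in parentheses is one common way to split it over several lines.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 3, 7, 5]})

# Each method returns a new DataFrame, so the calls can be chained
# without storing intermediate objects.
obj = (
    df.query("sessions >= 5")                    # keep rows with 5+ sessions
      .sort_values("sessions", ascending=False)  # largest first
      .reset_index(drop=True)                    # renumber the rows
)

print(obj)
```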

Data structures

Data structures in R and Python are similar, but they have different names.

Description | Name in R | Name in Python/pandas
----------- | --------- | ---------------------
Tabular structure | data.frame, data.table, tibble | DataFrame
One-dimensional list of values | Vector | Series in pandas, or list in pure Python
Multi-level non-tabular structure | List | Dictionary (dict)

We will look at some other features and syntax differences below.

A few words about the packages we will use

First, I will tell you a little about the packages you will get to know in this article.

tidyverse

Website: tidyverse.org

The tidyverse library was written by Hadley Wickham, Chief Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.

The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example, dplyr was created for data manipulation, tidyr for bringing data into a tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage of tidyverse is its simple, easy-to-read syntax, which in many ways resembles the SQL query language.

data.table

Website: r-datatable.com

The author of data.table is Matt Dowle of H2O.ai.

The first release of the library took place in 2006.

The package syntax is not as convenient as in tidyverse and is more reminiscent of classic dataframes in R, but at the same time its functionality is significantly extended.

All manipulations with a table in this package are described in square brackets, and if you translate the data.table syntax into SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed with which it processes large amounts of data.

pandas

Website: pandas.pydata.org

The name of the library comes from the econometric term "panel data", which is used to describe multidimensional structured data sets.

The author of pandas is the American Wes McKinney.

When it comes to data analysis in Python, pandas has no equal. It is a very multifunctional, high-level package that lets you perform any manipulation with data, from loading data from any source to visualizing it.

Installing additional packages

The packages discussed in this article are not included in the base R and Python distributions. There is one small caveat: if you installed the Anaconda distribution, you do not need to install pandas separately.

Installing packages in R

If you have opened the RStudio development environment at least once, you probably already know how to install the required package in R. To install packages, use the standard install.packages() command, running it directly in R itself.

# install the packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, the packages need to be attached, which in most cases is done with the library() command.

# attach (import) the packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

So, if you installed pure Python, you need to install pandas manually. Open the command line or terminal, depending on your operating system, and enter the following command.

pip install pandas

Then we return to Python and import the installed package with the import command.

import pandas as pd

Loading data

Getting the data is one of the most important steps in data analysis. Both Python and R give you ample opportunities to obtain data from any source: local files, files from the Internet, websites, and all kinds of databases.

Throughout the article we will use several datasets:

  1. Two exports from Google Analytics.
  2. The Titanic passenger dataset.

All the data is in my GitHub repository in the form of csv and tsv files. That is where we will request it from.

Loading data into R: tidyverse, vroom, readr

There are two packages for loading data in the tidyverse library: vroom and readr. vroom is more modern, but in the future the packages may be merged.

A quote from the vroom documentation:

vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.

In this article we will look at both data loading packages:

Loading data into R: the vroom package

# install.packages("vroom")
library(vroom)

# Read the data
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data into R: readr

# install.packages("readr")
library(readr)

# Read the data
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the vroom package, loading is done with the same function, vroom(), regardless of whether the data is in csv or tsv format. In the readr package we use a different function for each format: read_tsv() and read_csv().

Loading data into R: data.table

In data.table, the fread() function is used to load data.

Loading data into R: the data.table package

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in Python: pandas

Compared with the R packages, the closest in syntax to pandas here is readr, because pandas can request data from anywhere and has a whole family of read_*() functions.

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

There are also many other functions designed to read data from various formats. But for our purposes read_table() or read_csv() is enough, using the sep argument to specify the column separator.

Loading data in Python: pandas

import pandas as pd

# the Google Analytics exports are tab-separated, so sep is set to "\t"
ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep = "\t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep = "\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")

Creating dataframes

The titanic table that we loaded has a Sex field, which stores the passenger's gender identifier.

But for a more convenient presentation of the data by passenger gender, it is better to use the gender name rather than the gender code.

To do this, we will create a small lookup table with only 2 columns (the code and the gender name) and, accordingly, 2 rows.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the desired dataframe using the tibble() function.

Creating a dataframe in R: dplyr

## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                    gender = c("female", "male"))

Creating a dataframe in Python: pandas

In pandas, a dataframe is created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# create a dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)

Selecting columns

The tables you work with may contain dozens or even hundreds of columns of data. But to perform an analysis, as a rule, you do not need all the columns available in the source table.

Therefore, one of the first operations you will perform on the source table is to clear it of unnecessary information and free up the memory that this information occupies.

Selecting columns in R: tidyverse, dplyr

The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL, you will master this package quickly.

To select columns, use the select() function.

Below are code examples that select columns in the following ways:

  • By listing the names of the required columns
  • By matching column names with regular expressions
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Select the required columns
## dplyr
### select by column names
select(ga_nov, date, source, sessions)
### exclude by column names
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end in s
select(ga_nov, matches("s$"))
### select by condition: integer columns only
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations are performed somewhat differently in data.table. At the beginning of the article I described what the arguments inside the square brackets in data.table are.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping

Selecting columns in R: data.table

## data.table
### select by column names
ga_nov[ , .(date, source, sessions) ]
### exclude by column names
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]

The .SD variable gives you access to all columns, and .SDcols filters the required columns using regular expressions or other functions that filter the names of the columns you need.

Selecting columns in Python: pandas

To select columns by name in pandas, it is enough to pass a list of their names. To exclude columns by name, or to select them by regular expression, use the drop() and filter() methods with the argument axis=1, which indicates that columns rather than rows should be processed.

To select fields by data type, use the select_dtypes() function, passing to its include or exclude argument a list of the data types of the fields you need to select.

Selecting columns in Python: pandas

# Select fields by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric fields
ga_nov.select_dtypes(include=['number'])
# Select text fields
ga_nov.select_dtypes(include=['object'])

Filtering rows

For example, the source table may contain data for several years, while you only need to analyze the last month. Again, the extra rows will slow down data processing and clog up your PC's memory.

Filtering rows in R: tidyverse, dplyr

In dplyr, the filter() function is used to filter rows. It takes a dataframe as its first argument, after which you list the filter conditions.

When writing logical expressions to filter a table, specify the column names without quotes and without the table name.

When using several logical expressions for filtering, use the following operators:

  • & or a comma - logical AND
  • | - logical OR

Filtering rows in R: dplyr

# filter rows
## dplyr
### filter rows by one condition
filter(ga_nov, source == "google")
### filter by two conditions joined by logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined by logical OR
filter(ga_nov, source == "google" | sessions >= 10)

Filtering rows in R: data.table

As I already wrote above, in data.table the data transformation syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping

The argument i, which occupies the first position in the square brackets, is used to filter rows.

Columns are referenced in logical expressions without quotation marks and without specifying the table name.

Logical expressions are combined in the same way as in dplyr, with the & and | operators.

Filtering rows in R: data.table

## data.table
### filter rows by one condition
ga_nov[source == "google"]
### filter by two conditions joined by logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined by logical OR
ga_nov[source == "google" | sessions >= 10]

Filtering rows in Python: pandas

Filtering rows in pandas is similar to filtering in data.table and is done in square brackets.

In this case, columns must be accessed through the dataframe name; the column name can then be given either in quotes inside square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).

If you need to filter a dataframe by several conditions, each condition must be enclosed in parentheses. Logical conditions are combined with the & and | operators.

Filtering rows in Python: pandas

# Filter the rows of a table
### filter rows by one condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined by logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined by logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
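
The same filters can be tried on a small made-up table without downloading anything. The sketch below uses invented values in place of ga_nov, and also shows query(), a pandas method not used elsewhere in this article, which accepts bare column names much like dplyr and data.table do.

```python
import pandas as pd

# Toy stand-in for ga_nov; the values are invented for illustration
ga = pd.DataFrame({"source": ["google", "bing", "google", "direct"],
                   "sessions": [12, 30, 7, 5]})

# Boolean mask: each condition goes in its own parentheses
mask_and = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]

# The same filter written with query(), using bare column names
query_and = ga.query("source == 'google' and sessions >= 10")

# Logical OR of the two conditions
mask_or = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

print(mask_and)
print(mask_or)
```

The parentheses matter because & and | bind more tightly than the comparison operators in Python, so leaving them out changes the meaning of the expression.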

Grouping and aggregating data

One of the most frequently used operations in data analysis is grouping and aggregation.

The syntax for performing these operations differs across all the packages we are reviewing.

In this case we will take the titanic dataframe as an example and calculate the number and the average price of tickets by cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

In dplyr, the group_by() function is used for grouping and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go into that jungle.

Basic aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • length() - count

Grouping and aggregation in R: dplyr

## dplyr
### group and aggregate rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

We passed the titanic table to group_by() as the first argument and then specified the Pclass field, by which the table is grouped. The %>% operator passes the result of this operation as the first argument to summarise(), where we added 2 more fields: passangers and avg_price. In the first, the length() function counts the number of tickets; in the second, the mean() function gives the average ticket price.

Grouping and aggregating data in R: data.table

In data.table, the j argument, which occupies the second position in the square brackets, is used for aggregation, and by or keyby, which occupy the third position, for grouping.

The list of aggregation functions here is identical to the one described for dplyr, because these are functions from base R.

Grouping and aggregation in R: data.table

## data.table
### group and aggregate rows
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and aggregating data in Python: pandas

Grouping in pandas is similar to dplyr, but aggregation resembles neither dplyr nor data.table.

To group, use the groupby() method, to which you need to pass a list of the columns by which the dataframe will be grouped.

To aggregate, you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions will be applied, and the values are the names of the aggregation functions.

Aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • count() - count

The reset_index() function in the example below is used to reset the nested index that pandas applies by default after aggregating data.

The \ symbol lets you continue the statement on the next line.

Grouping and aggregation in Python: pandas

# group and aggregate the data
titanic.groupby(["Pclass"]).\
    agg({'PassengerId': 'count', 'Fare': 'mean'}).\
        reset_index()
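
If you want the result columns to carry the same names as in the R examples, pandas also supports "named aggregation" (available since pandas 0.25). The sketch below is an alternative to the dictionary form above, not the method used in the rest of this article, and it runs on a small invented table rather than the real Titanic data.

```python
import pandas as pd

# Toy stand-in for the titanic table; the values are invented
titanic_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass":      [1, 1, 2, 3, 3],
    "Fare":        [80.0, 60.0, 20.0, 8.0, 12.0],
})

# new_column=(source_column, aggregation) names each output column;
# as_index=False keeps Pclass as an ordinary column, so no reset_index()
summary = (
    titanic_toy.groupby("Pclass", as_index=False)
               .agg(passengers=("PassengerId", "count"),
                    avg_price=("Fare", "mean"))
)

print(summary)
```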

Vertical joining of tables

An operation in which you combine two or more tables with the same structure. The data we loaded contains the tables ga_nov and ga_dec. These tables have an identical structure, i.e. they have the same columns and the same data types in those columns.

These are exports from Google Analytics for November and December; in this section we will combine this data into one table.

Vertical joining of tables in R: tidyverse, dplyr

In dplyr, you can combine 2 tables into one using the bind_rows() function, passing the tables as its arguments.

Vertical joining of tables in R: dplyr

# Vertical joining of tables
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical joining of tables in R: data.table

There is nothing complicated here either; we use rbind().

Vertical joining of tables in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical joining of tables in Python: pandas

In pandas, the concat() function is used to combine tables; you need to pass it a list of the dataframes to combine.

Vertical joining of tables in Python: pandas

# vertical joining of tables
pd.concat([ga_nov, ga_dec])
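
One detail worth knowing: by default concat() keeps the original row labels of each table, so the combined index contains repeats. A minimal sketch on two invented two-row tables (stand-ins for ga_nov and ga_dec) shows the difference that the ignore_index argument makes.

```python
import pandas as pd

# Toy stand-ins for the November and December exports; values are invented
nov = pd.DataFrame({"date": ["2019-11-01", "2019-11-02"], "sessions": [10, 12]})
dec = pd.DataFrame({"date": ["2019-12-01", "2019-12-02"], "sessions": [8, 15]})

# Default behaviour: row labels are carried over from each table
stacked = pd.concat([nov, dec])

# ignore_index=True renumbers the rows of the combined table from zero
renumbered = pd.concat([nov, dec], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```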

Horizontal joining of tables

An operation in which columns from a second table are added to the first table by key. It is often used to enrich a fact table (for example, a table of sales data) with reference data (for example, product prices).

There are several types of joins:

[Image: diagram of join types]

In the previously loaded titanic table we have a Sex column, which holds the passenger's gender code:

1 - female
2 - male

We also created a lookup table, gender. For a more convenient presentation of the data on passengers' gender, we need to add the gender name from the gender lookup table to the titanic table.

Horizontal joining of tables in R: tidyverse, dplyr

dplyr has a whole family of functions for horizontal joins:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The one I use most often in my work is left_join().

As their first two arguments, the functions listed above take the two tables to join, and in the third argument, by, you must specify the columns to join on.

Horizontal joining of tables in R: dplyr

# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontal joining of tables in R: data.table

In data.table, you join tables by key using the merge() function.

Arguments of the merge() function in data.table

  • x, y - the tables to join
  • by - the column that is the join key, if it has the same name in both tables
  • by.x, by.y - the names of the columns to join on, if they have different names in the two tables
  • all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), all.y corresponds to a RIGHT JOIN (keeps all rows of the second table).

Horizontal joining of tables in R: data.table

# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)

Horizontal joining of tables in Python: pandas

As in data.table, in pandas the merge() function is used to join tables.

Arguments of the merge() function in pandas

  • how - the join type: left, right, outer, inner
  • on - the key column, if it has the same name in both tables
  • left_on, right_on - the names of the key columns, if they have different names in the two tables

Horizontal joining of tables in Python: pandas

# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
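
To see what the left join produces, here is a minimal sketch on a three-row invented passenger table joined to the same gender lookup table. Dropping the id column afterwards is an optional extra step added for the example: after the merge it only duplicates Sex.

```python
import pandas as pd

# Toy stand-in for the titanic table; the values are invented
passengers = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 1]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

# Left join: every passenger row is kept, gender names are looked up by key
joined = passengers.merge(gender, how="left", left_on="Sex", right_on="id")

# The key column from the lookup table is now redundant
joined = joined.drop(columns="id")

print(joined)
```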

Basic window functions and calculated columns

Window functions are similar in meaning to aggregation functions and are also often used in data analysis. But unlike aggregation functions, window functions do not change the number of rows in the resulting dataframe.

Essentially, with a window function we split the incoming dataframe into parts by some criterion, i.e. by the value of one field or of several fields, and perform arithmetic operations on each window. The result of these operations is returned in every row, i.e. without changing the total number of rows in the table.

For example, let's take the titanic table. We can calculate what share of its cabin class's total fare each ticket accounts for.

To do this, in every row we need the total ticket cost for the cabin class to which that row's ticket belongs, and then we divide the price of each ticket by the total cost of all tickets in the same cabin class.

Window functions in R: tidyverse, dplyr

To add new columns without grouping rows, dplyr provides the mutate() function.

You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field into a new column. Then ungroup the table and divide the values of the Fare field by the result of the previous step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm is the same as in dplyr: we need to split the table into windows by the Pclass field, output in a new column the total for the group that each row belongs to, and add a column in which we calculate the share of each ticket's price within its group.

To add new columns, data.table has the := operator. Below is an example of solving the problem with the data.table package.

Window functions in R: data.table

# Pclass_cost does not exist yet while j is being evaluated,
# so the group total sum(Fare) is repeated in the second expression
titanic[,c("Pclass_cost","ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]

Window functions in Python: pandas

One way to add a new column in pandas is to use the assign() function. To sum ticket prices by cabin class without collapsing the rows, we will use the transform() function.

Below is an example solution in which we add the same 2 columns to the titanic table.

Window functions in Python: pandas

titanic.assign(Pclass_cost      =  titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
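
The behaviour of transform() is easiest to check on a tiny table where the group totals can be verified by hand. The sketch below uses invented fares; note that the row count stays the same and that the shares within each class add up to 1.

```python
import pandas as pd

# Toy stand-in for the titanic table; the values are invented
toy = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3],
                    "Fare":   [80.0, 20.0, 30.0, 5.0, 15.0]})

result = toy.assign(
    # group total repeated on every row of the group (no rows are collapsed)
    Pclass_cost=toy.groupby("Pclass")["Fare"].transform("sum"),
    # the lambda receives the dataframe that already contains Pclass_cost
    ticket_fare_rate=lambda d: d["Fare"] / d["Pclass_cost"],
)

print(result)
```

The second column is written as a lambda because assign() evaluates its arguments in order, so the lambda can refer to the Pclass_cost column created just before it.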

Correspondence table of functions and methods

Below is a table of correspondence between the ways of performing various data operations in the packages we have considered.

Description | tidyverse | data.table | pandas
----------- | --------- | ---------- | ------
Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv()
Creating dataframes | tibble() | data.table() | dict() + from_dict()
Selecting columns | select() | argument j, second position in square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes()
Filtering rows | filter() | argument i, first position in square brackets | list the filter conditions in square brackets / filter()
Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg()
Vertical union of tables (UNION) | bind_rows() | rbind() | concat()
Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge()
Basic window functions and adding calculated columns | group_by() + mutate() | argument j with the := operator + argument by | transform() + assign()

Conclusion

Perhaps the data-processing implementations I described in this article are not the most optimal ones, so I will be glad if you correct my mistakes in the comments, or simply supplement the information given in the article with other techniques for working with data in R / Python.

As I wrote above, the purpose of the article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.

If you liked the article, I will be glad to have new subscribers on my YouTube and Telegram channels.

Poll

Which of these packages do you use in your work?

In the comments you can write the reason for your choice.

Which data-processing package do you use (you can choose several options)

  • 45.2% tidyverse (19)
  • 33.3% data.table (14)
  • 54.8% pandas (23)

42 users voted. 9 users abstained.

source: www.habr.com
