Chapisho hili litawavutia wale wanaotumia jedwali la maktaba ya kuchakata data ya jedwali la R na wanaweza kufurahishwa kuona ubadilikaji wa matumizi yake katika mifano mbalimbali.
Imeongozwa na mfano mzuri , na nikitumai kuwa tayari umesoma nakala yake, ninapendekeza kuchimba zaidi katika uboreshaji wa nambari na utendaji kulingana na data.meza.
Utangulizi: Jedwali la data linatoka wapi?
Ni bora kuanza kufahamiana na maktaba kutoka mbali kidogo, yaani, na miundo ya data ambayo kitu cha meza ya data (hapa, DT) kinaweza kupatikana.
Mpangilio
Kanuni
## arrays ---------
arrmatr <- array(1:20, c(4,5))
class(arrmatr)
typeof(arrmatr)
is.array(arrmatr)
is.matrix(arrmatr)
Moja ya miundo kama hii ni safu (?msingi::safu) Kama ilivyo kwa lugha zingine, safu hapa ni za pande nyingi. Walakini, kinachovutia ni kwamba, kwa mfano, safu ya pande mbili huanza kurithi mali kutoka kwa darasa la matrix. (?msingi::matrix), na safu ya sura moja, ambayo pia ni muhimu, hairithi kutoka kwa vekta (?msingi::vekta).
Ni muhimu kuelewa kwamba aina ya data iliyo katika kitu chochote inapaswa kuchunguzwa na kazi msingi::aina ya, ambayo inarudisha maelezo ya ndani ya aina kulingana na R Ndani - itifaki ya lugha ya kawaida inayohusishwa na asili C.
Amri nyingine ya kuamua darasa la kitu, msingi:: darasa, katika kesi ya vectors, inarudi aina ya vector (ina jina tofauti kutoka kwa ndani, lakini pia inakuwezesha kuelewa aina ya data).
Orodha ya
Kutoka kwa safu ya pande mbili, pia inajulikana kama matrix, unaweza kwenda kwenye orodha (?msingi::orodha).
Kanuni
## lists ------------------
mylist <- as.list(arrmatr)
is.vector(mylist)
is.list(mylist)
Mambo kadhaa hutokea mara moja:
- Mwelekeo wa pili wa matrix huanguka, yaani, tunapata orodha na vector kwa wakati mmoja.
- Orodha kwa hivyo hurithi kutoka kwa madarasa haya. Kumbuka kwamba kila kipengele cha orodha kitalingana na thamani moja (ya ukubwa) kutoka kwa seli ya mkusanyiko wa matriki.
Kwa kuwa orodha pia ni vekta, kazi zingine za vekta zinaweza kutumika kwake.
Dataframe
Kutoka kwa orodha, matrix au vekta unaweza kwenda kwa sura ya data (?msingi::data.frame).
Kanuni
## data.frames ------------
df <- as.data.frame(arrmatr)
df2 <- as.data.frame(mylist)
is.list(df)
df$V6 <- df$V1 + df$V2
Kinachofurahisha juu yake ni kwamba mfumo wa data hurithi kutoka kwa orodha! Safu wima za mfumo wa data ni visanduku vya orodha. Hili litakuwa muhimu baadaye tunapotumia vipengele vinavyotumika kwenye orodha.
data.meza
Pata DT (?meza.data::meza.data) inaweza kutoka sura ya data, orodha, vekta, au tumbo. Kwa mfano, kama hii (mahali).
Kanuni
## data.tables -----------------------
library(data.table)
data.table::setDT(df)
is.list(df)
is.data.frame(df)
is.data.table(df)
Kinachofaa ni kwamba, kama mfumo wa data, DT hurithi sifa za orodha.
DT na kumbukumbu
Tofauti na vitu vingine vyote kwenye msingi wa R, DT hupitishwa kwa kumbukumbu. Ikiwa unahitaji kuzinakili kwenye eneo jipya la kumbukumbu, unahitaji chaguo la kukokotoa data.meza::nakala au unahitaji kufanya uteuzi kutoka kwa kitu cha zamani.
Kanuni
df2 <- df
df[V1 == 1, V2 := 999]
data.table::fsetdiff(df, df2)
df2 <- data.table::copy(df)
df[V1 == 2, V2 := 999]
data.table::fsetdiff(df, df2)
Hii inahitimisha utangulizi. DT ni mwendelezo wa ukuzaji wa miundo ya data katika R, haswa kupitia upanuzi na kuongeza kasi ya shughuli zinazofanywa kwenye vitu vya mfumo wa data. Wakati wa kudumisha urithi kutoka kwa primitives zingine.
Baadhi ya mifano ya kutumia sifa za data.table
Kama orodha…
Kurudia safu za safu ya data au DT sio wazo bora, kwani nambari ya kitanzi iko katika lugha. R polepole zaidi C, lakini kupiga kitanzi kupitia nguzo, ambazo kwa kawaida ni chache zaidi, kunawezekana kabisa. Unapopitia safu wima, kumbuka kwamba kila safu ni kipengele cha orodha, kwa kawaida huwa na vekta. Na utendakazi kwenye vekta umewekewa vekta vizuri katika kazi za msingi za lugha. Unaweza pia kutumia waendeshaji uteuzi asili ya orodha na vekta: `[[`, `$`.
Kanuni
## operations on data.tables ------------
#using list properties
df$'V1'[1]
df[['V1']]
df[[1]][1]
sapply(df, class)
sapply(df, function(x) sum(is.na(x)))
Vectorization
Ikiwa unahitaji kurudia kupitia safu za DT kubwa, suluhisho bora ni kuandika kazi ya vectorized. Lakini ikiwa hiyo haifanyi kazi, kumbuka kwamba kitanzi ndani DT bado ina kasi zaidi kuliko mzunguko ndani R, kwani inatekelezwa C.
Wacha tuijaribu kwa mfano mkubwa na safu mlalo 100K. Tutatoa herufi ya kwanza kutoka kwa maneno yaliyojumuishwa kwenye vekta ya safu wima. w.
Updated
Kanuni
library(magrittr)
library(microbenchmark)
## Bigger example ----
rown <- 100000
dt <-
data.table(
w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
# vectorization
microbenchmark({
dt[
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
, by = 1:nrow(dt)
]
})
# second
first_l_f <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
do.call(rbind, .) %>%
`[`(,1)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f(w))
]
})
# third
first_l_f2 <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
unlist %>%
matrix(nrow = 3) %>%
`[`(1,)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f2(w))
]
})
Kwanza endesha na kurudia juu ya safu:
Kitengo: milliseconds
expr dakika
{dt[, `:=`(first_l, unlist(strsplit(w, split = " ", fixed = T))[1]), by = 1:nrow(dt)] } 439.6217
lq maana ya wastani uq max neval
451.9998 460.1593 456.2505 460.9147 621.4042 100
Uendeshaji wa pili, ambapo vectorization hutokea kwa kubadilisha orodha kwenye tumbo na kuchukua vipengele vya kipande na index 1 (mwisho ni vectorization halisi). Marekebisho: vectorization katika ngazi ya kazi. strsplit, ambayo inaweza kukubali vekta kama pembejeo. Inabadilika kuwa utaratibu wa kubadilisha orodha kuwa matrix ni ngumu zaidi kuliko vectorization yenyewe, lakini hata katika kesi hii, ni haraka sana kuliko toleo lisilo la vector.
Kitengo: milliseconds
expr min lq maana wastani uq max neval
{dt[, `:=`(first_l, .(first_l_f(w)))] } 93.07916 112.1381 161.9267 149.6863 185.9893 442.5199 100
Kuongeza kasi kwa wastani katika mara 3.
Uendeshaji wa tatu, ambapo mpango wa ubadilishaji wa matrix ulibadilishwa.
Kitengo: milliseconds
expr min lq maana wastani uq max neval
{dt[, `:=`(first_l, .(first_l_f2(w)))] } 32.60481 34.13679 40.4544 35.57115 42.11975 222.972 100
Kuongeza kasi kwa wastani katika mara 13.
Unahitaji kujaribu na jambo hili, bora zaidi.
Huu hapa ni mfano mwingine wa uwekaji vekta, pia wenye maandishi, lakini karibu na hali halisi ya ulimwengu: urefu tofauti wa maneno na hesabu tofauti za maneno. Lengo ni kutoa maneno matatu ya kwanza. Kama hii:

Chaguo za kukokotoa za awali hazifanyi kazi hapa kwa sababu vekta ni za urefu tofauti, na tulibainisha ukubwa wa matrix. Hebu tulifanyie kazi upya hili kwa kuchimba mtandaoni.
Kanuni
# fourth
rown <- 100000
words <-
sapply(
seq_len(rown)
, function(x){
nwords <- rbinom(1, 10, 0.5)
paste(
sapply(
seq_len(nwords)
, function(x){
paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
}
)
, collapse = ' '
)
}
)
dt <-
data.table(
w = words
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
first_l_f3 <- function(sd, n)
{
l <- strsplit(sd, split = ' ', fixed = T)
maxl <- max(lengths(l))
sapply(l, "length<-", maxl) %>%
`[`(n,) %>%
as.character
}
microbenchmark({
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
})
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
Kitengo: milliseconds
expr min lq maana ya wastani
{dt[, `:=`((bandika0(“w_”, 1:3)), strsplit(w, split = " ", fasta = T))] } 851.7623 916.071 1054.5 1035.199
uq max neval
1178.738 1356.816 100
Hati ilienda kwa kasi ya wastani ya sekunde 1. Sio mbaya.
Imeunganishwa na mnyororo mmoja…
Unaweza kufanya kazi na vitu vya DT kwa kutumia minyororo. Hii inaonekana kama kufunga sintaksia ya mabano upande wa kulia, kimsingi mipako ya sukari.
Kanuni
# chaining
res1 <- dt[a == 'a'][sample(.N, 100)]
res2 <- dt[, .N, a][, N]
res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]
Inavuja kupitia mabomba...
Shughuli sawa zinaweza kufanywa kwa kutumia mabomba; inaonekana sawa, lakini ni tajiri zaidi kiutendaji, kwani njia zozote zinaweza kutumika, sio tu DT. Wacha tupate hesabu za urekebishaji wa vifaa kwa data yetu ya syntetisk na vichujio kadhaa vya DT.
Kanuni
# piping
samplpe_b <- dt[a %in% head(letters), sample(b, 1)]
res4 <-
dt %>%
.[a %in% head(letters)] %>%
.[,
{
dt0 <- .SD[1:100]
quants <-
dt0[, c] %>%
quantile(seq(0.1, 1, 0.1), na.rm = T)
.(q = quants)
}
, .(cond = b > samplpe_b)
] %>%
glm(
cond ~ q -1
, family = binomial(link = "logit")
, data = .
) %>%
summary %>%
.[[12]]
Takwimu, kujifunza kwa mashine, na zaidi ndani ya DT
Vipengele vya kukokotoa vya Lambda vinaweza kutumika, lakini wakati mwingine ni bora kuziunda kando, kuandika bomba zima la uchanganuzi wa data, na ukienda zako—zinafanya kazi ndani ya DT. Mfano huo umeimarishwa na vipengele vyote vilivyotajwa hapo juu, pamoja na mambo kadhaa muhimu kutoka kwa arsenal ya DT (kama vile kufikia DT yenyewe ndani ya DT kupitia kiungo, wakati mwingine huingizwa nje ya mlolongo, lakini ili tu kuwa na uhakika).
Kanuni
# function
rm(lm_preds)
lm_preds <- function(
sd, by, n
)
{
if(
n < 100 |
!by[['a']] %in% head(letters, 4)
)
{
res <-
list(
low = NA
, mean = NA
, high = NA
, coefs = NA
)
} else {
lmm <-
lm(
d ~ c + b
, data = sd
)
preds <-
stats::predict.lm(
lmm
, sd
, interval = "prediction"
)
res <-
list(
low = preds[, 2]
, mean = preds[, 1]
, high = preds[, 3]
, coefs = coefficients(lmm)
)
}
res
}
res5 <-
dt %>%
.[e < 0] %>%
.[.[, .I[b > 0]]] %>%
.[, `:=` (
low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
)
, a
] %>%
.[!is.na(mean), -'e', with = F]
# plot
plo <-
res5 %>%
ggplot +
facet_wrap(~ a) +
geom_ribbon(
aes(
x = c * coef_c + b * coef_b + coef_int
, ymin = low
, ymax = high
, fill = a
)
, size = 0.1
, alpha = 0.1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = mean
, color = a
)
, size = 1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = d
)
, size = 1
, color = 'black'
) +
theme_minimal()
print(plo)
Hitimisho
Natumai nimeweza kuunda picha ya kina, ingawa haijakamilika, ya kitu cha data.table, kutoka kwa sifa zake zinazohusiana na urithi kutoka kwa madarasa ya R hadi sifa zake na mazingira ya vipengele safi. Natumai hii itakusaidia kuelewa vyema na kutumia maktaba hii kwenye kazi yako. burudani.

Asante!
Msimbo kamili
Kanuni
## load libs ----------------
library(data.table)
library(ggplot2)
library(magrittr)
library(microbenchmark)
## arrays ---------
arrmatr <- array(1:20, c(4,5))
class(arrmatr)
typeof(arrmatr)
is.array(arrmatr)
is.matrix(arrmatr)
## lists ------------------
mylist <- as.list(arrmatr)
is.vector(mylist)
is.list(mylist)
## data.frames ------------
df <- as.data.frame(arrmatr)
is.list(df)
df$V6 <- df$V1 + df$V2
## data.tables -----------------------
data.table::setDT(df)
is.list(df)
is.data.frame(df)
is.data.table(df)
df2 <- df
df[V1 == 1, V2 := 999]
data.table::fsetdiff(df, df2)
df2 <- data.table::copy(df)
df[V1 == 2, V2 := 999]
data.table::fsetdiff(df, df2)
## operations on data.tables ------------
#using list properties
df$'V1'[1]
df[['V1']]
df[[1]][1]
sapply(df, class)
sapply(df, function(x) sum(is.na(x)))
## Bigger example ----
rown <- 100000
dt <-
data.table(
w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
# vectorization
# zero - for loop
microbenchmark({
for(i in 1:nrow(dt))
{
dt[
i
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
]
}
})
# first
microbenchmark({
dt[
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
, by = 1:nrow(dt)
]
})
# second
first_l_f <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
do.call(rbind, .) %>%
`[`(,1)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f(w))
]
})
# third
first_l_f2 <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
unlist %>%
matrix(nrow = 3) %>%
`[`(1,)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f2(w))
]
})
# fourth
rown <- 100000
words <-
sapply(
seq_len(rown)
, function(x){
nwords <- rbinom(1, 10, 0.5)
paste(
sapply(
seq_len(nwords)
, function(x){
paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
}
)
, collapse = ' '
)
}
)
dt <-
data.table(
w = words
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
first_l_f3 <- function(sd, n)
{
l <- strsplit(sd, split = ' ', fixed = T)
maxl <- max(lengths(l))
sapply(l, "length<-", maxl) %>%
`[`(n,) %>%
as.character
}
microbenchmark({
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
})
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
# chaining
res1 <- dt[a == 'a'][sample(.N, 100)]
res2 <- dt[, .N, a][, N]
res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]
# piping
samplpe_b <- dt[a %in% head(letters), sample(b, 1)]
res4 <-
dt %>%
.[a %in% head(letters)] %>%
.[,
{
dt0 <- .SD[1:100]
quants <-
dt0[, c] %>%
quantile(seq(0.1, 1, 0.1), na.rm = T)
.(q = quants)
}
, .(cond = b > samplpe_b)
] %>%
glm(
cond ~ q -1
, family = binomial(link = "logit")
, data = .
) %>%
summary %>%
.[[12]]
# function
rm(lm_preds)
lm_preds <- function(
sd, by, n
)
{
if(
n < 100 |
!by[['a']] %in% head(letters, 4)
)
{
res <-
list(
low = NA
, mean = NA
, high = NA
, coefs = NA
)
} else {
lmm <-
lm(
d ~ c + b
, data = sd
)
preds <-
stats::predict.lm(
lmm
, sd
, interval = "prediction"
)
res <-
list(
low = preds[, 2]
, mean = preds[, 1]
, high = preds[, 3]
, coefs = coefficients(lmm)
)
}
res
}
res5 <-
dt %>%
.[e < 0] %>%
.[.[, .I[b > 0]]] %>%
.[, `:=` (
low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
)
, a
] %>%
.[!is.na(mean), -'e', with = F]
# plot
plo <-
res5 %>%
ggplot +
facet_wrap(~ a) +
geom_ribbon(
aes(
x = c * coef_c + b * coef_b + coef_int
, ymin = low
, ymax = high
, fill = a
)
, size = 0.1
, alpha = 0.1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = mean
, color = a
)
, size = 1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = d
)
, size = 1
, color = 'black'
) +
theme_minimal()
print(plo)
Chanzo: mapenzi.com
