This note will be of interest to anyone who uses data.table, the R library for processing tabular data, and who might enjoy seeing how flexibly it can be applied across a range of examples.
Inspired by a colleague's good example, and hoping that you have already read his article, I suggest digging deeper into code optimization and performance based on data.table.
Introduction: where does data.table come from?
The best way to get to know the library is to start from a little distance, namely with the data structures from which a data.table object (hereafter DT) can be obtained.
Array
Code
## arrays ---------
arrmatr <- array(1:20, c(4,5))
class(arrmatr)
typeof(arrmatr)
is.array(arrmatr)
is.matrix(arrmatr)
One of these structures is the array (?base::array). As in other languages, arrays here are multidimensional. What is interesting, though, is that a two-dimensional array, for example, starts to inherit properties from the matrix class (?base::matrix), while a one-dimensional array, which is also important, does not inherit from vector (?base::vector).
Keep in mind that the type of the data contained in an object should be checked with base::typeof, which returns the internal type description according to R Internals, the language's general protocol tied to its C origins.
Another command for determining an object's class, base::class, returns the vector type in the case of vectors (it differs from the internal name, but also lets you understand the data type).
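The checks described above can be tried on tiny objects. This is a minimal base R sketch; the names `arr1`, `arr2`, and `x` are illustrative and not taken from the article.

```r
# Illustrative objects (names are assumptions, not from the article)
arr2 <- array(1:20, c(4, 5))   # two-dimensional array
arr1 <- array(1:5)             # one-dimensional array
x    <- 1.5

is.matrix(arr2)    # TRUE: a 2-d array is treated as a matrix
is.vector(arr1)    # FALSE: the dim attribute means it is not a plain vector
typeof(x)          # "double": the internal storage type
class(x)           # "numeric": the class name differs from the internal type
```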
List
From a two-dimensional array, that is, a matrix, you can move on to a list (?base::list).
Code
## lists ------------------
mylist <- as.list(arrmatr)
is.vector(mylist)
is.list(mylist)
Several things happen at once here:
- The second dimension of the matrix collapses, that is, we get both a list and a vector at the same time.
- The list therefore inherits from these classes. Keep in mind that each list element corresponds to a single (scalar) value from a cell of the matrix-array.
Because a list is also a vector, some vector functions can be applied to it.
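A small base R sketch of the points above, using the same 4 x 5 array shape as the article; the names `m` and `l` are illustrative assumptions.

```r
m <- array(1:20, c(4, 5))
l <- as.list(m)

length(l)          # 20: one list element per matrix cell
is.null(dim(l))    # TRUE: the dimensions are gone
l[[6]] == m[2, 2]  # TRUE: cells are taken in column-major order

# Generic vector functions also accept the list
length(rev(l))     # 20
unlist(head(l, 3)) # 1 2 3
```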
Dataframe
From a list, matrix, or vector you can move on to a dataframe (?base::data.frame).
Code
## data.frames ------------
df <- as.data.frame(arrmatr)
df2 <- as.data.frame(mylist)
is.list(df)
df$V6 <- df$V1 + df$V2
What is interesting about it: a dataframe inherits from list! The columns of a dataframe are the cells of a list. This will matter later, when we use functions that apply to lists.
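To make the "a dataframe is a list of columns" idea concrete, here is a minimal base R sketch. The table `df_demo` and its values are made up for illustration.

```r
# Hypothetical two-column dataframe
df_demo <- data.frame(V1 = 1:4, V2 = c(2.5, 3.5, 4.5, 5.5))

is.list(df_demo)                        # TRUE
identical(df_demo[["V2"]], df_demo$V2)  # TRUE: list-style access to a column
length(df_demo)                         # 2: length counts columns, as for a list

# List tools iterate over columns, not rows
col_means <- vapply(df_demo, mean, numeric(1))
col_means
```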
Datatable
A DT (?data.table::data.table) can be obtained from a dataframe, list, vector, or matrix. For example, like this (in place).
Code
## data.tables -----------------------
library(data.table)
data.table::setDT(df)
is.list(df)
is.data.frame(df)
is.data.table(df)
Usefully, just like a dataframe, a DT inherits the properties of a list.
DT and memory
Unlike all other objects in base R, DTs are passed by reference. If you need a copy in a new area of memory, you need the data.table::copy function, or you need to take a selection from the old object.
Code
df2 <- df
df[V1 == 1, V2 := 999]
data.table::fsetdiff(df, df2)
df2 <- data.table::copy(df)
df[V1 == 2, V2 := 999]
data.table::fsetdiff(df, df2)
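The same behavior can be seen on a three-row table. This is a sketch with made-up names (`dt_a`, `dt_b`, `dt_c`), assuming data.table is installed.

```r
library(data.table)

dt_a <- data.table(x = 1:3)
dt_b <- dt_a         # plain assignment: another name for the same table
dt_c <- copy(dt_a)   # explicit copy in new memory

dt_a[, y := x * 2L]  # add a column by reference

names(dt_b)  # "x" "y": dt_b sees the change
names(dt_c)  # "x": the copy is unaffected
```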
That brings the introduction to an end. DT is a continuation of the development of data structures in R, which mostly comes from extending and speeding up the operations performed on objects of the dataframe class, while inheritance from the other primitives is preserved.
A few examples of using data.table properties
As a list ...
Iterating over the rows of a dataframe or DT is not a good idea, since loop code in R is much slower than C, but looping over the columns, of which there are usually far fewer, is perfectly fine. When walking over columns, remember that each column is a list element that, as a rule, holds a vector, and operations on vectors are well vectorized in the base functions of the language. You can also use the selection operators typical of lists and vectors: `[[`, `$`.
Code
## operations on data.tables ------------
#using list properties
df$'V1'[1]
df[['V1']]
df[[1]][1]
sapply(df, class)
sapply(df, function(x) sum(is.na(x)))
Vectorization
If you need to go over the rows of a large DT, writing a vectorized function is the best solution. If that does not work, remember that a loop inside a DT is still faster than a loop in R, because it is executed in C.
Let's try a bigger example with 100K rows. We will extract the first letter from the words in the vector column w.
แแถแแแแแแถแแแแแแ
Code
library(magrittr)
library(microbenchmark)
## Bigger example ----
rown <- 100000
dt <-
data.table(
w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
# vectorization
microbenchmark({
dt[
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
, by = 1:nrow(dt)
]
})
# second
first_l_f <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
do.call(rbind, .) %>%
`[`(,1)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f(w))
]
})
# third
first_l_f2 <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
unlist %>%
matrix(nrow = 3) %>%
`[`(1,)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f2(w))
]
})
The first run iterates over the rows:
Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, unlist(strsplit(w, split = " ", fixed = T))[1]), by = 1:nrow(dt)] } 439.6217 451.9998 460.1593 456.2505 460.9147 621.4042 100
The second run, where vectorization goes through converting the list into a matrix and taking the elements at the slice with index 1 (the latter is the actual vectorization). A correction: the vectorization is at the level of the strsplit function, which can take a vector as input. It turns out that converting the list into a matrix is much more expensive than the vectorization itself, but in this case it is still much faster than the non-vectorized variant.
Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, .(first_l_f(w)))] } 93.07916 112.1381 161.9267 149.6863 185.9893 442.5199 100
Median speed-up: 3x.
The third run, where the scheme for converting to a matrix is changed.
Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, .(first_l_f2(w)))] } 32.60481 34.13679 40.4544 35.57115 42.11975 222.972 100
Median speed-up: 13x.
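The quoted speed-ups follow directly from the median column of the three benchmark outputs. A quick check in R; the labels in `med` are illustrative names, not from the article.

```r
# Median timings in milliseconds, copied from the three benchmark outputs
med <- c(by_row = 456.2505, rbind_matrix = 149.6863, unlist_matrix = 35.57115)

# Ratio of the row-by-row median to each variant
speedup <- med[["by_row"]] / med
round(speedup, 1)
```

Rounded to whole numbers, the second and third ratios come out at 3 and 13.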
You have to experiment; the more, the better.
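In that spirit, one more variant that could be timed is sketched below. It is not one of the article's runs: it calls strsplit once on the whole vector and then picks the first piece of each element with vapply, so no matrix is built. The input `w` is a tiny made-up vector.

```r
# Hypothetical input in the same "letters separated by spaces" shape
w <- c("a b c", "d e f", "g h i")

parts <- strsplit(w, split = " ", fixed = TRUE)         # one call for the whole vector
first <- vapply(parts, function(p) p[1], character(1))  # first piece of each element
first
```

Whether this beats the matrix-based variants on 100K rows would need to be measured with microbenchmark in the same way as above.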
Another vectorization example, which also involves text, but closer to real conditions: words of different lengths and a different number of words per row. The first 3 words are needed. Like this:

Here the previous function does not work, because the vectors have different lengths while we fixed the matrix dimension. Let's rework it after digging around on the internet.
Code
# fourth
rown <- 100000
words <-
sapply(
seq_len(rown)
, function(x){
nwords <- rbinom(1, 10, 0.5)
paste(
sapply(
seq_len(nwords)
, function(x){
paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
}
)
, collapse = ' '
)
}
)
dt <-
data.table(
w = words
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
first_l_f3 <- function(sd, n)
{
l <- strsplit(sd, split = ' ', fixed = T)
maxl <- max(lengths(l))
sapply(l, "length<-", maxl) %>%
`[`(n,) %>%
as.character
}
microbenchmark({
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
})
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`((paste0("w_", 1:3)), strsplit(w, split="", fixed = T))] } 851.7623 916.071 1054.5 1035.199 1178.738 1356.816 100
The script ran at an average speed of about 1 second. Not bad.
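As a possible alternative to the hand-written helper, data.table ships `tstrsplit`, which splits a character vector and returns the pieces transposed into columns, padding short rows with NA. The sketch below is an assumption about how it could be applied here, on a made-up three-row table; it is not the article's method and was not benchmarked.

```r
library(data.table)

# Hypothetical rows with a different number of words each
dt_demo <- data.table(w = c("one two three four", "five six", "seven"))

# keep = 1:3 retains only the first three pieces; missing ones become NA
dt_demo[, (paste0("w_", 1:3)) := tstrsplit(w, " ", fixed = TRUE, keep = 1:3)]
dt_demo
```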
Bound by one chain...
You can work with DT objects using chaining. It looks like attaching bracket syntax on the right; essentially, it is syntactic sugar.
Code
# chaining
res1 <- dt[a == 'a'][sample(.N, 100)]
res2 <- dt[, .N, a][, N]
res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]
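To show that a chain is only shorthand for a sequence of steps, here is a minimal sketch on a made-up five-row table; `dt_demo`, `counts`, `chained`, and `stepped` are illustrative names.

```r
library(data.table)

dt_demo <- data.table(a = c("x", "y", "x", "z", "x"), v = 1:5)

# Chained: count rows per group, then keep groups seen more than once
chained <- dt_demo[, .N, by = a][N > 1]

# The same two steps with an intermediate object
counts  <- dt_demo[, .N, by = a]
stepped <- counts[N > 1]

all.equal(chained, stepped)  # TRUE
```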
Flowing through the pipes ...
The same operations can be done with piping. It looks similar but is functionally richer, because you can use any methods, not just DT ones. Let's derive logistic regression coefficients for our synthetic data, with a number of filters on the DT.
Code
# piping
samplpe_b <- dt[a %in% head(letters), sample(b, 1)]
res4 <-
dt %>%
.[a %in% head(letters)] %>%
.[,
{
dt0 <- .SD[1:100]
quants <-
dt0[, c] %>%
quantile(seq(0.1, 1, 0.1), na.rm = T)
.(q = quants)
}
, .(cond = b > samplpe_b)
] %>%
glm(
cond ~ q -1
, family = binomial(link = "logit")
, data = .
) %>%
summary %>%
.[[12]]
Statistics, machine learning, and more inside a DT
You can use lambda functions, but sometimes it is better to create them separately, writing out the whole data analysis pipeline, and then they work inside the DT. The example is enriched with all the features described above, plus a few useful things from the DT arsenal (such as referring to the DT itself inside the DT by reference; some are inserted out of sequence, just so that they are there).
Code
# function
rm(lm_preds)
lm_preds <- function(
sd, by, n
)
{
if(
n < 100 |
!by[['a']] %in% head(letters, 4)
)
{
res <-
list(
low = NA
, mean = NA
, high = NA
, coefs = NA
)
} else {
lmm <-
lm(
d ~ c + b
, data = sd
)
preds <-
stats::predict.lm(
lmm
, sd
, interval = "prediction"
)
res <-
list(
low = preds[, 2]
, mean = preds[, 1]
, high = preds[, 3]
, coefs = coefficients(lmm)
)
}
res
}
res5 <-
dt %>%
.[e < 0] %>%
.[.[, .I[b > 0]]] %>%
.[, `:=` (
low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
)
, a
] %>%
.[!is.na(mean), -'e', with = F]
# plot
plo <-
res5 %>%
ggplot +
facet_wrap(~ a) +
geom_ribbon(
aes(
x = c * coef_c + b * coef_b + coef_int
, ymin = low
, ymax = high
, fill = a
)
, size = 0.1
, alpha = 0.1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = mean
, color = a
)
, size = 1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = d
)
, size = 1
, color = 'black'
) +
theme_minimal()
print(plo)
Conclusion
I hope I have managed to create a coherent, though of course not complete, picture of an object such as data.table, starting from its properties related to inheritance from R classes and ending with its own features and its surroundings from elements of the tidyverse. I hope this helps you learn and use this library better, for work and for fun.

Thank you!
Full code
Code
## load libs ----------------
library(data.table)
library(ggplot2)
library(magrittr)
library(microbenchmark)
## arrays ---------
arrmatr <- array(1:20, c(4,5))
class(arrmatr)
typeof(arrmatr)
is.array(arrmatr)
is.matrix(arrmatr)
## lists ------------------
mylist <- as.list(arrmatr)
is.vector(mylist)
is.list(mylist)
## data.frames ------------
df <- as.data.frame(arrmatr)
is.list(df)
df$V6 <- df$V1 + df$V2
## data.tables -----------------------
data.table::setDT(df)
is.list(df)
is.data.frame(df)
is.data.table(df)
df2 <- df
df[V1 == 1, V2 := 999]
data.table::fsetdiff(df, df2)
df2 <- data.table::copy(df)
df[V1 == 2, V2 := 999]
data.table::fsetdiff(df, df2)
## operations on data.tables ------------
#using list properties
df$'V1'[1]
df[['V1']]
df[[1]][1]
sapply(df, class)
sapply(df, function(x) sum(is.na(x)))
## Bigger example ----
rown <- 100000
dt <-
data.table(
w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
# vectorization
# zero - for loop
microbenchmark({
for(i in 1:nrow(dt))
{
dt[
i
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
]
}
})
# first
microbenchmark({
dt[
, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
, by = 1:nrow(dt)
]
})
# second
first_l_f <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
do.call(rbind, .) %>%
`[`(,1)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f(w))
]
})
# third
first_l_f2 <- function(sd)
{
strsplit(sd, split = ' ', fixed = T) %>%
unlist %>%
matrix(nrow = 3) %>%
`[`(1,)
}
dt[, first_l := NULL]
microbenchmark({
dt[
, first_l := .(first_l_f2(w))
]
})
# fourth
rown <- 100000
words <-
sapply(
seq_len(rown)
, function(x){
nwords <- rbinom(1, 10, 0.5)
paste(
sapply(
seq_len(nwords)
, function(x){
paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
}
)
, collapse = ' '
)
}
)
dt <-
data.table(
w = words
, a = sample(letters, rown, replace = T)
, b = runif(rown, -3, 3)
, c = runif(rown, -3, 3)
, e = rnorm(rown)
) %>%
.[, d := 1 + b + c + rnorm(nrow(.))]
first_l_f3 <- function(sd, n)
{
l <- strsplit(sd, split = ' ', fixed = T)
maxl <- max(lengths(l))
sapply(l, "length<-", maxl) %>%
`[`(n,) %>%
as.character
}
microbenchmark({
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
})
dt[
, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
]
# chaining
res1 <- dt[a == 'a'][sample(.N, 100)]
res2 <- dt[, .N, a][, N]
res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]
# piping
samplpe_b <- dt[a %in% head(letters), sample(b, 1)]
res4 <-
dt %>%
.[a %in% head(letters)] %>%
.[,
{
dt0 <- .SD[1:100]
quants <-
dt0[, c] %>%
quantile(seq(0.1, 1, 0.1), na.rm = T)
.(q = quants)
}
, .(cond = b > samplpe_b)
] %>%
glm(
cond ~ q -1
, family = binomial(link = "logit")
, data = .
) %>%
summary %>%
.[[12]]
# function
rm(lm_preds)
lm_preds <- function(
sd, by, n
)
{
if(
n < 100 |
!by[['a']] %in% head(letters, 4)
)
{
res <-
list(
low = NA
, mean = NA
, high = NA
, coefs = NA
)
} else {
lmm <-
lm(
d ~ c + b
, data = sd
)
preds <-
stats::predict.lm(
lmm
, sd
, interval = "prediction"
)
res <-
list(
low = preds[, 2]
, mean = preds[, 1]
, high = preds[, 3]
, coefs = coefficients(lmm)
)
}
res
}
res5 <-
dt %>%
.[e < 0] %>%
.[.[, .I[b > 0]]] %>%
.[, `:=` (
low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
)
, a
] %>%
.[!is.na(mean), -'e', with = F]
# plot
plo <-
res5 %>%
ggplot +
facet_wrap(~ a) +
geom_ribbon(
aes(
x = c * coef_c + b * coef_b + coef_int
, ymin = low
, ymax = high
, fill = a
)
, size = 0.1
, alpha = 0.1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = mean
, color = a
)
, size = 1
) +
geom_point(
aes(
x = c * coef_c + b * coef_b + coef_int
, y = d
)
, size = 1
, color = 'black'
) +
theme_minimal()
print(plo)
Source: www.habr.com
