Around data.table

This note will be of interest to those who use data.table, the tabular data processing library for R, and who may enjoy seeing how flexibly it can be applied in a variety of examples.

Inspired by a colleague's good example, and hoping that you have already read his article, I suggest digging deeper into code optimization and performance based on data.table.

Introduction: where does data.table come from?

It is best to start getting to know the library from a bit of a distance, namely with the data structures from which a data.table object (hereafter DT) can be obtained.

Array

Code

## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)

One such structure is the array (?base::array). As in other languages, arrays here are multidimensional. What is interesting, however, is that a two-dimensional array, for example, starts inheriting properties from the matrix class (?base::matrix), while a one-dimensional array, which is also important, does not inherit from vector (?base::vector).

Keep in mind that the type of the data contained in an object should be checked with the base::typeof function, which returns the internal type description according to R Internals, the language's general protocol tied to its C origins.

Another command for determining an object's class, base::class, returns the vector type in the case of vectors (it differs from the internal name, but it also lets you understand the data type).
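The difference between the internal type and the class can be seen in a short base R sketch. The object names are made up for illustration, and the matrix check uses inherits() because recent R versions report more than one class for a matrix.

```r
# Illustrative objects (names are hypothetical, not from the article)
num_vec <- c(1.5, 2.5)            # plain numeric vector
one_dim <- array(1:3)             # one-dimensional array
two_dim <- array(1:20, c(4, 5))   # two-dimensional array

typeof(num_vec)              # internal storage type: "double"
class(num_vec)               # class reported for the vector: "numeric"

is.vector(one_dim)           # FALSE: the dim attribute means it is not a plain vector
inherits(two_dim, "matrix")  # TRUE: a two-dimensional array behaves as a matrix
typeof(two_dim)              # "integer": the type of the data stored inside
```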

List

From a two-dimensional array, also known as a matrix, you can move on to a list (?base::list).

Code

## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)

Several things happen here at once:

  • The second dimension of the matrix collapses, that is, we get both a list and a vector at the same time.
  • The list thus inherits from these classes. Keep in mind that each list element corresponds to a single (scalar) value from a cell of the matrix array.

Because a list is also a vector, some vector functions can be applied to it.
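As a small illustration of the points above, the following base R sketch converts a matrix to a list and applies a few ordinary vector functions to it. The variable names are assumptions chosen for this example.

```r
m <- array(1:20, c(4, 5))
as_list <- as.list(m)      # one list element per matrix cell

length(as_list)            # 20 elements, one per cell
as_list[[5]]               # 5: cells are taken column by column
rev(as_list)[[1]]          # 20: rev() is a vector function and works on the list
head(as_list, 2)           # the first two elements, still a list
```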

Data frame

From a list, matrix, or vector you can move on to a data frame (?base::data.frame).

Code

## data.frames ------------

df <- as.data.frame(arrmatr)
df2 <- as.data.frame(mylist)

is.list(df)

df$V6 <- df$V1 + df$V2

What is interesting about it: a data frame inherits from list! The columns of a data frame are the cells of a list. This will matter later, when we use functions that apply to lists.
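A minimal sketch of what "columns are list elements" means in practice, using only base R. The name df_demo is hypothetical; the column names V1..V5 are the defaults that as.data.frame assigns to an unnamed matrix.

```r
df_demo <- as.data.frame(array(1:20, c(4, 5)))   # columns V1..V5

length(df_demo)                        # number of list elements equals the number of columns
identical(df_demo[[2]], df_demo$V2)    # list-style access returns the column vector

# Apply a function to every column, exactly as one would to a list
col_sums <- vapply(df_demo, sum, numeric(1))
col_sums
```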

data.table

A DT (?data.table::data.table) can be obtained from a data frame, a list, a vector, or a matrix. For example, like this (in place).

Code

## data.tables -----------------------
library(data.table)

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)

What is useful is that, like a data frame, a DT inherits the properties of a list.

DT and memory

Unlike most other objects in base R, DTs are passed by reference. If you need a copy in a new memory area, you need the data.table::copy function, or you have to take a subset of the old object.

Code

df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)
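The same behaviour can be made visible without a set difference. The sketch below uses a tiny table with made-up names (orig_dt, same_obj, snap) and simply inspects one cell after an update by reference.

```r
library(data.table)

orig_dt  <- data.table(id = 1:3, v = c(10, 20, 30))
same_obj <- orig_dt          # a second name for the same object, no copy is made
snap     <- copy(orig_dt)    # an independent copy in new memory

orig_dt[id == 1, v := 999]   # update by reference

same_obj$v[1]   # 999: the second name sees the change
snap$v[1]       # 10: the copy does not
```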

This concludes the introduction. DT is a continuation of the development of data structures in R, which happens mainly through extending and speeding up the operations performed on objects of the data frame class. At the same time, inheritance from the other primitives is preserved.

Some examples of using data.table properties

As a list...

Iterating over the rows of a data frame or DT is not a good idea, because loop code in R is much slower than C, but walking over the columns, of which there are usually far fewer, is perfectly fine. When walking over columns, remember that each column is a list element that usually contains a vector. Operations on vectors are well vectorized in the base functions of the language. You can also use the selection operators typical of lists and vectors: `[[`, `$`.

Code

## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))
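For column-wise work, two variants of the last call may be worth knowing; this is a sketch on a made-up table (demo), not part of the original example. vapply() is the type-checked relative of sapply(), and lapply(.SD, ...) runs the same column loop inside the DT and returns a DT.

```r
library(data.table)

# Hypothetical toy table with some missing values
demo <- data.table(x = c(1, NA, 3), y = c(NA, NA, 6), z = c("a", "b", NA))

# Base R: iterate over the columns as list elements, with a declared result type
na_counts <- vapply(demo, function(col) sum(is.na(col)), integer(1))

# data.table: the same per-column loop expressed through .SD
na_dt <- demo[, lapply(.SD, function(col) sum(is.na(col)))]

na_counts
na_dt
```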

Vectorization

If you need to walk over the rows of a large DT, writing a vectorized function is the best solution. But if that does not work out, remember that a loop inside DT is still faster than a loop in R, because it is executed in C.

Let's try it on a larger example with 100K rows. We will extract the first letter from the words in the vector column w.

Updated

Code

library(magrittr)
library(microbenchmark)

## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})

The first run iterates over rows:

Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, unlist(strsplit(w, split = " ", fixed = T))[1]), by = 1:nrow(dt)] } 439.6217 451.9998 460.1593 456.2505 460.9147 621.4042 100

The second run, where vectorization is done by turning the list into a matrix and taking the elements on the slice with index 1 (the latter being the vectorization proper). A correction: the vectorization happens at the level of the strsplit function, which can accept a vector as input. It turns out that converting the list into a matrix is much heavier than the vectorization itself, but even in this case it is considerably faster than the non-vectorized variant.

Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, .(first_l_f(w)))] } 93.07916 112.1381 161.9267 149.6863 185.9893 442.5199 100

A 3x speed-up by median.

The third run, where the scheme for converting to a matrix has been changed.

Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`(first_l, .(first_l_f2(w)))] } 32.60481 34.13679 40.4544 35.57115 42.11975 222.972 100

A 13x speed-up by median.

You need to experiment with this; the more, the better.
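In that spirit, here are two further variants one could benchmark, shown on a made-up three-row table (toy) rather than the 100K-row one. Both call strsplit-style splitting once on the whole column; the second uses data.table's tstrsplit helper with its keep argument. No timings are claimed for either.

```r
library(data.table)

# Hypothetical toy data in the same "letter letter letter" shape as column w
toy <- data.table(w = c("a b c", "x y z", "k l m"))

# strsplit() is called once on the whole column; vapply() then picks the first piece
toy[, first_v := vapply(strsplit(w, " ", fixed = TRUE), function(p) p[1], character(1))]

# tstrsplit() transposes the split result; keep = 1L returns only the first piece
toy[, first_t := tstrsplit(w, " ", fixed = TRUE, keep = 1L)]

toy
```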

Another vectorization example, which also involves text, but closer to real conditions: words of different lengths and different numbers of words. The task is to get the first 3 words.

Here the previous function does not work, because the vectors have different lengths while we fixed the size of the matrix. Let's redo it after digging around on the Internet.

Code

# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]

Unit: milliseconds
expr min lq mean median uq max neval
{ dt[, `:=`((paste0("w_", 1:3)), strsplit(w, split="", fixed = T))] } 851.7623 916.071 1054.5 1035.199 1178.738 1356.816 100

The script ran at an average speed of 1 second. Not bad.
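For comparison, data.table ships a helper aimed at this kind of ragged split. The sketch below is an alternative to the function above, not the author's method; it uses a made-up three-row table (toy) and relies on tstrsplit's fill and keep arguments to pad short rows with NA and return only the first three pieces.

```r
library(data.table)

# Hypothetical toy data with a varying number of words per row
toy <- data.table(w = c("aa bb cc dd", "ee", "ff gg"))

# Split once, pad missing pieces with NA, keep pieces 1 to 3 as new columns
toy[, (paste0("w_", 1:3)) := tstrsplit(w, " ", fixed = TRUE, fill = NA, keep = 1:3)]

toy
```

Note that keep = 1:3 assumes at least one row has three or more words.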

Linked in a single chain...

You can work with DT objects using chaining. It looks like attaching bracket syntax on the right; essentially, it is syntactic sugar.

Code

# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]
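To see what each link of a chain does, it can help to run one on data small enough to check by hand. The table toy and the column names n and mean_b below are assumptions for this sketch.

```r
library(data.table)

toy <- data.table(a = c("x", "x", "y", "y", "y"), b = c(1, -1, 2, 3, 4))

res <- toy[b > 0][                            # 1. filter rows
  , .(n = .N, mean_b = mean(b)), by = a][     # 2. aggregate per group
  order(-n)]                                  # 3. sort the aggregated result

res
```

Each bracket receives the DT produced by the previous one, so the sort in step 3 can refer to the column n created in step 2.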

Flowing through pipes...

The same operations can be done through piping. It looks similar but is functionally richer, because you can use any methods, not only those of DT. Let's output the logistic regression coefficients for our synthetic data, with a number of filters on the DT.

Code

# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]
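A reduced version of the same idea, as a sketch: it swaps in the built-in mtcars data and an ordinary linear model in place of the logistic regression above, so that it runs without the synthetic table. It also picks the coefficient table with coef() instead of a positional index, which does not depend on the order of elements in the summary object.

```r
library(data.table)
library(magrittr)

cars_dt <- as.data.table(mtcars)   # built-in data set, used here only for illustration

coefs <-
  cars_dt %>%
  .[cyl %in% c(4, 6)] %>%          # a data.table filter step
  lm(mpg ~ wt, data = .) %>%       # any non-DT function can follow in the pipe
  summary %>%
  coef                             # coefficient table, selected by name

coefs
```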

Statistics, machine learning and more inside DT

You can use lambda functions, but sometimes it is better to create them separately, write out the whole data analysis pipeline, and go ahead: they work inside DT. The example is enriched with all the features described above, plus some useful things from the DT arsenal (such as referring to the DT itself inside the DT through a reference, sometimes inserted not quite consistently, but just to have it there).

Code

# function

if (exists("lm_preds")) rm(lm_preds)  # drop an earlier definition, if there is one

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])  # lm() returns (Intercept) first
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])    # then c
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])    # then b, following d ~ c + b
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot (needs ggplot2)

library(ggplot2)

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)
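The example above calls lm_preds six times per group, so the model is refitted for every output column. One possible variation, sketched here on made-up data (toy, with columns g, x, y), fits the model once per group inside a braced j expression and returns all columns together. This is an alternative under those assumptions, not the article's code.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: two groups, a linear signal plus noise
toy <- data.table(g = rep(c("a", "b"), each = 50), x = runif(100, -3, 3))
toy[, y := 1 + 2 * x + rnorm(.N)]

preds <- toy[
  , {
      fit <- lm(y ~ x, data = .SD)                        # one model per group
      p   <- predict(fit, .SD, interval = "prediction")   # matrix with fit, lwr, upr
      .(
        x = x, y = y
        , mean = p[, "fit"], low = p[, "lwr"], high = p[, "upr"]
        , coef_int = coef(fit)[1], coef_x = coef(fit)[2]  # scalars are recycled per group
      )
    }
  , by = g
]

head(preds)
```

Selecting the prediction columns by name ("fit", "lwr", "upr") also avoids relying on their position in the matrix.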

Conclusion

I hope I have managed to give a coherent, though of course not complete, picture of an object such as data.table, starting from its properties related to inheritance from R classes and ending with its own special features and the ecosystem around it. I hope this helps you learn and use this library better, for work and for fun.


Thank you!

Full code

Code

## load libs ----------------

library(data.table)
library(ggplot2)
library(magrittr)
library(microbenchmark)


## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)


## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)


## data.frames ------------

df <- as.data.frame(arrmatr)

is.list(df)

df$V6 <- df$V1 + df$V2


## data.tables -----------------------

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)

df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)


## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))


## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

# zero - for loop

microbenchmark({
	for(i in 1:nrow(dt))
		{
		dt[
			i
			, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		]
	}
})

# first

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})

# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]


# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]

# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]


# function

if (exists("lm_preds")) rm(lm_preds)  # drop an earlier definition, if there is one

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])  # lm() returns (Intercept) first
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])    # then c
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])    # then b, following d ~ c + b
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)

Source: www.habr.com
