Karibu data.table

Chapisho hili litawavutia wale wanaotumia jedwali la maktaba ya kuchakata data ya jedwali la R na wanaweza kufurahishwa kuona ubadilikaji wa matumizi yake katika mifano mbalimbali.

Imeongozwa na mfano mzuri wenzake, na nikitumai kuwa tayari umesoma nakala yake, ninapendekeza kuchimba zaidi katika uboreshaji wa nambari na utendaji kulingana na data.meza.

Utangulizi: Jedwali la data linatoka wapi?

Ni bora kuanza kufahamiana na maktaba kutoka mbali kidogo, yaani, na miundo ya data ambayo kitu cha meza ya data (hapa, DT) kinaweza kupatikana.

Mpangilio

Kanuni

## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)

Moja ya miundo kama hii ni safu (?msingi::safu) Kama ilivyo kwa lugha zingine, safu hapa ni za pande nyingi. Walakini, kinachovutia ni kwamba, kwa mfano, safu ya pande mbili huanza kurithi mali kutoka kwa darasa la matrix. (?msingi::matrix), na safu ya sura moja, ambayo pia ni muhimu, hairithi kutoka kwa vekta (?msingi::vekta).

Ni muhimu kuelewa kwamba aina ya data iliyo katika kitu chochote inapaswa kuchunguzwa na kazi msingi::aina ya, ambayo inarudisha maelezo ya ndani ya aina kulingana na R Ndani - itifaki ya lugha ya kawaida inayohusishwa na asili C.

Amri nyingine ya kuamua darasa la kitu, msingi:: darasa, katika kesi ya vectors, inarudi aina ya vector (ina jina tofauti kutoka kwa ndani, lakini pia inakuwezesha kuelewa aina ya data).

Orodha ya

Kutoka kwa safu ya pande mbili, pia inajulikana kama matrix, unaweza kwenda kwenye orodha (?msingi::orodha).

Kanuni

## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)

Mambo kadhaa hutokea mara moja:

  • Mwelekeo wa pili wa matrix huanguka, yaani, tunapata orodha na vector kwa wakati mmoja.
  • Orodha kwa hivyo hurithi kutoka kwa madarasa haya. Kumbuka kwamba kila kipengele cha orodha kitalingana na thamani moja (ya ukubwa) kutoka kwa seli ya mkusanyiko wa matriki.

Kwa kuwa orodha pia ni vekta, kazi zingine za vekta zinaweza kutumika kwake.

Dataframe

Kutoka kwa orodha, matrix au vekta unaweza kwenda kwa sura ya data (?msingi::data.frame).

Kanuni

## data.frames ------------

df <- as.data.frame(arrmatr)
df2 <- as.data.frame(mylist)

is.list(df)

df$V6 <- df$V1 + df$V2

Kinachofurahisha juu yake ni kwamba mfumo wa data hurithi kutoka kwa orodha! Safu wima za mfumo wa data ni visanduku vya orodha. Hili litakuwa muhimu baadaye tunapotumia vipengele vinavyotumika kwenye orodha.

data.meza

Pata DT (?meza.data::meza.data) inaweza kutoka sura ya data, orodha, vekta, au tumbo. Kwa mfano, kama hii (mahali).

Kanuni

## data.tables -----------------------
library(data.table)

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)

Kinachofaa ni kwamba, kama mfumo wa data, DT hurithi sifa za orodha.

DT na kumbukumbu

Tofauti na vitu vingine vyote kwenye msingi wa R, DT hupitishwa kwa kumbukumbu. Ikiwa unahitaji kuzinakili kwenye eneo jipya la kumbukumbu, unahitaji chaguo la kukokotoa data.meza::nakala au unahitaji kufanya uteuzi kutoka kwa kitu cha zamani.

Kanuni

df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)

Hii inahitimisha utangulizi. DT ni mwendelezo wa ukuzaji wa miundo ya data katika R, haswa kupitia upanuzi na kuongeza kasi ya shughuli zinazofanywa kwenye vitu vya mfumo wa data. Wakati wa kudumisha urithi kutoka kwa primitives zingine.

Baadhi ya mifano ya kutumia sifa za data.table

Kama orodha…

Kurudia safu za safu ya data au DT sio wazo bora, kwani nambari ya kitanzi iko katika lugha. R polepole zaidi C, lakini kupiga kitanzi kupitia nguzo, ambazo kwa kawaida ni chache zaidi, kunawezekana kabisa. Unapopitia safu wima, kumbuka kwamba kila safu ni kipengele cha orodha, kwa kawaida huwa na vekta. Na utendakazi kwenye vekta umewekewa vekta vizuri katika kazi za msingi za lugha. Unaweza pia kutumia waendeshaji uteuzi asili ya orodha na vekta: `[[`, `$`.

Kanuni

## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))

Vectorization

Ikiwa unahitaji kurudia kupitia safu za DT kubwa, suluhisho bora ni kuandika kazi ya vectorized. Lakini ikiwa hiyo haifanyi kazi, kumbuka kwamba kitanzi ndani DT bado ina kasi zaidi kuliko mzunguko ndani R, kwani inatekelezwa C.

Wacha tuijaribu kwa mfano mkubwa na safu mlalo 100K. Tutatoa herufi ya kwanza kutoka kwa maneno yaliyojumuishwa kwenye vekta ya safu wima. w.

Updated

Kanuni

library(magrittr)
library(microbenchmark)

## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})

Kwanza endesha na kurudia juu ya safu:

Kitengo: milliseconds
expr dakika
{dt[, `:=`(first_l, unlist(strsplit(w, split = " ", fixed = T))[1]), by = 1:nrow(dt)] } 439.6217
lq maana ya wastani uq max neval
451.9998 460.1593 456.2505 460.9147 621.4042 100

Uendeshaji wa pili, ambapo vectorization hutokea kwa kubadilisha orodha kwenye tumbo na kuchukua vipengele vya kipande na index 1 (mwisho ni vectorization halisi). Marekebisho: vectorization katika ngazi ya kazi. strsplit, ambayo inaweza kukubali vekta kama pembejeo. Inabadilika kuwa utaratibu wa kubadilisha orodha kuwa matrix ni ngumu zaidi kuliko vectorization yenyewe, lakini hata katika kesi hii, ni haraka sana kuliko toleo lisilo la vector.

Kitengo: milliseconds
expr min lq maana wastani uq max neval
{dt[, `:=`(first_l, .(first_l_f(w)))] } 93.07916 112.1381 161.9267 149.6863 185.9893 442.5199 100

Kuongeza kasi kwa wastani katika mara 3.

Uendeshaji wa tatu, ambapo mpango wa ubadilishaji wa matrix ulibadilishwa.

Kitengo: milliseconds
expr min lq maana wastani uq max neval
{dt[, `:=`(first_l, .(first_l_f2(w)))] } 32.60481 34.13679 40.4544 35.57115 42.11975 222.972 100

Kuongeza kasi kwa wastani katika mara 13.

Unahitaji kujaribu na jambo hili, bora zaidi.

Huu hapa ni mfano mwingine wa uwekaji vekta, pia wenye maandishi, lakini karibu na hali halisi ya ulimwengu: urefu tofauti wa maneno na hesabu tofauti za maneno. Lengo ni kutoa maneno matatu ya kwanza. Kama hii:

Karibu data.table

Chaguo za kukokotoa za awali hazifanyi kazi hapa kwa sababu vekta ni za urefu tofauti, na tulibainisha ukubwa wa matrix. Hebu tulifanyie kazi upya hili kwa kuchimba mtandaoni.

Kanuni

# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]

Kitengo: milliseconds
expr min lq maana ya wastani

{dt[, `:=`((bandika0(“w_”, 1:3)), strsplit(w, split = " ", fasta = T))] } 851.7623 916.071 1054.5 1035.199
uq max neval
1178.738 1356.816 100

Hati ilienda kwa kasi ya wastani ya sekunde 1. Sio mbaya.

Imeunganishwa na mnyororo mmoja…

Unaweza kufanya kazi na vitu vya DT kwa kutumia minyororo. Hii inaonekana kama kufunga sintaksia ya mabano upande wa kulia, kimsingi mipako ya sukari.

Kanuni

# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]

Inavuja kupitia mabomba...

Shughuli sawa zinaweza kufanywa kwa kutumia mabomba; inaonekana sawa, lakini ni tajiri zaidi kiutendaji, kwani njia zozote zinaweza kutumika, sio tu DT. Wacha tupate hesabu za urekebishaji wa vifaa kwa data yetu ya syntetisk na vichujio kadhaa vya DT.

Kanuni

# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]

Takwimu, kujifunza kwa mashine, na zaidi ndani ya DT

Vipengele vya kukokotoa vya Lambda vinaweza kutumika, lakini wakati mwingine ni bora kuziunda kando, kuandika bomba zima la uchanganuzi wa data, na ukienda zako—zinafanya kazi ndani ya DT. Mfano huo umeimarishwa na vipengele vyote vilivyotajwa hapo juu, pamoja na mambo kadhaa muhimu kutoka kwa arsenal ya DT (kama vile kufikia DT yenyewe ndani ya DT kupitia kiungo, wakati mwingine huingizwa nje ya mlolongo, lakini ili tu kuwa na uhakika).

Kanuni

# function

rm(lm_preds)

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)

Hitimisho

Natumai nimeweza kuunda picha ya kina, ingawa haijakamilika, ya kitu cha data.table, kutoka kwa sifa zake zinazohusiana na urithi kutoka kwa madarasa ya R hadi sifa zake na mazingira ya vipengele safi. Natumai hii itakusaidia kuelewa vyema na kutumia maktaba hii kwenye kazi yako. burudani.

Karibu data.table

Asante!

Msimbo kamili

Kanuni

## load libs ----------------

library(data.table)
library(ggplot2)
library(magrittr)
library(microbenchmark)


## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)


## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)


## data.frames ------------

df <- as.data.frame(arrmatr)

is.list(df)

df$V6 <- df$V1 + df$V2


## data.tables -----------------------

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)

df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)


## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))


## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

# zero - for loop

microbenchmark({
	for(i in 1:nrow(dt))
		{
		dt[
			i
			, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		]
	}
})

# first

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})

# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]


# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]

# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]


# function

rm(lm_preds)

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)

Chanzo: mapenzi.com

Kuongeza maoni