Hello, Habr!
Last fall, Kaggle hosted Quick Draw Doodle Recognition, a competition on classifying hand-drawn pictures, and a team of R scientists was among the participants.
This time medal farming did not work out, but we gained a lot of valuable experience, so I would like to tell the community about some of the most interesting and useful things for Kaggle and for everyday work. The topics discussed include OpenCV, JSON parsing (these examples look at integrating C++ code into R scripts or packages), script parameterization, and dockerization of the final solution. All the code from this post, in a form suitable for execution, is available in the project repository.
Contents:
- Efficiently loading data from CSV into MonetDB
- Preparing batches
- Iterators for unloading batches from the database
- Choosing a model architecture
- Script parameterization
- Dockerizing scripts
- Using multiple GPUs on Google Cloud
- In place of a conclusion
1. Efficiently loading data from CSV into the MonetDB database
The data in this competition is provided not as ready-made images but as 340 CSV files (one file per class) containing JSON with point coordinates. Connecting the points with lines gives a final image of 256x256 pixels. Each record also contains a label showing whether the picture was correctly recognized by the classifier used when the dataset was collected, the two-letter code of the author's country of residence, a unique identifier, a timestamp, and a class name that matches the file name. The simplified version of the source data weighs 7.4 GB in the archive and about 20 GB after unpacking; the full data takes 240 GB after unpacking. The organizers guaranteed that both versions reproduce the same drawings, which means the full version is redundant. In any case, we decided that storing 50 million images as graphic files or as arrays was not worthwhile, and chose to merge all the CSV files from the train_simplified.zip archive into a database and then generate images of the required size on the fly for each batch.
As the DBMS we chose the well-proven MonetDB, namely its implementation for R in the form of the MonetDBLite package:
con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
Two tables need to be created: one for all the data, and one for service information about the loaded files (useful if something goes wrong and the process has to be resumed after several files have already been loaded).
Creating the tables
if (!DBI::dbExistsTable(con, "doodles")) {
DBI::dbCreateTable(
con = con,
name = "doodles",
fields = c(
"countrycode" = "char(2)",
"drawing" = "text",
"key_id" = "bigint",
"recognized" = "bool",
"timestamp" = "timestamp",
"word" = "text"
)
)
}
if (!DBI::dbExistsTable(con, "upload_log")) {
DBI::dbCreateTable(
con = con,
name = "upload_log",
fields = c(
"id" = "serial",
"file_name" = "text UNIQUE",
"uploaded" = "bool DEFAULT false"
)
)
}
The fastest way to load the data into the database turned out to be copying the CSV files directly with the SQL command COPY OFFSET 2 INTO tablename FROM path USING DELIMITERS ',','\n','"' NULL AS '' BEST EFFORT, where tablename is the table name and path is the path to the file. While working with the archive, we found that the built-in unzip implementation in R does not work correctly with a number of files from the archive, so we used the system unzip (via the getOption("unzip") parameter).
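As a minimal sketch of how such a statement can be assembled in R: the helper name build_copy_sql and the example path are hypothetical and only serve the illustration. The point worth noticing is that the backslash in the record delimiter and the double quote both have to be escaped inside an R string literal.

```r
# Hypothetical helper: builds the COPY INTO statement for one CSV file.
build_copy_sql <- function(tablename, path) {
  sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT",
    tablename, path
  )
}

sql <- build_copy_sql("doodles", "/tmp/cat.csv")
# cat() prints the text exactly as the database would receive it
cat(sql, "\n")
```

OFFSET 2 skips the header row of the CSV file, which is why the first data record is the second line.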
Function for writing to the database
#' @title Extract and load files
#'
#' @description
#' Extract CSV files from a ZIP archive and load them into the database
#'
#' @param con Database connection object (class `MonetDBEmbeddedConnection`).
#' @param tablename Name of the table in the database.
#' @param zipfile Path to the ZIP archive.
#' @param filename Name of the file inside the ZIP archive.
#' @param preprocess Preprocessing function applied to the extracted file.
#'   Must accept a single argument `data` (a `data.table` object).
#'
#' @return `TRUE`.
#'
upload_file <- function(con, tablename, zipfile, filename, preprocess = NULL) {
  # Check the arguments
  checkmate::assert_class(con, "MonetDBEmbeddedConnection")
  checkmate::assert_string(tablename)
  checkmate::assert_string(filename)
  checkmate::assert_true(DBI::dbExistsTable(con, tablename))
  checkmate::assert_file_exists(zipfile, access = "r", extension = "zip")
  checkmate::assert_function(preprocess, args = c("data"), null.ok = TRUE)
  # Extract the file
  path <- file.path(tempdir(), filename)
  unzip(zipfile, files = filename, exdir = tempdir(),
        junkpaths = TRUE, unzip = getOption("unzip"))
  on.exit(unlink(file.path(path)))
  # Apply the preprocessing function
  if (!is.null(preprocess)) {
    .data <- data.table::fread(file = path)
    .data <- preprocess(data = .data)
    data.table::fwrite(x = .data, file = path, append = FALSE)
    rm(.data)
  }
  # SQL query for importing the CSV
  sql <- sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT",
    tablename, path
  )
  # Execute the query
  DBI::dbExecute(con, sql)
  # Record the successful upload in the service table
  DBI::dbExecute(con, sprintf("INSERT INTO upload_log(file_name, uploaded) VALUES('%s', true)",
                              filename))
  return(invisible(TRUE))
}
If the table needs to be transformed before it is written to the database, it is enough to pass a function that transforms the data in the preprocess argument.
Code for sequentially loading the data into the database:
Writing data to the database
# List of files to load
files <- unzip(zipfile, list = TRUE)$Name
# Files to skip if part of the archive has already been loaded
to_skip <- DBI::dbGetQuery(con, "SELECT file_name FROM upload_log")[[1L]]
files <- setdiff(files, to_skip)
if (length(files) > 0L) {
  # Start the timer
  tictoc::tic()
  # Progress bar
  pb <- txtProgressBar(min = 0L, max = length(files), style = 3)
  for (i in seq_along(files)) {
    upload_file(con = con, tablename = "doodles",
                zipfile = zipfile, filename = files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Stop the timer
  tictoc::toc()
}
# 526.141 sec elapsed - copying SSD->SSD
# 558.879 sec elapsed - copying USB->SSD
Data loading time may vary depending on the speed of the drive used. In our case, reading and writing within one SSD, or from a flash drive (source file) to an SSD (database), took under 10 minutes.
A column with an integer class label and an index column (ORDERED INDEX) with row numbers are then added; observations are sampled by these row numbers when batches are created:
Creating additional columns and an index
message("Generate labels")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD label_int int"))
invisible(DBI::dbExecute(con, "UPDATE doodles SET label_int = dense_rank() OVER (ORDER BY word) - 1"))
message("Generate row numbers")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD id serial"))
invisible(DBI::dbExecute(con, "CREATE ORDERED INDEX doodles_id_ord_idx ON doodles(id)"))
To form batches on the fly, we needed to extract random rows from the doodles table as fast as possible. We used 3 tricks for this. The first was to reduce the size of the type that stores the observation ID. In the original dataset the ID requires the bigint type, but the number of observations allows identifiers equal to the row's ordinal number to fit into the int type, and lookups are much faster in that case. The second trick was to use ORDERED INDEX; we arrived at this decision empirically, after going through all the available options. The third was to use parameterized queries: the PREPARE command is executed once and the prepared expression is then reused for many queries of the same type, although in practice the gain over a plain SELECT turned out to be within the statistical error.
The data upload process consumes no more than 450 MB of RAM. In other words, the described approach lets you move datasets weighing tens of gigabytes on almost any budget hardware, including some single-board devices, which is pretty cool.
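A small illustration of the first and third tricks using base R only. The row count of about 50 million is an assumed round figure for the dataset size, and the helper name make_prepared_sql is hypothetical.

```r
# Trick 1: row numbers fit into a 32-bit integer, so `int` is enough for ids.
n_rows <- 50e6                      # assumed approximate number of drawings
fits_in_int <- n_rows <= .Machine$integer.max

# Trick 3: one PREPARE statement with a "?" placeholder per requested id.
make_prepared_sql <- function(batch_size) {
  placeholders <- paste(rep("?", batch_size), collapse = ",")
  sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)", placeholders)
}

sql <- make_prepared_sql(4)
print(fits_in_int)
print(sql)
```

The statement depends only on the batch size, so it can be prepared once and then bound to a fresh set of ids for every batch.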
All that remains is to measure the speed of retrieving (random) data and to evaluate the scaling when sampling batches of different sizes:
Database benchmark
library(ggplot2)
set.seed(0)
# Connect to the database
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
# Total number of rows to sample ids from (assumed definition of `n`)
n <- DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]]
# Function that prepares the query on the server side
prep_sql <- function(batch_size) {
  sql <- sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)",
                 paste(rep("?", batch_size), collapse = ","))
  res <- DBI::dbSendQuery(con, sql)
  return(res)
}
# Function that fetches the data
fetch_data <- function(rs, batch_size) {
  ids <- sample(seq_len(n), batch_size)
  res <- DBI::dbFetch(DBI::dbBind(rs, as.list(ids)))
  return(res)
}
# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    rs <- prep_sql(batch_size)
    bench::mark(
      fetch_data(rs, batch_size),
      min_iterations = 50L
    )
  }
)
# Benchmark parameters
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]
# batch_size min median max `itr/sec` total_time n_itr
# <dbl> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:tm> <int>
# 1 16 23.6ms 54.02ms 93.43ms 18.8 2.6s 49
# 2 32 38ms 84.83ms 151.55ms 11.4 4.29s 49
# 3 64 63.3ms 175.54ms 248.94ms 5.85 8.54s 50
# 4 128 83.2ms 341.52ms 496.24ms 3.00 16.69s 50
# 5 256 232.8ms 653.21ms 847.44ms 1.58 31.66s 50
# 6 512 784.6ms 1.41s 1.98s 0.740 1.1m 49
# 7 1024 681.7ms 2.72s 4.06s 0.377 2.16m 49
ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
geom_point() +
geom_line() +
ylab("median time, s") +
theme_minimal()
DBI::dbDisconnect(con, shutdown = TRUE)
2. Preparing batches
The whole batch preparation process consists of the following steps:
- Parsing several JSONs containing vectors of strings with point coordinates.
- Drawing colored lines from the point coordinates on an image of the required size (for example, 256×256 or 128×128).
- Converting the resulting images into a tensor.
In the competition, Python kernels solved this problem mainly with OpenCV. One of the simplest and most obvious analogues in R looks like this:
Implementing JSON to tensor conversion in R
r_process_json_str <- function(json, line.width = 3,
                               color = TRUE, scale = 1) {
  # Parse the JSON
  coords <- jsonlite::fromJSON(json, simplifyMatrix = FALSE)
  tmp <- tempfile()
  # Remove the temporary file when the function exits
  on.exit(unlink(tmp))
  png(filename = tmp, width = 256 * scale, height = 256 * scale, pointsize = 1)
  # Empty plot
  plot.new()
  # Plot window size
  plot.window(xlim = c(256 * scale, 0), ylim = c(256 * scale, 0))
  # Line colors
  cols <- if (color) rainbow(length(coords)) else "#000000"
  for (i in seq_along(coords)) {
    lines(x = coords[[i]][[1]] * scale, y = coords[[i]][[2]] * scale,
          col = cols[i], lwd = line.width)
  }
  dev.off()
  # Convert the image into a 3-dimensional array
  res <- png::readPNG(tmp)
  return(res)
}
r_process_json_vector <- function(x, ...) {
  res <- lapply(x, r_process_json_str, ...)
  # Combine the 3-dimensional image arrays into a 4-dimensional tensor
  res <- do.call(abind::abind, c(res, along = 0))
  return(res)
}
Drawing is done with standard R tools and saved to a temporary PNG that is stored in RAM (on Linux, temporary R directories are located in /tmp, which is mounted in RAM). The file is then read as a three-dimensional array with numbers ranging from 0 to 1. This matters because a more conventional BMP would be read into a raw array with hexadecimal color codes.
Let's test the result:
zip_file <- file.path("data", "train_simplified.zip")
csv_file <- "cat.csv"
unzip(zip_file, files = csv_file, exdir = tempdir(),
junkpaths = TRUE, unzip = getOption("unzip"))
tmp_data <- data.table::fread(file.path(tempdir(), csv_file), sep = ",",
select = "drawing", nrows = 10000)
arr <- r_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256 3
plot(magick::image_read(arr))
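As a side note on the value range mentioned above, here is a sketch of what a manual conversion from raw bytes to the [0, 1] range would involve. The three byte values are made up for the example and do not come from a real image.

```r
# Raw channel values as they might come from an uncompressed bitmap (made-up data)
px_raw <- as.raw(c(0x00, 0x80, 0xff))
# Convert bytes to integers 0..255, then rescale to the [0, 1] range
px_num <- as.integer(px_raw) / 255
print(px_num)
```

Reading a PNG into a numeric array makes this step unnecessary, since the values already arrive in the range the network expects.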
The batch itself is formed as follows:
res <- r_process_json_vector(tmp_data[1:4, drawing], scale = 0.5)
str(res)
# num [1:4, 1:128, 1:128, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 4
# ..$ : NULL
# ..$ : NULL
# ..$ : NULL
# ..$ : NULL
This implementation seemed suboptimal to us, since forming large batches takes an indecently long time, so we decided to take advantage of our colleagues' experience and use the powerful OpenCV library. At that time there was no ready-made package for R (and there is none now), so a minimal implementation of the required functionality was written in C++ and integrated into the R code using Rcpp.
To solve the problem, the following packages and libraries were used:
- OpenCV for working with images and drawing lines. We used the pre-installed system libraries and header files, as well as dynamic linking.
- xtensor for working with multidimensional arrays and tensors. We used the header files included in the R package of the same name. The library lets you work with multidimensional arrays in both row-major and column-major order.
- ndjson for parsing JSON. This library is used in xtensor automatically if it is present in the project.
- RcppThread for organizing multi-threaded processing of a vector of JSON strings. We used the header files provided by this package. Unlike the more popular RcppParallel, this package has, among other things, a built-in loop interruption mechanism.
It is worth noting that xtensor turned out to be a godsend: besides its extensive functionality and high performance, its developers were very responsive and answered questions promptly and in detail. With their help we managed to implement the conversion of OpenCV matrices into xtensor tensors, as well as a way to combine 3-dimensional image tensors into a 4-dimensional tensor of the correct dimension (the batch itself).
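The idea of combining 3-dimensional image tensors into a 4-dimensional batch can be illustrated in plain R. This is only a sketch of the layout logic under the assumption that all images have the same shape; it is not the xtensor code, and stack_images is a hypothetical helper.

```r
# Stack a list of equally sized (height, width, channels) arrays into one
# (batch, height, width, channels) array using base R only.
stack_images <- function(images) {
  d <- dim(images[[1]])
  n <- length(images)
  # unlist() concatenates the arrays in column-major order, so the image
  # index is the slowest-varying (last) dimension at this point
  stacked <- array(unlist(images), dim = c(d, n))
  # Move the image index to the front
  aperm(stacked, c(4, 1, 2, 3))
}

imgs <- list(array(1:24, dim = c(2, 4, 3)), array(101:124, dim = c(2, 4, 3)))
batch <- stack_images(imgs)
dim(batch)
```

Because R stores arrays in column-major order, the batch index has to be moved explicitly; this is the same ordering question that the row-major and column-major layouts in xtensor address.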
Materials for learning Rcpp, xtensor and RcppThread
To compile files that use system files and dynamic linking with libraries installed on the system, we used the plugin mechanism implemented in the Rcpp package. To find paths and flags automatically, we used the popular Linux utility pkg-config.
Implementation of an Rcpp plugin for using the OpenCV library
Rcpp::registerPlugin("opencv", function() {
  # Possible package names
  pkg_config_name <- c("opencv", "opencv4")
  # pkg-config binary
  pkg_config_bin <- Sys.which("pkg-config")
  # Check that the utility is present in the system
  checkmate::assert_file_exists(pkg_config_bin, access = "x")
  # Check that an OpenCV configuration file for pkg-config is present
  check <- sapply(pkg_config_name,
                  function(pkg) system(paste(pkg_config_bin, pkg)))
  if (all(check != 0)) {
    stop("OpenCV config for the pkg-config not found", call. = FALSE)
  }
  pkg_config_name <- pkg_config_name[check == 0]
  list(env = list(
    PKG_CXXFLAGS = system(paste(pkg_config_bin, "--cflags", pkg_config_name),
                          intern = TRUE),
    PKG_LIBS = system(paste(pkg_config_bin, "--libs", pkg_config_name),
                      intern = TRUE)
  ))
})
As a result of the plugin's operation, the following values are substituted during compilation:
Rcpp:::.plugins$opencv()$env
# $PKG_CXXFLAGS
# [1] "-I/usr/include/opencv"
#
# $PKG_LIBS
# [1] "-lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core"
The implementation code for parsing JSON and generating a batch to pass to the model is given under the spoiler. First, add a local project directory to search for header files (needed for ndjson):
Sys.setenv("PKG_CXXFLAGS" = paste0("-I", normalizePath(file.path("src"))))
Implementation of JSON to tensor conversion in C++
// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::plugins(opencv)]]
// [[Rcpp::depends(xtensor)]]
// [[Rcpp::depends(RcppThread)]]
#include <xtensor/xjson.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-r/rtensor.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <Rcpp.h>
#include <RcppThread.h>
// Type aliases
using RcppThread::parallelFor;
using json = nlohmann::json;
using points = xt::xtensor<double,2>;     // Point coordinates of one stroke extracted from JSON
using strokes = std::vector<points>;      // All strokes of a drawing extracted from JSON
using xtensor3d = xt::xtensor<double, 3>; // Tensor for storing an image matrix
using xtensor4d = xt::xtensor<double, 4>; // Tensor for storing a set of images
using rtensor3d = xt::rtensor<double, 3>; // Wrapper for export to R
using rtensor4d = xt::rtensor<double, 4>; // Wrapper for export to R
// Static constants
// Image size in pixels
const static int SIZE = 256;
// Line type
// See https://en.wikipedia.org/wiki/Pixel_connectivity#2-dimensional
const static int LINE_TYPE = cv::LINE_4;
// Line width in pixels
const static int LINE_WIDTH = 3;
// Resize algorithm
// https://docs.opencv.org/3.1.0/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
const static int RESIZE_TYPE = cv::INTER_LINEAR;
// Template for converting an OpenCV matrix into a tensor
template <typename T, int NCH, typename XT=xt::xtensor<T,3,xt::layout_type::column_major>>
XT to_xt(const cv::Mat_<cv::Vec<T, NCH>>& src) {
  // Shape of the target tensor
  std::vector<int> shape = {src.rows, src.cols, NCH};
  // Total number of elements in the array
  size_t size = src.total() * NCH;
  // Convert cv::Mat to xt::xtensor
  XT res = xt::adapt((T*) src.data, size, xt::no_ownership(), shape);
  return res;
}
// Convert JSON into a list of point coordinates
strokes parse_json(const std::string& x) {
  auto j = json::parse(x);
  // The parsing result must be an array
  if (!j.is_array()) {
    throw std::runtime_error("'x' must be JSON array.");
  }
  strokes res;
  res.reserve(j.size());
  for (const auto& a: j) {
    // Each array element must be a 2-dimensional array
    if (!a.is_array() || a.size() != 2) {
      throw std::runtime_error("'x' must include only 2d arrays.");
    }
    // Extract the vector of points
    auto p = a.get<points>();
    res.push_back(p);
  }
  return res;
}
// Draw the lines
// HSV colors
cv::Mat ocv_draw_lines(const strokes& x, bool color = true) {
  // Source matrix type
  auto stype = color ? CV_8UC3 : CV_8UC1;
  // Final matrix type
  auto dtype = color ? CV_32FC3 : CV_32FC1;
  auto bg = color ? cv::Scalar(0, 0, 255) : cv::Scalar(255);
  auto col = color ? cv::Scalar(0, 255, 220) : cv::Scalar(0);
  cv::Mat img = cv::Mat(SIZE, SIZE, stype, bg);
  // Number of lines
  size_t n = x.size();
  for (const auto& s: x) {
    // Number of points in the line
    size_t n_points = s.shape()[1];
    for (size_t i = 0; i < n_points - 1; ++i) {
      // Start point of the segment
      cv::Point from(s(0, i), s(1, i));
      // End point of the segment
      cv::Point to(s(0, i + 1), s(1, i + 1));
      // Draw the line
      cv::line(img, from, to, col, LINE_WIDTH, LINE_TYPE);
    }
    if (color) {
      // Change the line color
      col[0] += 180 / n;
    }
  }
  if (color) {
    // Convert the color representation to RGB
    cv::cvtColor(img, img, cv::COLOR_HSV2RGB);
  }
  // Convert the representation to float32 with the range [0, 1]
  img.convertTo(img, dtype, 1 / 255.0);
  return img;
}
// Process JSON and obtain a tensor with the image data
xtensor3d process(const std::string& x, double scale = 1.0, bool color = true) {
  auto p = parse_json(x);
  auto img = ocv_draw_lines(p, color);
  if (scale != 1) {
    cv::Mat out;
    cv::resize(img, out, cv::Size(), scale, scale, RESIZE_TYPE);
    cv::swap(img, out);
    out.release();
  }
  xtensor3d arr = color ? to_xt<double,3>(img) : to_xt<double,1>(img);
  return arr;
}
// [[Rcpp::export]]
rtensor3d cpp_process_json_str(const std::string& x,
                               double scale = 1.0,
                               bool color = true) {
  xtensor3d res = process(x, scale, color);
  return res;
}
// [[Rcpp::export]]
rtensor4d cpp_process_json_vector(const std::vector<std::string>& x,
                                  double scale = 1.0,
                                  bool color = false) {
  size_t n = x.size();
  size_t dim = floor(SIZE * scale);
  size_t channels = color ? 3 : 1;
  xtensor4d res({n, dim, dim, channels});
  parallelFor(0, n, [&x, &res, scale, color](int i) {
    xtensor3d tmp = process(x[i], scale, color);
    auto view = xt::view(res, i, xt::all(), xt::all(), xt::all());
    view = tmp;
  });
  return res;
}
This code should be placed in the file src/cv_xt.cpp and compiled with the command Rcpp::sourceCpp(file = "src/cv_xt.cpp", env = .GlobalEnv); the header nlohmann/json.hpp is also required for it to work. The code consists of the following functions:
- to_xt: a template function for converting an image matrix (cv::Mat) into a tensor xt::xtensor;
- parse_json: parses a JSON string, extracts the point coordinates and packs them into a vector;
- ocv_draw_lines: draws multi-colored lines from the resulting vector of points;
- process: combines the functions above and adds the ability to scale the resulting image;
- cpp_process_json_str: a wrapper over process that exports the result to an R object (a multidimensional array);
- cpp_process_json_vector: the counterpart of cpp_process_json_str for a vector of strings, which calls process for each element in multi-threaded mode.
To draw multi-colored lines, the HSV color model was used, followed by conversion to RGB. Let's test the result:
arr <- cpp_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256 3
plot(magick::image_read(arr))
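The stroke coloring can also be looked at in isolation. The sketch below reproduces in R the hue arithmetic of the C++ loop, where `col[0] += 180 / n` is an integer division because n is an unsigned integer; the function name stroke_hues is hypothetical.

```r
# Hue assigned to each stroke: start at 0 and add floor(180 / n) per stroke,
# mirroring the integer division in `col[0] += 180 / n` in the C++ code.
stroke_hues <- function(n_strokes) {
  step <- 180 %/% n_strokes
  (seq_len(n_strokes) - 1) * step
}

stroke_hues(4)
```

For four strokes this gives hues 0, 45, 90 and 135, which stay within the 0 to 179 hue range OpenCV uses for 8-bit HSV images. With more than 180 strokes the step becomes zero, so all strokes would share one hue.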
Comparison of the speed of the R and C++ implementations
res_bench <- bench::mark(
r_process_json_str(tmp_data[4, drawing], scale = 0.5),
cpp_process_json_str(tmp_data[4, drawing], scale = 0.5),
check = FALSE,
min_iterations = 100
)
# Benchmark parameters
cols <- c("expression", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]
# expression min median max `itr/sec` total_time n_itr
# <chr> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:tm> <int>
# 1 r_process_json_str 3.49ms 3.55ms 4.47ms 273. 490ms 134
# 2 cpp_process_json_str 1.94ms 2.02ms 5.32ms 489. 497ms 243
library(ggplot2)
# Run the benchmark
res_bench <- bench::press(
batch_size = 2^(4:10),
{
.data <- tmp_data[sample(seq_len(.N), batch_size), drawing]
bench::mark(
r_process_json_vector(.data, scale = 0.5),
cpp_process_json_vector(.data, scale = 0.5),
min_iterations = 50,
check = FALSE
)
}
)
res_bench[, cols]
# expression batch_size min median max `itr/sec` total_time n_itr
# <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:tm> <int>
# 1 r 16 50.61ms 53.34ms 54.82ms 19.1 471.13ms 9
# 2 cpp 16 4.46ms 5.39ms 7.78ms 192. 474.09ms 91
# 3 r 32 105.7ms 109.74ms 212.26ms 7.69 6.5s 50
# 4 cpp 32 7.76ms 10.97ms 15.23ms 95.6 522.78ms 50
# 5 r 64 211.41ms 226.18ms 332.65ms 3.85 12.99s 50
# 6 cpp 64 25.09ms 27.34ms 32.04ms 36.0 1.39s 50
# 7 r 128 534.5ms 627.92ms 659.08ms 1.61 31.03s 50
# 8 cpp 128 56.37ms 58.46ms 66.03ms 16.9 2.95s 50
# 9 r 256 1.15s 1.18s 1.29s 0.851 58.78s 50
# 10 cpp 256 114.97ms 117.39ms 130.09ms 8.45 5.92s 50
# 11 r 512 2.09s 2.15s 2.32s 0.463 1.8m 50
# 12 cpp 512 230.81ms 235.6ms 261.99ms 4.18 11.97s 50
# 13 r 1024 4s 4.22s 4.4s 0.238 3.5m 50
# 14 cpp 1024 410.48ms 431.43ms 462.44ms 2.33 21.45s 50
ggplot(res_bench, aes(x = factor(batch_size), y = median,
group = expression, color = expression)) +
geom_point() +
geom_line() +
ylab("median time, s") +
theme_minimal() +
scale_color_discrete(name = "", labels = c("cpp", "r")) +
theme(legend.position = "bottom")
As you can see, the speed increase turned out to be very significant, and it is not possible to catch up with the C++ code by parallelizing the R code.
3. Iterators for unloading batches from the database
R has a well-deserved reputation for processing data that fits in RAM, while Python is better known for iterative data processing, which makes it easy and natural to implement out-of-core calculations (calculations that use external memory). A classic and relevant example in the context of the described problem is deep neural networks trained by gradient descent, where the gradient at each step is approximated using a small portion of the observations, a mini-batch.
Deep learning frameworks written in Python have special classes that implement iterators over data: tables, pictures in folders, binary formats and so on. You can use ready-made options or write your own for specific tasks. In R we can take advantage of all the features of the Python library keras with its various backends by using the package of the same name, which in turn works on top of the reticulate package. The latter deserves a separate long article: it not only lets you run Python code from R, but also transfers objects between R and Python sessions, automatically performing all the necessary type conversions.
We got rid of the need to store all the data in RAM by using MonetDBLite, and all the "neural network" work is done by the original Python code. We only have to write an iterator over the data, since there is nothing ready-made for this situation in either R or Python. There are essentially only two requirements for it: it must return batches in an endless loop, and it must keep its state between iterations (the latter is implemented in R in the simplest way, using closures). Previously, R arrays had to be explicitly converted into numpy arrays inside the iterator, but the current version of the keras package does this itself.
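The two requirements can be shown on a toy example that works on a plain vector of ids instead of a database. This is a minimal sketch of the closure pattern only; make_iterator and its arguments are hypothetical names, and at least one complete batch is assumed.

```r
# A minimal closure-based batch iterator: it keeps a counter between calls
# and starts over (reshuffling the ids) once all batches have been returned.
make_iterator <- function(ids, batch_size) {
  n_batches <- length(ids) %/% batch_size   # an incomplete last batch is dropped
  ids <- sample(ids)
  i <- 1
  function() {
    if (i > n_batches) {
      ids <<- sample(ids)                   # new epoch: reshuffle
      i <<- 1
    }
    idx <- ((i - 1) * batch_size + 1):(i * batch_size)
    i <<- i + 1
    ids[idx]
  }
}

set.seed(1)
it <- make_iterator(ids = 1:10, batch_size = 4)
b1 <- it()
b2 <- it()
b3 <- it()   # the third call starts a new epoch
```

The state (the counter and the shuffled ids) lives in the environment of make_iterator, and the `<<-` operator updates it from inside the returned function.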
The iterator for the training and validation data turned out as follows:
Iterator for training and validation data
train_generator <- function(db_connection = con,
                            samples_index,
                            num_classes = 340,
                            batch_size = 32,
                            scale = 1,
                            color = FALSE,
                            imagenet_preproc = FALSE) {
  # Check the arguments
  checkmate::assert_class(db_connection, "DBIConnection")
  checkmate::assert_integerish(samples_index)
  checkmate::assert_count(num_classes)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)
  # Shuffle the row indexes so that batches can be taken in order
  dt <- data.table::data.table(id = sample(samples_index))
  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  # Keep only complete batches and index the table by batch
  dt <- dt[, if (.N == batch_size) .SD, keyby = batch]
  # Set the counter
  i <- 1
  # Number of batches
  max_i <- dt[, max(batch)]
  # Prepare the statement for unloading the data
  sql <- sprintf(
    "PREPARE SELECT drawing, label_int FROM doodles WHERE id IN (%s)",
    paste(rep("?", batch_size), collapse = ",")
  )
  res <- DBI::dbSendQuery(db_connection, sql)
  # Analogue of keras::to_categorical
  to_categorical <- function(x, num) {
    n <- length(x)
    m <- numeric(n * num)
    m[x * n + seq_len(n)] <- 1
    dim(m) <- c(n, num)
    return(m)
  }
  # Closure
  function() {
    # Start a new epoch
    if (i > max_i) {
      dt[, id := sample(id)]
      data.table::setkey(dt, batch)
      # Reset the counter
      i <<- 1
      max_i <<- dt[, max(batch)]
    }
    # IDs of the rows to unload
    batch_ind <- dt[batch == i, id]
    # Unload the data
    batch <- DBI::dbFetch(DBI::dbBind(res, as.list(batch_ind)), n = -1)
    # Increment the counter
    i <<- i + 1
    # Parse the JSON and prepare the array
    batch_x <- cpp_process_json_vector(batch$drawing, scale = scale, color = color)
    if (imagenet_preproc) {
      # Scale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    batch_y <- to_categorical(batch$label_int, num_classes)
    result <- list(batch_x, batch_y)
    return(result)
  }
}
The function takes as input a variable with the database connection, the numbers of the rows to use, the number of classes, the batch size, the scale (scale = 1 corresponds to rendering images of 256x256 pixels, scale = 0.5 to 128x128 pixels), a color indicator (color = FALSE specifies rendering in grayscale, while with color = TRUE each stroke is drawn in a new color) and a preprocessing indicator for networks pre-trained on imagenet. The latter is needed to scale pixel values from the interval [0, 1] to the interval [-1, 1], which was used when training the supplied keras models.
The outer function contains argument type checks, a data.table with randomly shuffled row numbers from samples_index and batch numbers, a counter and the maximum number of batches, and the SQL expression for unloading data from the database. In addition, we defined a fast analogue of keras::to_categorical() inside it. We used almost all the data for training and left half a percent for validation, so the epoch size was limited by the steps_per_epoch parameter when calling keras::fit_generator(), and the condition if (i > max_i) only ever fired for the validation iterator.
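The index arithmetic behind the fast to_categorical analogue is easy to miss, so here is the same function run on a tiny input. The body follows the generator code; only the explanatory comment and the example labels are added.

```r
# Same index trick as in the generator: labels are 0-based class numbers.
to_categorical <- function(x, num) {
  n <- length(x)
  m <- numeric(n * num)
  # In column-major storage element (row r, column c) sits at (c - 1) * n + r,
  # so row r with 0-based label x[r] maps to position x[r] * n + r.
  m[x * n + seq_len(n)] <- 1
  dim(m) <- c(n, num)
  m
}

res <- to_categorical(c(0, 2, 1), num = 3)
print(res)
```

Each row of the result contains a single 1 in the column that corresponds to its label, and the whole matrix is filled with one vectorized assignment instead of a loop.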
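The inner function retrieves the row indexes for the next batch, unloads the records from the database while incrementing the batch counter, parses the JSON (the function cpp_process_json_vector(), written in C++) and creates the arrays that correspond to the pictures. Then one-hot vectors with class labels are created, and the arrays with pixel values and labels are combined into a list, which is the return value. To speed things up we used index creation in data.table tables and modification by reference; without these features of the data.table package it is quite hard to imagine working effectively with any significant amount of data in R.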
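For readers less familiar with data.table syntax, the batch numbering and the removal of an incomplete last batch can be sketched in base R. This is an illustrative equivalent under assumed toy sizes (10 ids, batches of 4), not the code used in the generator.

```r
# Assign consecutive batch numbers to shuffled row ids and keep only
# complete batches (a base R counterpart of the data.table expressions).
batch_size <- 4L
ids <- sample(1:10)
batch <- (seq_along(ids) - 1L) %/% batch_size + 1L
# tabulate() counts the rows per batch number; keep batches of full size only
full <- batch %in% which(tabulate(batch) == batch_size)
batches <- split(ids[full], batch[full])
lengths(batches)
```

With 10 ids and a batch size of 4, the two leftover ids form an incomplete third batch and are dropped, so every batch has the fixed size that the prepared statement expects.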
The results of speed measurements on a Core i5 laptop are as follows:
Iterator benchmark
library(Rcpp)
library(keras)
library(ggplot2)
source("utils/rcpp.R")
source("utils/keras_iterator.R")
con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
ind <- seq_len(DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]])
num_classes <- DBI::dbGetQuery(con, "SELECT max(label_int) + 1 FROM doodles")[[1L]]
# Indexes for the training set
train_ind <- sample(ind, floor(length(ind) * 0.995))
# Indexes for the validation set
val_ind <- ind[-train_ind]
rm(ind)
# Scale factor
scale <- 0.5
# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    it1 <- train_generator(
      db_connection = con,
      samples_index = train_ind,
      num_classes = num_classes,
      batch_size = batch_size,
      scale = scale
    )
    bench::mark(
      it1(),
      min_iterations = 50L
    )
  }
)
# Benchmark parameters
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]
# batch_size min median max `itr/sec` total_time n_itr
# <dbl> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:tm> <int>
# 1 16 25ms 64.36ms 92.2ms 15.9 3.09s 49
# 2 32 48.4ms 118.13ms 197.24ms 8.17 5.88s 48
# 3 64 69.3ms 117.93ms 181.14ms 8.57 5.83s 50
# 4 128 157.2ms 240.74ms 503.87ms 3.85 12.71s 49
# 5 256 359.3ms 613.52ms 988.73ms 1.54 30.5s 47
# 6 512 884.7ms 1.53s 2.07s 0.674 1.11m 45
# 7 1024 2.7s 3.83s 5.47s 0.261 2.81m 44
ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
geom_point() +
geom_line() +
ylab("median time, s") +
theme_minimal()
DBI::dbDisconnect(con, shutdown = TRUE)
If you have enough RAM, you can seriously speed up the database by moving it into that same RAM (32 GB is enough for our task). On Linux, the /dev/shm partition is mounted by default and occupies up to half of the RAM capacity. You can allocate more by editing /etc/fstab to get a record like tmpfs /dev/shm tmpfs defaults,size=25g 0 0. Be sure to reboot and check the result by running the command df -h.
The iterator for the test data looks much simpler, since the test dataset fits entirely into RAM:
Iterator for test data
test_generator <- function(dt,
                           batch_size = 32,
                           scale = 1,
                           color = FALSE,
                           imagenet_preproc = FALSE) {
  # Check the arguments
  checkmate::assert_data_table(dt)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)
  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  data.table::setkey(dt, batch)
  i <- 1
  max_i <- dt[, max(batch)]
  # Closure
  function() {
    batch_x <- cpp_process_json_vector(dt[batch == i, drawing],
                                       scale = scale, color = color)
    if (imagenet_preproc) {
      # Scale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    result <- list(batch_x)
    i <<- i + 1
    return(result)
  }
}
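4. Choosing a model architecture
The first architecture we used was mobilenet v1. In the keras package for R its input tensor always has to have the dimension (batch, height, width, 3), that is, the number of channels cannot be changed. Python has no such limitation, so we hurried to write our own implementation of this architecture following the original article (without the dropout that is present in the keras version):
Mobilenet v1 architecture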
library(keras)
top_3_categorical_accuracy <- custom_metric(
name = "top_3_categorical_accuracy",
metric_fn = function(y_true, y_pred) {
metric_top_k_categorical_accuracy(y_true, y_pred, k = 3)
}
)
layer_sep_conv_bn <- function(object,
filters,
alpha = 1,
depth_multiplier = 1,
strides = c(2, 2)) {
# NB! depth_multiplier != resolution multiplier
# https://github.com/keras-team/keras/issues/10349
layer_depthwise_conv_2d(
object = object,
kernel_size = c(3, 3),
strides = strides,
padding = "same",
depth_multiplier = depth_multiplier
) %>%
layer_batch_normalization() %>%
layer_activation_relu() %>%
layer_conv_2d(
filters = filters * alpha,
kernel_size = c(1, 1),
strides = c(1, 1)
) %>%
layer_batch_normalization() %>%
layer_activation_relu()
}
get_mobilenet_v1 <- function(input_shape = c(224, 224, 1),
num_classes = 340,
alpha = 1,
depth_multiplier = 1,
optimizer = optimizer_adam(lr = 0.002),
loss = "categorical_crossentropy",
metrics = c("categorical_crossentropy",
top_3_categorical_accuracy)) {
inputs <- layer_input(shape = input_shape)
outputs <- inputs %>%
layer_conv_2d(filters = 32, kernel_size = c(3, 3), strides = c(2, 2), padding = "same") %>%
layer_batch_normalization() %>%
layer_activation_relu() %>%
layer_sep_conv_bn(filters = 64, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 128, strides = c(2, 2)) %>%
layer_sep_conv_bn(filters = 128, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 256, strides = c(2, 2)) %>%
layer_sep_conv_bn(filters = 256, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 512, strides = c(2, 2)) %>%
layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
layer_sep_conv_bn(filters = 1024, strides = c(2, 2)) %>%
layer_sep_conv_bn(filters = 1024, strides = c(1, 1)) %>%
layer_global_average_pooling_2d() %>%
layer_dense(units = num_classes) %>%
layer_activation_softmax()
model <- keras_model(
inputs = inputs,
outputs = outputs
)
model %>% compile(
optimizer = optimizer,
loss = loss,
metrics = metrics
)
return(model)
}
The disadvantages of this approach are obvious. We wanted to test a lot of models, but did not want to rewrite each architecture by hand. We were also deprived of the opportunity to use the weights of models pre-trained on imagenet. As usual, studying the documentation helped. The function get_config() returns a description of the model in a form suitable for editing (base_model_conf$layers is a regular R list), and the function from_config() performs the reverse conversion into a model object:
base_model_conf <- get_config(base_model)
base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
base_model <- from_config(base_model_conf)
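To see why index 4 addresses the channel axis, here is a minimal sketch on a hand-made list. The object `mock_conf` and its field values are assumptions that only mimic the nesting of a Keras config; they are not real output of get_config().

```r
# Hypothetical stand-in for the list returned by get_config():
# the input shape is stored as (batch, height, width, channels).
mock_conf <- list(
  layers = list(
    list(
      class_name = "InputLayer",
      config = list(
        batch_input_shape = list(NULL, 224L, 224L, 3L),
        name = "input_1"
      )
    )
  )
)

# The fourth element is the number of channels; switch it from 3 to 1.
mock_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L

str(mock_conf$layers[[1]]$config$batch_input_shape)
```

The batch dimension stays NULL (unspecified), so only the channel count changes.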
Now it is not difficult to write a universal function that returns any of the models supplied with keras, with or without weights trained on imagenet:
Function for loading ready-made architectures
get_model <- function(name = "mobilenet_v2",
                      input_shape = NULL,
                      weights = "imagenet",
                      pooling = "avg",
                      num_classes = NULL,
                      optimizer = keras::optimizer_adam(lr = 0.002),
                      loss = "categorical_crossentropy",
                      metrics = NULL,
                      color = TRUE,
                      compile = FALSE) {
  # Check the arguments
  checkmate::assert_string(name)
  checkmate::assert_integerish(input_shape, lower = 1, upper = 256, len = 3)
  checkmate::assert_count(num_classes)
  checkmate::assert_flag(color)
  checkmate::assert_flag(compile)
  # Get the object from the keras package
  model_fun <- get0(paste0("application_", name), envir = asNamespace("keras"))
  # Check that the object exists in the package
  if (is.null(model_fun)) {
    stop("Model ", shQuote(name), " not found.", call. = FALSE)
  }
  base_model <- model_fun(
    input_shape = input_shape,
    include_top = FALSE,
    weights = weights,
    pooling = pooling
  )
  # If the image is not in color, change the input dimensionality
  if (!color) {
    base_model_conf <- keras::get_config(base_model)
    base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
    base_model <- keras::from_config(base_model_conf)
  }
  predictions <- keras::get_layer(base_model, "global_average_pooling2d_1")$output
  predictions <- keras::layer_dense(predictions, units = num_classes, activation = "softmax")
  model <- keras::keras_model(
    inputs = base_model$input,
    outputs = predictions
  )
  if (compile) {
    keras::compile(
      object = model,
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )
  }
  return(model)
}
When single-channel images are used, no pretrained weights are used. This could be fixed: use get_weights() to obtain the model weights as a list of R arrays, change the dimensionality of the first element of that list (by taking one color channel or by averaging all three), and then load the weights back into the model with set_weights(). We never added this functionality, because by this stage it was already clear that working with color images was more productive.
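The channel-collapsing step described above can be sketched in base R. This is an illustration only, not code from the project: the helper name `rgb_kernel_to_gray` is hypothetical, and the kernel is assumed to be laid out as (height, width, input channels, filters), which is how TensorFlow stores 2D convolution kernels.

```r
# Collapse the RGB axis of a first-layer convolution kernel to one channel.
# `kernel` is assumed to be a 4D array (height, width, 3, filters), such as
# an element of the list produced by keras::get_weights().
rgb_kernel_to_gray <- function(kernel, method = c("mean", "first")) {
  method <- match.arg(method)
  stopifnot(length(dim(kernel)) == 4L)
  d <- dim(kernel)
  gray <- if (method == "mean") {
    # Average over the channel axis (third dimension).
    apply(kernel, c(1, 2, 4), mean)
  } else {
    # Keep only the first color channel.
    kernel[, , 1, ]
  }
  # Restore the singleton channel axis: (height, width, 1, filters).
  array(gray, dim = c(d[1], d[2], 1L, d[4]))
}

# Usage on random data standing in for real weights.
set.seed(1)
k <- array(rnorm(3 * 3 * 3 * 32), dim = c(3, 3, 3, 32))
k_gray <- rgb_kernel_to_gray(k)
dim(k_gray)
```

With a real model, the converted array would replace the matching element of the list from get_weights() before calling set_weights(); which element holds the first convolution kernel depends on the architecture.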
We ran most experiments with mobilenet versions 1 and 2 and with resnet34. More modern architectures such as SE-ResNeXt also performed well in this competition. Unfortunately, we had no ready-made implementations at our disposal, and we did not write our own (but we definitely will).
5. Parameterization of scripts
For convenience, all the code for starting training was organized as a single script, parameterized with docopt as follows:
doc <- '
Usage:
train_nn.R --help
train_nn.R --list-models
train_nn.R [options]
Options:
-h --help Show this message.
-l --list-models List available models.
-m --model=<model> Neural network model name [default: mobilenet_v2].
-b --batch-size=<size> Batch size [default: 32].
-s --scale-factor=<ratio> Scale factor [default: 0.5].
-c --color Use color lines [default: FALSE].
-d --db-dir=<path> Path to database directory [default: Sys.getenv("db_dir")].
-r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
-n --n-gpu=<number> Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)
The docopt package is an implementation of docopt for R. With it, scripts are launched with simple commands such as Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db or ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db if the file train_nn.R is executable (this command starts training the resnet50 model on three-color images of 128x128 pixels; the database must be located in the folder /home/andrey/doodle_db). The learning rate, the optimizer type, and any other configurable parameters can be added to the list. While preparing this publication, it turned out that the mobilenet_v2 architecture from the current version of keras cannot be used in R.
This approach made it possible to speed up experiments with different models considerably compared with the more traditional launching of scripts in RStudio (using a package is a possible alternative).
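Option values read from the command line generally arrive as character strings, so a script like this usually converts and validates them before use. The following is a minimal sketch under stated assumptions: `raw_args` is a hand-made stand-in for the parsed list, its element names are illustrative, the helper `coerce_args` is hypothetical, and the 256-pixel source canvas is an assumption inferred from the 128x128 example above with the default scale factor of 0.5.

```r
# Hypothetical stand-in for a parsed argument list; names are illustrative.
raw_args <- list(
  model = "resnet50",
  batch_size = "32",
  scale_factor = "0.5",
  color = TRUE,
  n_gpu = "1",
  db_dir = ""
)

# Convert string options to proper types and check their ranges.
coerce_args <- function(a, default_db = Sys.getenv("DBDIR")) {
  a$batch_size   <- as.integer(a$batch_size)
  a$n_gpu        <- as.integer(a$n_gpu)
  a$scale_factor <- as.numeric(a$scale_factor)
  stopifnot(
    a$batch_size > 0L,
    a$n_gpu > 0L,
    a$scale_factor > 0,
    a$scale_factor <= 1
  )
  # Fall back to the environment when no directory was supplied.
  if (!nzchar(a$db_dir)) a$db_dir <- default_db
  # Image side in pixels, assuming a 256-pixel source canvas.
  a$dim_size <- as.integer(256 * a$scale_factor)
  a
}

args <- coerce_args(raw_args)
str(args[c("batch_size", "scale_factor", "dim_size")])
```

Resolving the database directory in code rather than in the usage text keeps the default tied to the environment at run time.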
6. Dockerization of scripts
We used Docker to make the environment portable, both for training models across team members and for rapid deployment in the cloud. This tool is relatively unusual for R programmers.
Docker lets you build your own images from scratch or use other images as a base for your own. After analyzing the available options, we concluded that installing the NVIDIA drivers, CUDA+cuDNN, and the Python libraries accounts for a sizable part of the image, so we took the official image tensorflow/tensorflow:1.12.0-gpu as the base and added the required R packages to it.
The final Dockerfile looked like this:
Dockerfile
FROM tensorflow/tensorflow:1.12.0-gpu
MAINTAINER Artem Klevtsov <[email protected]>
SHELL ["/bin/bash", "-c"]
ARG LOCALE="en_US.UTF-8"
ARG APT_PKG="libopencv-dev r-base r-base-dev littler"
ARG R_BIN_PKG="futile.logger checkmate data.table rcpp rapidjsonr dbi keras jsonlite curl digest remotes"
ARG R_SRC_PKG="xtensor RcppThread docopt MonetDBLite"
ARG PY_PIP_PKG="keras"
ARG DIRS="/db /app /app/data /app/models /app/logs"
RUN source /etc/os-release && \
    echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${UBUNTU_CODENAME}-cran35/" > /etc/apt/sources.list.d/cran35.list && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 && \
    add-apt-repository -y ppa:marutter/c2d4u3.5 && \
    add-apt-repository -y ppa:timsc/opencv-3.4 && \
    apt-get update && \
    apt-get install -y locales && \
    locale-gen ${LOCALE} && \
    apt-get install -y --no-install-recommends ${APT_PKG} && \
    ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r && \
    ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
    ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r && \
    echo 'options(Ncpus = parallel::detectCores())' >> /etc/R/Rprofile.site && \
    echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site && \
    apt-get install -y $(printf "r-cran-%s " ${R_BIN_PKG}) && \
    install.r ${R_SRC_PKG} && \
    pip install ${PY_PIP_PKG} && \
    mkdir -p ${DIRS} && \
    chmod 777 ${DIRS} && \
    rm -rf /tmp/downloaded_packages/ /tmp/*.rds && \
    rm -rf /var/lib/apt/lists/*
COPY utils /app/utils
COPY src /app/src
COPY tests /app/tests
COPY bin/*.R /app/
ENV DBDIR="/db"
ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/app:${PATH}"
WORKDIR /app
VOLUME /db
VOLUME /app
CMD bash
䟿å®äžã䜿çšãããããã±ãŒãžã¯å€æ°ã«å
¥ããããŸããã æžãããã¹ã¯ãªããã®å€§éšåã¯ãã¢ã»ã³ããªäžã«ã³ã³ããå
ã«ã³ããŒãããŸãã ã³ãã³ãã·ã§ã«ã次ã®ããã«å€æŽããŸããã /bin/bash
ã³ã³ãã³ãã䜿ããããããããã« /etc/os-release
ã ããã«ãããã³ãŒãå
㧠OS ããŒãžã§ã³ãæå®ããå¿
èŠããªããªããŸããã
ããã«ãããŸããŸãªã³ãã³ãã䜿çšããŠã³ã³ãããŒãèµ·åã§ããããã«ããå°ã㪠bash ã¹ã¯ãªãããäœæãããŸããã ããšãã°ããããã¯ãã³ã³ããå ã«ä»¥åã«é 眮ãããŠãããã¥ãŒã©ã« ãããã¯ãŒã¯ããã¬ãŒãã³ã°ããããã®ã¹ã¯ãªããããŸãã¯ã³ã³ããã®åäœããããã°ããã³ç£èŠããããã®ã³ãã³ã ã·ã§ã«ã§ããå¯èœæ§ããããŸãã
ã³ã³ãããèµ·åããã¹ã¯ãªãã
#!/bin/sh
DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"
if [ -z "$1" ]; then
    CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
    ARGS="${ARGS} -ti"
else
    CMD="Rscript /app/train_nn.R $@"
fi
docker run ${ARGS} doodles-tf ${CMD}
If this bash script is run without parameters, the script train_nn.R is called inside the container with default values. If the first positional argument is "bash", the container starts interactively with a command shell. In all other cases the values of the positional arguments are substituted: CMD="Rscript /app/train_nn.R $@".
Note that the directories with the source data and the database, as well as the directory for saving trained models, are mounted inside the container from the host system, which gives access to the script results without unnecessary manipulation.
7. Using multiple GPUs on Google Cloud
One feature of the competition was very noisy data (see the title image, borrowed from @Leigh.plt in the ODS Slack). Large batches help to cope with this, so after experiments on a PC with a single GPU we decided to master training models on several GPUs in the cloud. We used Google Cloud (dev/shm).
Of greatest interest is the code fragment responsible for using multiple GPUs. First, the model is created on the CPU using a context manager, just as in Python:
with(tensorflow::tf$device("/cpu:0"), {
  model_cpu <- get_model(
    name = model_name,
    input_shape = input_shape,
    weights = weights,
    metrics = c(top_3_categorical_accuracy),
    compile = FALSE
  )
})
Then the uncompiled (this is important) model is copied to the given number of available GPUs, and only after that is it compiled:
model <- keras::multi_gpu_model(model_cpu, gpus = n_gpu)
keras::compile(
  object = model,
  optimizer = keras::optimizer_adam(lr = 0.0004),
  loss = "categorical_crossentropy",
  metrics = c(top_3_categorical_accuracy)
)
We could not implement for multiple GPUs the classic technique of freezing all layers except the last one, training the last layer, and then unfreezing and retraining the entire model.
Training was monitored without tensorboard; we limited ourselves to recording logs and saving models with informative names after each epoch:
Callbacks
# Log file name template
log_file_tmpl <- file.path("logs", sprintf(
  "%s_%d_%dch_%s.csv",
  model_name,
  dim_size,
  channels,
  format(Sys.time(), "%Y%m%d%H%M%OS")
))
# Model file name template
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name,
  dim_size,
  channels
))
callbacks_list <- list(
  keras::callback_csv_logger(
    filename = log_file_tmpl
  ),
  keras::callback_early_stopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 8,
    verbose = 1,
    mode = "min"
  ),
  keras::callback_reduce_lr_on_plateau(
    monitor = "val_loss",
    factor = 0.5, # halve the learning rate
    patience = 4,
    verbose = 1,
    min_delta = 1e-4,
    mode = "min"
  ),
  keras::callback_model_checkpoint(
    filepath = model_file_tmpl,
    monitor = "val_loss",
    save_best_only = FALSE,
    save_weights_only = FALSE,
    mode = "min"
  )
)
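The placeholders {epoch:02d} and {val_loss:.2f} in the model file template are filled in by Keras when each checkpoint is written. As a rough illustration of the resulting names, the sketch below emulates that substitution in base R. The helper `expand_checkpoint_name` is hypothetical and the values of `model_name`, `dim_size`, `channels`, epoch and loss are made up for the example.

```r
# Illustrative values; in the training script these come from the arguments.
model_name <- "mobilenet_v2"
dim_size <- 128L
channels <- 3L

tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name, dim_size, channels
))

# Emulate the placeholder substitution for one epoch (hypothetical helper).
expand_checkpoint_name <- function(tmpl, epoch, val_loss) {
  out <- sub("{epoch:02d}", sprintf("%02d", epoch), tmpl, fixed = TRUE)
  sub("{val_loss:.2f}", sprintf("%.2f", val_loss), out, fixed = TRUE)
}

expand_checkpoint_name(tmpl, epoch = 7L, val_loss = 0.9142)
# "models/mobilenet_v2_128_3ch_07_0.91.h5"
```

Encoding the architecture, image size, channel count, epoch and validation loss in the name makes it possible to pick a checkpoint without opening the logs.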
8. Instead of a conclusion
A number of the problems we encountered remain unsolved:
- keras has no ready-made function for automatically finding the optimal learning rate (an analogue of lr_finder in the fast.ai library); with some effort, a third-party implementation can be ported to R;
- as a consequence of the previous point, we could not choose the right learning rate when using multiple GPUs;
- modern neural network architectures are lacking, especially ones pretrained on imagenet;
- there is no one-cycle policy or discriminative learning rates (cosine annealing was implemented at our request; thank you, skeydan).
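For reference, cosine annealing lowers the learning rate from a maximum to a minimum along half a cosine period, lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). The base R sketch below shows the formula only. The function and argument names are illustrative and this is not the implementation that was added to keras.

```r
# Cosine annealing schedule (illustrative names, formula only).
# step:        current step or epoch, from 0 to total_steps
# total_steps: length of the annealing period
cosine_annealing <- function(step, total_steps, lr_max, lr_min = 0) {
  stopifnot(total_steps > 0, all(step >= 0), all(step <= total_steps))
  lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps))
}

# Learning rates for a 10-epoch run starting from 0.002.
lrs <- cosine_annealing(0:10, total_steps = 10, lr_max = 0.002)
round(lrs, 5)
```

The schedule starts at lr_max, passes through the midpoint of the range halfway through the run, and ends at lr_min.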
Useful things we learned from this competition:
- Even on relatively low-powered hardware you can work painlessly with decent volumes of data (many times the size of RAM). The data.table package saves memory by modifying tables in place, which avoids copying them, and when used correctly its capabilities almost always show the highest speed among all the tools we know of for scripting languages. Storing data in a database means that in many cases you do not have to think at all about squeezing the entire dataset into RAM.
- Slow functions in R can be replaced with fast ones in C++ using the Rcpp package. If you additionally use RcppThread or RcppParallel, you get cross-platform multithreaded implementations, so there is no need to parallelize the code at the R level.
- The Rcpp package can be used without deep knowledge of C++. Header files for many excellent C++ libraries, such as xtensor, are available as R packages, so an infrastructure is taking shape for projects that integrate ready-made high-performance C++ code into R. Syntax highlighting and a static C++ code analyzer in RStudio are an additional convenience.
- docopt lets you run self-contained scripts with parameters. This is convenient on a remote server, including under Docker. In RStudio it is inconvenient to run experiments that train neural networks for many hours, and installing the IDE on the server itself is not always justified.
- Docker ensures code portability and reproducibility of results between developers with different versions of the OS and libraries, as well as ease of running on servers. The whole training pipeline can be launched with a single command.
- Google Cloud is a budget-friendly way to experiment on expensive hardware, but configurations must be chosen carefully.
- Measuring the speed of individual code fragments is very useful, especially when combining R and C++, and with the bench package it is also very easy.
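As a small illustration of the last point, the sketch below times two equivalent implementations of a running sum. It uses base R system.time() as a stand-in rather than the bench package, and both functions are made up for the example.

```r
# Running sum written as an explicit R loop (illustrative only).
running_sum_loop <- function(x) {
  out <- numeric(length(x))
  acc <- 0
  for (i in seq_along(x)) {
    acc <- acc + x[i]
    out[i] <- acc
  }
  out
}

set.seed(42)
x <- runif(1e6)

# Time both variants and keep their results for comparison.
t_loop <- system.time(res_loop <- running_sum_loop(x))
t_vec  <- system.time(res_vec <- cumsum(x))

# Always confirm that the variants agree before comparing speed.
stopifnot(isTRUE(all.equal(res_loop, res_vec)))

c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]])
```

Checking that the results agree before reading the timings guards against "optimizing" a function into a different one.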
Overall, this experience was very rewarding, and we continue working to resolve some of the issues raised.
Source: habr.com