Quick Draw Doodle Recognition: How to Make Friends with R, C++ and Neural Networks


Hi, Habr!

Last fall Kaggle hosted Quick Draw Doodle Recognition, a competition on classifying hand-drawn pictures, in which, among others, a team of R users took part: Artem Klevtsov, Philipp Upravitelev and Andrey Ogurtsov. We will not describe the competition in detail; that has already been done in a recent publication.

This time medal farming did not work out, but we gained a lot of valuable experience, so I would like to tell the community about some of the most interesting and useful things, both for Kaggle and for everyday work. Among the topics covered: the hard life without OpenCV, JSON parsing (these examples look at integrating C++ code into R scripts or packages using Rcpp), script parameterization and dockerization of the final solution. All the code from this post, in a form suitable for execution, is available in the repository.

Contents:

  1. Efficiently loading data from CSV into MonetDB
  2. Preparing batches
  3. Iterators for unloading batches from the database
  4. Choosing a model architecture
  5. Script parameterization
  6. Dockerizing the scripts
  7. Using multiple GPUs on Google Cloud
  8. Instead of a conclusion

1. Efficiently loading data from CSV into the MonetDB database

The data in this competition are provided not as ready-made images but as 340 CSV files (one file per class) containing JSONs with point coordinates. Connecting these points with lines gives a final image of 256x256 pixels. Each record also has a label indicating whether the picture was correctly recognized by the classifier used when the dataset was collected, a two-letter code of the author's country of residence, a unique identifier, a timestamp, and a class name that matches the file name. The simplified version of the original data weighs 7.4 GB as an archive and about 20 GB after unpacking; the full data take up 240 GB after unpacking. The organizers made sure that both versions reproduce the same drawings, so the full version is redundant. In any case, storing 50 million images as graphic files or as arrays was immediately judged impractical, so we decided to merge all the CSV files from the train_simplified.zip archive into a database and then generate images of the required size on the fly for each batch.

The well-proven MonetDB was chosen as the DBMS, namely its implementation for R in the form of the MonetDBLite package. The package includes an embedded version of the database server and lets you start the server directly from an R session and work with it there. Creating the database and connecting to it are done with a single command:

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

We will need to create two tables: one for all the data, the other for service information about the files already loaded (useful if something goes wrong and the process has to be resumed after several files have been loaded):

Creating tables

if (!DBI::dbExistsTable(con, "doodles")) {
  DBI::dbCreateTable(
    con = con,
    name = "doodles",
    fields = c(
      "countrycode" = "char(2)",
      "drawing" = "text",
      "key_id" = "bigint",
      "recognized" = "bool",
      "timestamp" = "timestamp",
      "word" = "text"
    )
  )
}

if (!DBI::dbExistsTable(con, "upload_log")) {
  DBI::dbCreateTable(
    con = con,
    name = "upload_log",
    fields = c(
      "id" = "serial",
      "file_name" = "text UNIQUE",
      "uploaded" = "bool DEFAULT false"
    )
  )
}

The fastest way to load data into the database turned out to be copying the CSV files directly with the SQL command COPY OFFSET 2 INTO tablename FROM path USING DELIMITERS ',','\n','"' NULL AS '' BEST EFFORT, where tablename is the table name and path is the path to the file. While working with the archive we found that the built-in unzip implementation in R does not work correctly with a number of files from the archive, so we used the system unzip (via the getOption("unzip") parameter).

Function for writing to the database

#' @title Extract and load files
#'
#' @description
#' Extract CSV files from a ZIP archive and load them into the database
#'
#' @param con Database connection object (class `MonetDBEmbeddedConnection`).
#' @param tablename Name of the table in the database.
#' @param zipfile Path to the ZIP archive.
#' @param filename Name of the file inside the ZIP archive.
#' @param preprocess Preprocessing function applied to the extracted file.
#'   Must take a single argument `data` (a `data.table` object).
#'
#' @return `TRUE`.
#'
upload_file <- function(con, tablename, zipfile, filename, preprocess = NULL) {
  # Check the arguments
  checkmate::assert_class(con, "MonetDBEmbeddedConnection")
  checkmate::assert_string(tablename)
  checkmate::assert_string(filename)
  checkmate::assert_true(DBI::dbExistsTable(con, tablename))
  checkmate::assert_file_exists(zipfile, access = "r", extension = "zip")
  checkmate::assert_function(preprocess, args = c("data"), null.ok = TRUE)

  # Extract the file
  path <- file.path(tempdir(), filename)
  unzip(zipfile, files = filename, exdir = tempdir(), 
        junkpaths = TRUE, unzip = getOption("unzip"))
  on.exit(unlink(file.path(path)))

  # Apply the preprocessing function
  if (!is.null(preprocess)) {
    .data <- data.table::fread(file = path)
    .data <- preprocess(data = .data)
    data.table::fwrite(x = .data, file = path, append = FALSE)
    rm(.data)
  }

  # SQL query that imports the CSV into the database
  sql <- sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT",
    tablename, path
  )
  # Execute the query
  DBI::dbExecute(con, sql)

  # Record the successful load in the service table
  DBI::dbExecute(con, sprintf("INSERT INTO upload_log(file_name, uploaded) VALUES('%s', true)",
                              filename))

  return(invisible(TRUE))
}

If the table needs to be transformed before it is written to the database, it is enough to pass a function that transforms the data in the preprocess argument.
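To make the preprocess contract more concrete, here is a minimal sketch of such a function. The name keep_recognized and the idea of dropping unrecognized drawings are assumptions made purely for illustration, not something our pipeline did; the only requirement taken from upload_file is a single argument named data.

```r
# Hypothetical preprocessing function (illustration only): keep the rows
# whose `recognized` flag is set. It takes a single argument named `data`,
# as upload_file() requires, and returns the filtered table.
keep_recognized <- function(data) {
  # Tolerate both logical values and "True"/"False" strings
  keep <- as.character(data$recognized) %in% c("TRUE", "True", "true")
  data[keep, ]
}

# Toy input standing in for one CSV file read from the archive
toy <- data.frame(
  key_id = c(1, 2, 3),
  recognized = c(TRUE, FALSE, TRUE),
  word = "cat",
  stringsAsFactors = FALSE
)
filtered <- keep_recognized(toy)
```

Such a function would be passed as preprocess = keep_recognized in the call to upload_file.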

Code for sequentially loading the data into the database:

Writing data to the database

# List of files to write
files <- unzip(zipfile, list = TRUE)$Name

# List of exclusions, in case some of the files have already been loaded
to_skip <- DBI::dbGetQuery(con, "SELECT file_name FROM upload_log")[[1L]]
files <- setdiff(files, to_skip)

if (length(files) > 0L) {
  # Start the timer
  tictoc::tic()
  # Progress bar
  pb <- txtProgressBar(min = 0L, max = length(files), style = 3)
  for (i in seq_along(files)) {
    upload_file(con = con, tablename = "doodles", 
                zipfile = zipfile, filename = files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Stop the timer
  tictoc::toc()
}

# 526.141 sec elapsed - copying SSD->SSD
# 558.879 sec elapsed - copying USB->SSD

The data loading time may vary depending on the speed of the drive used. In our case, reading and writing within a single SSD, or from a flash drive (source file) to an SSD (database), takes less than 10 minutes.

It takes a few more seconds to create a column with an integer class label and an index column (ORDERED INDEX) with the row numbers by which observations will be sampled when batches are created:

Creating additional columns and an index

message("Generate labels")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD label_int int"))
invisible(DBI::dbExecute(con, "UPDATE doodles SET label_int = dense_rank() OVER (ORDER BY word) - 1"))

message("Generate row numbers")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD id serial"))
invisible(DBI::dbExecute(con, "CREATE ORDERED INDEX doodles_id_ord_idx ON doodles(id)"))

To create batches on the fly, we needed to extract random rows from the doodles table as fast as possible. For this we used 3 tricks. The first was to reduce the size of the type that stores the observation ID. In the original dataset the ID requires the bigint type, but the number of observations makes it possible to fit their identifiers, equal to the row's ordinal number, into the int type. Lookups are much faster in this case. The second trick was to use ORDERED INDEX; we arrived at this decision empirically, after going through all the available options. The third was to use parameterized queries. The idea is to execute the PREPARE command once and then reuse the prepared expression for a series of queries of the same type, but in practice the gain over a plain SELECT turned out to be within the statistical error.

The data loading process consumes no more than 450 MB of RAM. In other words, the described approach lets you handle datasets weighing tens of gigabytes on almost any budget hardware, including some single-board devices, which is pretty cool.

All that remains is to measure the speed of retrieving (random) data and to evaluate how it scales when sampling batches of different sizes:

Database benchmark

library(ggplot2)

set.seed(0)
# Connect to the database
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
# Number of rows to sample ids from; `n` was not defined in the original
# snippet, so taking it from the table size is an assumption
n <- DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]]

# Prepare the query on the server side
prep_sql <- function(batch_size) {
  sql <- sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)",
                 paste(rep("?", batch_size), collapse = ","))
  res <- DBI::dbSendQuery(con, sql)
  return(res)
}

# Fetch the data for a random set of ids
fetch_data <- function(rs, batch_size) {
  ids <- sample(seq_len(n), batch_size)
  res <- DBI::dbFetch(DBI::dbBind(rs, as.list(ids)))
  return(res)
}

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    rs <- prep_sql(batch_size)
    bench::mark(
      fetch_data(rs, batch_size),
      min_iterations = 50L
    )
  }
)
# Benchmark parameters
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16   23.6ms  54.02ms  93.43ms     18.8        2.6s    49
# 2         32     38ms  84.83ms 151.55ms     11.4       4.29s    49
# 3         64   63.3ms 175.54ms 248.94ms     5.85       8.54s    50
# 4        128   83.2ms 341.52ms 496.24ms     3.00      16.69s    50
# 5        256  232.8ms 653.21ms 847.44ms     1.58      31.66s    50
# 6        512  784.6ms    1.41s    1.98s     0.740       1.1m    49
# 7       1024  681.7ms    2.72s    4.06s     0.377      2.16m    49

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)
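To read the scaling off the table above, the medians can be divided by the batch size. The short sketch below uses only the numbers reported in that table (note that the benchmark query fetches ids only); the variable names are ours.

```r
# Median fetch times reported in the benchmark table above, in milliseconds
batch_size <- 2^(4:10)
median_ms  <- c(54.02, 84.83, 175.54, 341.52, 653.21, 1410, 2720)

# Cost of one randomly fetched row for each batch size
per_row_ms <- median_ms / batch_size
round(per_row_ms, 2)

# Growth of time versus growth of batch size between the smallest
# and the largest batch
time_growth <- median_ms[7] / median_ms[1]
size_growth <- batch_size[7] / batch_size[1]
```

The per-row cost stays within roughly 2.5 to 3.4 ms across all batch sizes, so the fetch time grows about linearly with the batch size (slightly slower, in fact, since the smallest batch carries the highest per-row overhead).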


2. Preparing batches

The whole batch preparation process consists of the following steps:

  1. Parsing several JSONs containing vectors of strings with point coordinates.
  2. Drawing colored lines from the point coordinates on an image of the required size (for example, 256×256 or 128×128).
  3. Converting the resulting images into a tensor.

Among the Python kernels in the competition, this problem was solved mainly with OpenCV. One of the simplest and most obvious analogues in R would look like this:

Implementing JSON-to-tensor conversion in R

r_process_json_str <- function(json, line.width = 3, 
                               color = TRUE, scale = 1) {
  # Parse the JSON
  coords <- jsonlite::fromJSON(json, simplifyMatrix = FALSE)
  tmp <- tempfile()
  # Remove the temporary file when the function exits
  on.exit(unlink(tmp))
  png(filename = tmp, width = 256 * scale, height = 256 * scale, pointsize = 1)
  # Empty plot
  plot.new()
  # Plot window size
  plot.window(xlim = c(256 * scale, 0), ylim = c(256 * scale, 0))
  # Line colors
  cols <- if (color) rainbow(length(coords)) else "#000000"
  for (i in seq_along(coords)) {
    lines(x = coords[[i]][[1]] * scale, y = coords[[i]][[2]] * scale, 
          col = cols[i], lwd = line.width)
  }
  dev.off()
  # Convert the image into a 3-dimensional array
  res <- png::readPNG(tmp)
  return(res)
}

r_process_json_vector <- function(x, ...) {
  res <- lapply(x, r_process_json_str, ...)
  # Combine the 3-dimensional image arrays into a 4-dimensional tensor
  res <- do.call(abind::abind, c(res, along = 0))
  return(res)
}

The drawing is done with standard R tools and saved to a temporary PNG that is kept in RAM (on Linux, R's temporary directories live under /tmp, which is mounted in RAM on many systems). This file is then read as a three-dimensional array with values ranging from 0 to 1. This matters because a more conventional BMP would be read into a raw array with hex color codes.

Let's test the result:

zip_file <- file.path("data", "train_simplified.zip")
csv_file <- "cat.csv"
unzip(zip_file, files = csv_file, exdir = tempdir(), 
      junkpaths = TRUE, unzip = getOption("unzip"))
tmp_data <- data.table::fread(file.path(tempdir(), csv_file), sep = ",", 
                              select = "drawing", nrows = 10000)
arr <- r_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))


The batch itself is formed as follows:

res <- r_process_json_vector(tmp_data[1:4, drawing], scale = 0.5)
str(res)
 # num [1:4, 1:128, 1:128, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
 # - attr(*, "dimnames")=List of 4
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL

This implementation seemed suboptimal to us, since forming large batches takes an indecently long time, so we decided to take advantage of our colleagues' experience and use the powerful OpenCV library. At that time there was no ready-made package for R (nor is there one at the time of writing), so a minimal implementation of the required functionality was written in C++ and integrated into the R code using Rcpp.

To solve the problem, the following packages and libraries were used:

  1. OpenCV for working with images and drawing lines. We used pre-installed system libraries and header files, as well as dynamic linking.

  2. xtensor for working with multidimensional arrays and tensors. We used the header files included in the R package of the same name. The library lets you work with multidimensional arrays in both row-major and column-major order.

  3. ndjson for parsing JSON. This library is used by xtensor automatically if it is present in the project.

  4. RcppThread for organizing multi-threaded processing of a vector of JSONs. We used the header files provided by this package. It differs from the more popular RcppParallel in that, among other things, it has a built-in loop interruption mechanism.

It should be noted that xtensor turned out to be a godsend: besides offering extensive functionality and high performance, its developers proved to be quite responsive and answered questions promptly and in detail. With their help we managed to implement the conversion of OpenCV matrices into xtensor tensors, as well as a way to combine 3-dimensional image tensors into a 4-dimensional tensor of the correct dimension (the batch itself).
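Why the memory order matters here: a continuous cv::Mat stores pixels row by row with the channels interleaved, whereas R arrays are column-major. The sketch below shows in plain R the reindexing that any such conversion has to achieve; it is not the xtensor code, and the function name row_major_to_r is hypothetical.

```r
# Convert a flat buffer laid out as in a cv::Mat (row by row, channels
# interleaved) into an R array indexed as [row, column, channel].
# The function name is hypothetical and used only for this illustration.
row_major_to_r <- function(buf, rows, cols, channels) {
  stopifnot(length(buf) == rows * cols * channels)
  # R fills arrays column-major (the first index varies fastest), so the
  # buffer is first read with the dimensions reversed ...
  a <- array(buf, dim = c(channels, cols, rows))
  # ... and then the axes are permuted into the desired order
  aperm(a, c(3, 2, 1))
}

# A 2 x 3 image with 2 channels; each value encodes row, column and channel
rows <- 2; cols <- 3; channels <- 2
buf <- numeric(0)
for (r in seq_len(rows)) {
  for (cc in seq_len(cols)) {
    for (ch in seq_len(channels)) {
      buf <- c(buf, 100 * r + 10 * cc + ch)
    }
  }
}
img <- row_major_to_r(buf, rows, cols, channels)
img[2, 3, 1]  # element at row 2, column 3, channel 1
```

In the toy buffer the value 231 stands for row 2, column 3, channel 1, so a correct conversion must return it for img[2, 3, 1].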

Materials for learning Rcpp, xtensor and RcppThread

https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation

https://docs.opencv.org/4.0.1/d7/dbd/group__imgproc.html

https://xtensor.readthedocs.io/en/latest/

https://xtensor.readthedocs.io/en/latest/file_loading.html#loading-json-data-into-xtensor

https://cran.r-project.org/web/packages/RcppThread/vignettes/RcppThread-vignette.pdf

To compile files that use system header files and dynamic linking with libraries installed on the system, we used the plugin mechanism implemented in the Rcpp package. To find the paths and flags automatically, we used the popular Linux utility pkg-config.

Implementing an Rcpp plugin for using the OpenCV library

Rcpp::registerPlugin("opencv", function() {
  # Possible package names
  pkg_config_name <- c("opencv", "opencv4")
  # Path to the pkg-config binary
  pkg_config_bin <- Sys.which("pkg-config")
  # Check that the utility is available on the system
  checkmate::assert_file_exists(pkg_config_bin, access = "x")
  # Check that an OpenCV config file for pkg-config exists
  check <- sapply(pkg_config_name, 
                  function(pkg) system(paste(pkg_config_bin, pkg)))
  if (all(check != 0)) {
    stop("OpenCV config for the pkg-config not found", call. = FALSE)
  }

  # Use the first package name that was found
  pkg_config_name <- pkg_config_name[check == 0][1L]
  list(env = list(
    PKG_CXXFLAGS = system(paste(pkg_config_bin, "--cflags", pkg_config_name), 
                          intern = TRUE),
    PKG_LIBS = system(paste(pkg_config_bin, "--libs", pkg_config_name), 
                      intern = TRUE)
  ))
})

As a result of the plugin's work, the following values are substituted during compilation:

Rcpp:::.plugins$opencv()$env

# $PKG_CXXFLAGS
# [1] "-I/usr/include/opencv"
#
# $PKG_LIBS
# [1] "-lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core"

The implementation code for parsing JSON and forming a batch to pass to the model is given under the spoiler. First, add the local project directory to the search path for header files (needed for ndjson):

Sys.setenv("PKG_CXXFLAGS" = paste0("-I", normalizePath(file.path("src"))))

Implementing JSON-to-tensor conversion in C++

// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::plugins(opencv)]]
// [[Rcpp::depends(xtensor)]]
// [[Rcpp::depends(RcppThread)]]

#include <xtensor/xjson.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-r/rtensor.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <Rcpp.h>
#include <RcppThread.h>

// Type aliases
using RcppThread::parallelFor;
using json = nlohmann::json;
using points = xt::xtensor<double,2>;     // Point coordinates of one stroke extracted from JSON
using strokes = std::vector<points>;      // All strokes of a drawing extracted from JSON
using xtensor3d = xt::xtensor<double, 3>; // Tensor holding the image matrix
using xtensor4d = xt::xtensor<double, 4>; // Tensor holding a set of images
using rtensor3d = xt::rtensor<double, 3>; // Wrapper for export to R
using rtensor4d = xt::rtensor<double, 4>; // Wrapper for export to R

// Static constants
// Image size in pixels
const static int SIZE = 256;
// Line type
// See https://en.wikipedia.org/wiki/Pixel_connectivity#2-dimensional
const static int LINE_TYPE = cv::LINE_4;
// Line width in pixels
const static int LINE_WIDTH = 3;
// Resize algorithm
// https://docs.opencv.org/3.1.0/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
const static int RESIZE_TYPE = cv::INTER_LINEAR;

// Template for converting an OpenCV matrix into a tensor
template <typename T, int NCH, typename XT=xt::xtensor<T,3,xt::layout_type::column_major>>
XT to_xt(const cv::Mat_<cv::Vec<T, NCH>>& src) {
  // Shape of the target tensor
  std::vector<int> shape = {src.rows, src.cols, NCH};
  // Total number of elements in the array
  size_t size = src.total() * NCH;
  // Convert cv::Mat to xt::xtensor
  XT res = xt::adapt((T*) src.data, size, xt::no_ownership(), shape);
  return res;
}

// Convert JSON into a list of point coordinates
strokes parse_json(const std::string& x) {
  auto j = json::parse(x);
  // The parsing result must be an array
  if (!j.is_array()) {
    throw std::runtime_error("'x' must be JSON array.");
  }
  strokes res;
  res.reserve(j.size());
  for (const auto& a: j) {
    // Each element of the array must be a 2-dimensional array
    if (!a.is_array() || a.size() != 2) {
      throw std::runtime_error("'x' must include only 2d arrays.");
    }
    // Extract the vector of points
    auto p = a.get<points>();
    res.push_back(p);
  }
  return res;
}

// Draw the lines
// Colors are specified in HSV
cv::Mat ocv_draw_lines(const strokes& x, bool color = true) {
  // Source matrix type
  auto stype = color ? CV_8UC3 : CV_8UC1;
  // Resulting matrix type
  auto dtype = color ? CV_32FC3 : CV_32FC1;
  auto bg = color ? cv::Scalar(0, 0, 255) : cv::Scalar(255);
  auto col = color ? cv::Scalar(0, 255, 220) : cv::Scalar(0);
  cv::Mat img = cv::Mat(SIZE, SIZE, stype, bg);
  // Number of lines
  size_t n = x.size();
  for (const auto& s: x) {
    // Number of points in the line
    size_t n_points = s.shape()[1];
    for (size_t i = 0; i < n_points - 1; ++i) {
      // Start point of the segment
      cv::Point from(s(0, i), s(1, i));
      // End point of the segment
      cv::Point to(s(0, i + 1), s(1, i + 1));
      // Draw the segment
      cv::line(img, from, to, col, LINE_WIDTH, LINE_TYPE);
    }
    if (color) {
      // Change the line color
      col[0] += 180 / n;
    }
  }
  if (color) {
    // Convert the color representation to RGB
    cv::cvtColor(img, img, cv::COLOR_HSV2RGB);
  }
  // Convert to float32 with values in the range [0, 1]
  img.convertTo(img, dtype, 1 / 255.0);
  return img;
}

// Process the JSON and obtain a tensor with the image data
xtensor3d process(const std::string& x, double scale = 1.0, bool color = true) {
  auto p = parse_json(x);
  auto img = ocv_draw_lines(p, color);
  if (scale != 1) {
    cv::Mat out;
    cv::resize(img, out, cv::Size(), scale, scale, RESIZE_TYPE);
    cv::swap(img, out);
    out.release();
  }
  xtensor3d arr = color ? to_xt<double,3>(img) : to_xt<double,1>(img);
  return arr;
}

// [[Rcpp::export]]
rtensor3d cpp_process_json_str(const std::string& x, 
                               double scale = 1.0, 
                               bool color = true) {
  xtensor3d res = process(x, scale, color);
  return res;
}

// [[Rcpp::export]]
rtensor4d cpp_process_json_vector(const std::vector<std::string>& x, 
                                  double scale = 1.0, 
                                  bool color = false) {
  size_t n = x.size();
  size_t dim = floor(SIZE * scale);
  size_t channels = color ? 3 : 1;
  xtensor4d res({n, dim, dim, channels});
  parallelFor(0, n, [&x, &res, scale, color](int i) {
    xtensor3d tmp = process(x[i], scale, color);
    auto view = xt::view(res, i, xt::all(), xt::all(), xt::all());
    view = tmp;
  });
  return res;
}

This code should be placed in the file src/cv_xt.cpp and compiled with the command Rcpp::sourceCpp(file = "src/cv_xt.cpp", env = .GlobalEnv); nlohmann/json.hpp from the repository is also required. The code is divided into several functions:

  • to_xt - a template function for converting an image matrix (cv::Mat) into an xt::xtensor tensor;

  • parse_json - parses a JSON string, extracts the point coordinates and packs them into a vector;

  • ocv_draw_lines - draws multi-colored lines from the resulting vector of points;

  • process - combines the functions above and also adds the ability to scale the resulting image;

  • cpp_process_json_str - a wrapper around process that exports the result to an R object (a multidimensional array);

  • cpp_process_json_vector - applies process to a vector of strings in multi-threaded mode and exports the resulting batch to R.

To draw multi-colored lines, the HSV color model was used, followed by conversion to RGB. Let's test the result:

arr <- cpp_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))
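The per-stroke coloring in ocv_draw_lines can be mirrored in a few lines of R. This is a sketch for illustration; the function name stroke_hues is ours. It relies on two facts about the C++ code: the hue starts at 0 and is increased by the integer quotient 180 / n after each stroke, and OpenCV's 8-bit HSV representation uses hue values in [0, 180) with saturation and value in [0, 255].

```r
# Hue assigned to each stroke, mirroring `col[0] += 180 / n` in
# ocv_draw_lines(): the hue starts at 0 and grows by an integer step.
stroke_hues <- function(n) {
  step <- 180 %/% n          # integer division, as in the C++ code
  (seq_len(n) - 1) * step
}

hues <- stroke_hues(4)       # 0 45 90 135

# grDevices::hsv() expects all three components in [0, 1], so the
# OpenCV-style values (S = 255, V = 220) are rescaled
stroke_cols <- hsv(h = hues / 180, s = 255 / 255, v = 220 / 255)
stroke_cols
```

With four strokes the hues are spread evenly over the color wheel, which is what makes consecutive strokes distinguishable in the rendered image.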

Comparing the speed of the R and C++ implementations

res_bench <- bench::mark(
  r_process_json_str(tmp_data[4, drawing], scale = 0.5),
  cpp_process_json_str(tmp_data[4, drawing], scale = 0.5),
  check = FALSE,
  min_iterations = 100
)
# Benchmark parameters
cols <- c("expression", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   expression                min     median       max `itr/sec` total_time  n_itr
#   <chr>                <bch:tm>   <bch:tm>  <bch:tm>     <dbl>   <bch:tm>  <int>
# 1 r_process_json_str     3.49ms     3.55ms    4.47ms      273.      490ms    134
# 2 cpp_process_json_str   1.94ms     2.02ms    5.32ms      489.      497ms    243

library(ggplot2)
# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    .data <- tmp_data[sample(seq_len(.N), batch_size), drawing]
    bench::mark(
      r_process_json_vector(.data, scale = 0.5),
      cpp_process_json_vector(.data,  scale = 0.5),
      min_iterations = 50,
      check = FALSE
    )
  }
)

res_bench[, cols]

#    expression   batch_size      min   median      max `itr/sec` total_time n_itr
#    <chr>             <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
#  1 r                   16   50.61ms  53.34ms  54.82ms    19.1     471.13ms     9
#  2 cpp                 16    4.46ms   5.39ms   7.78ms   192.      474.09ms    91
#  3 r                   32   105.7ms 109.74ms 212.26ms     7.69        6.5s    50
#  4 cpp                 32    7.76ms  10.97ms  15.23ms    95.6     522.78ms    50
#  5 r                   64  211.41ms 226.18ms 332.65ms     3.85      12.99s    50
#  6 cpp                 64   25.09ms  27.34ms  32.04ms    36.0        1.39s    50
#  7 r                  128   534.5ms 627.92ms 659.08ms     1.61      31.03s    50
#  8 cpp                128   56.37ms  58.46ms  66.03ms    16.9        2.95s    50
#  9 r                  256     1.15s    1.18s    1.29s     0.851     58.78s    50
# 10 cpp                256  114.97ms 117.39ms 130.09ms     8.45       5.92s    50
# 11 r                  512     2.09s    2.15s    2.32s     0.463       1.8m    50
# 12 cpp                512  230.81ms  235.6ms 261.99ms     4.18      11.97s    50
# 13 r                 1024        4s    4.22s     4.4s     0.238       3.5m    50
# 14 cpp               1024  410.48ms 431.43ms 462.44ms     2.33      21.45s    50

ggplot(res_bench, aes(x = factor(batch_size), y = median, 
                      group =  expression, color = expression)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal() +
  scale_color_discrete(name = "", labels = c("cpp", "r")) +
  theme(legend.position = "bottom") 


As you can see, the speed-up turned out to be very significant, roughly an order of magnitude on batches, and the C++ code cannot be caught up with by parallelizing the R code.
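The size of the gain can be read directly off the medians in the batch benchmark table. The sketch below only recomputes ratios from those reported numbers; the variable names are ours.

```r
# Median batch-processing times from the table above, in milliseconds
batch_size <- 2^(4:10)
r_ms   <- c(53.34, 109.74, 226.18, 627.92, 1180, 2150, 4220)
cpp_ms <- c(5.39, 10.97, 27.34, 58.46, 117.39, 235.6, 431.43)

# How many times faster the C++ implementation is for each batch size
speedup <- r_ms / cpp_ms
data.frame(batch_size, speedup = round(speedup, 1))
```

For every batch size the ratio falls between 8 and 11, whereas for a single drawing the difference in the first benchmark is less than twofold (3.55 ms against 2.02 ms), so most of the gain on batches appears to come from the multi-threaded processing of the vector.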

3. Iterators for unloading batches from the database

R has a well-deserved reputation for processing data that fits in RAM, whereas Python is more associated with iterative data processing, which makes it easy and natural to implement out-of-core computations (computations using external memory). A classic example, and a relevant one for the problem described here, is deep neural networks trained by gradient descent, with the gradient approximated at each step from a small portion of the observations, or mini-batch.

ืœืžืกื’ืจื•ืช ืœืžื™ื“ื” ืขืžื•ืงื” ืฉื ื›ืชื‘ื• ื‘-Python ื™ืฉ ืฉื™ืขื•ืจื™ื ืžื™ื•ื—ื“ื™ื ืฉืžื™ื™ืฉืžื™ื ืื™ื˜ืจื˜ื•ืจื™ื ืขืœ ืกืžืš ื ืชื•ื ื™ื: ื˜ื‘ืœืื•ืช, ืชืžื•ื ื•ืช ื‘ืชื™ืงื™ื•ืช, ืคื•ืจืžื˜ื™ื ื‘ื™ื ืืจื™ื ื•ื›ื•'. ื ื™ืชืŸ ืœื”ืฉืชืžืฉ ื‘ืืคืฉืจื•ื™ื•ืช ืžื•ื›ื ื•ืช ืื• ืœื›ืชื•ื‘ ืžืฉืœืš ืœืžืฉื™ืžื•ืช ืกืคืฆื™ืคื™ื•ืช. ื‘-R ื ื•ื›ืœ ืœื ืฆืœ ืืช ื›ืœ ื”ืชื›ื•ื ื•ืช ืฉืœ ืกืคืจื™ื™ืช Python keras ืขื ื”ื—ืœืงื™ื ื”ืื—ื•ืจื™ื™ื ื”ืฉื•ื ื™ื ืฉืœื” ื‘ืืžืฆืขื•ืช ื”ื—ื‘ื™ืœื” ื‘ืื•ืชื• ืฉื, ืฉื‘ืชื•ืจื” ืคื•ืขืœืช ืขืœ ื’ื‘ื™ ื”ื—ื‘ื™ืœื” ืœืจืฉืช. ื–ื” ื”ืื—ืจื•ืŸ ืจืื•ื™ ืœืžืืžืจ ืืจื•ืš ื ืคืจื“; ื–ื” ืœื ืจืง ืžืืคืฉืจ ืœืš ืœื”ืจื™ืฅ ืงื•ื“ Python ืž-R, ืืœื ื’ื ืžืืคืฉืจ ืœืš ืœื”ืขื‘ื™ืจ ืื•ื‘ื™ื™ืงื˜ื™ื ื‘ื™ืŸ ื”ืคืขืœื•ืช R ื•- Python, ืชื•ืš ื‘ื™ืฆื•ืข ืื•ื˜ื•ืžื˜ื™ ืฉืœ ื›ืœ ื”ืžืจื•ืช ื”ืกื•ื’ ื”ื“ืจื•ืฉื•ืช.

ื ืคื˜ืจื ื• ืžื”ืฆื•ืจืš ืœืื—ืกืŸ ืืช ื›ืœ ื”ื ืชื•ื ื™ื ื‘-RAM ื‘ืืžืฆืขื•ืช MonetDBLite, ื›ืœ ืขื‘ื•ื“ืช ื”"ืจืฉืช ื”ืขืฆื‘ื™ืช" ืชืชื‘ืฆืข ืขืœ ื™ื“ื™ ื”ืงื•ื“ ื”ืžืงื•ืจื™ ื‘-Python, ืื ื—ื ื• ืจืง ืฆืจื™ื›ื™ื ืœื›ืชื•ื‘ ืื™ื˜ืจื˜ื•ืจ ืขืœ ื”ื ืชื•ื ื™ื, ืžื›ื™ื•ื•ืŸ ืฉืื™ืŸ ืฉื•ื ื“ื‘ืจ ืžื•ื›ืŸ ืœืžืฆื‘ ื›ื–ื” ื‘-R ืื• ื‘-Python. ืœืžืขืฉื” ื™ืฉ ืจืง ืฉืชื™ ื“ืจื™ืฉื•ืช ืขื‘ื•ืจื•: ืขืœื™ื• ืœื”ื—ื–ื™ืจ ืืฆื•ื•ื” ื‘ืœื•ืœืื” ืื™ื ืกื•ืคื™ืช ื•ืœืฉืžื•ืจ ืืช ืžืฆื‘ื• ื‘ื™ืŸ ืื™ื˜ืจืฆื™ื•ืช (ื”ืื—ืจื•ืŸ ื‘-R ืžื™ื•ืฉื ื‘ืฆื•ืจื” ื”ืคืฉื•ื˜ื” ื‘ื™ื•ืชืจ ื‘ืืžืฆืขื•ืช ืกื’ื™ืจื•ืช). ื‘ืขื‘ืจ, ื ื“ืจืฉ ืœื”ืžื™ืจ ื‘ืžืคื•ืจืฉ ืžืขืจื›ื™ R ืœืžืขืจื›ื™ื numpy ื‘ืชื•ืš ื”ืื™ื˜ืจื˜ื•ืจ, ืื‘ืœ ื”ื’ืจืกื” ื”ื ื•ื›ื—ื™ืช ืฉืœ ื”ื—ื‘ื™ืœื” keras ืขื•ืฉื” ืืช ื–ื” ื‘ืขืฆืžื”.
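The two requirements (an endless stream of batches and state kept between calls) can be illustrated with a stripped-down closure in base R. This is only a sketch: `make_batch_iterator` and its arguments are hypothetical names, and it returns ids rather than image arrays, so it shows the mechanism and not the iterator used in the competition.

```r
# Minimal sketch of a stateful batch iterator built with a closure.
# `make_batch_iterator` and its arguments are illustrative names only.
make_batch_iterator <- function(ids, batch_size) {
  # Keep only as many ids as fill whole batches
  n_batches <- length(ids) %/% batch_size
  ids <- ids[seq_len(n_batches * batch_size)]
  # Batch number for every position: 1, 1, ..., 2, 2, ...
  batch <- (seq_along(ids) - 1L) %/% batch_size + 1L
  i <- 1L  # state that survives between calls

  function() {
    # Wrap around so the iterator can be called indefinitely
    if (i > n_batches) i <<- 1L
    out <- ids[batch == i]
    i <<- i + 1L
    out
  }
}

next_batch <- make_batch_iterator(ids = 1:10, batch_size = 4L)
b1 <- next_batch()  # ids 1..4
b2 <- next_batch()  # ids 5..8
b3 <- next_batch()  # wraps around: ids 1..4 again
```

The counter `i` lives in the environment of the enclosing function, so each call to `next_batch()` sees the value left by the previous call; the `<<-` operator is what updates it there.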

The iterator for training and validation data turned out as follows:

Iterator for training and validation data

train_generator <- function(db_connection = con,
                            samples_index,
                            num_classes = 340,
                            batch_size = 32,
                            scale = 1,
                            color = FALSE,
                            imagenet_preproc = FALSE) {
  # Check arguments
  checkmate::assert_class(db_connection, "DBIConnection")
  checkmate::assert_integerish(samples_index)
  checkmate::assert_count(num_classes)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Shuffle so that used batch indices can be taken and removed in order
  dt <- data.table::data.table(id = sample(samples_index))
  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  # Keep only full batches and index
  dt <- dt[, if (.N == batch_size) .SD, keyby = batch]
  # Set the counter
  i <- 1
  # Number of batches
  max_i <- dt[, max(batch)]

  # Prepare the statement for fetching data
  sql <- sprintf(
    "PREPARE SELECT drawing, label_int FROM doodles WHERE id IN (%s)",
    paste(rep("?", batch_size), collapse = ",")
  )
  res <- DBI::dbSendQuery(db_connection, sql)

  # Analogue of keras::to_categorical
  to_categorical <- function(x, num) {
    n <- length(x)
    m <- numeric(n * num)
    m[x * n + seq_len(n)] <- 1
    dim(m) <- c(n, num)
    return(m)
  }

  # Closure
  function() {
    # Start a new epoch
    if (i > max_i) {
      dt[, id := sample(id)]
      data.table::setkey(dt, batch)
      # Reset the counter
      i <<- 1
      max_i <<- dt[, max(batch)]
    }

    # IDs for fetching data
    batch_ind <- dt[batch == i, id]
    # Fetch data
    batch <- DBI::dbFetch(DBI::dbBind(res, as.list(batch_ind)), n = -1)

    # Increment the counter
    i <<- i + 1

    # Parse JSON and prepare the array
    batch_x <- cpp_process_json_vector(batch$drawing, scale = scale, color = color)
    if (imagenet_preproc) {
      # Rescale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }

    batch_y <- to_categorical(batch$label_int, num_classes)
    result <- list(batch_x, batch_y)
    return(result)
  }
}

The function takes as input a variable with the database connection, the numbers of the rows to use, the number of classes, the batch size, the scale (scale = 1 corresponds to rendering images of 256x256 pixels, scale = 0.5 to 128x128 pixels), a color flag (color = FALSE renders in grayscale, while with color = TRUE each stroke is drawn in a new color) and a preprocessing flag for networks pre-trained on imagenet. The latter is needed to rescale pixel values from the interval [0, 1] to the interval [-1, 1], which was used when training the models shipped with keras.
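The arithmetic behind the scale and preprocessing arguments is small enough to check by hand. The sketch below assumes a 256-pixel base side, as stated above, and uses made-up pixel values; it is an illustration, not code from the project.

```r
# Image side for the two scale values mentioned in the text
side <- 256 * c(1, 0.5)   # 256 and 128 pixels

# Made-up pixel intensities, assumed to lie in [0, 1]
x <- c(0, 0.25, 0.5, 1)
# Same transformation as in the iterator: shift to [-0.5, 0.5], then stretch
x_scaled <- (x - 0.5) * 2
```

Here 0 maps to -1, 0.5 maps to 0 and 1 maps to 1, so the midpoint of the original range ends up at zero.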

The outer function contains argument type checking, a data.table with randomly shuffled row numbers from samples_index and batch numbers, a counter and the maximum number of batches, as well as the SQL expression for unloading data from the database. In addition, we defined inside it a fast analogue of the keras::to_categorical() function. We used almost all the data for training, leaving half a percent for validation, so the epoch size was limited by the steps_per_epoch parameter when calling keras::fit_generator(), and the condition if (i > max_i) only fired for the validation iterator.

In the inner function, the row indices for the next batch are retrieved, the records are unloaded from the database while the batch counter is incremented, the JSON is parsed (the cpp_process_json_vector() function, written in C++) and the arrays corresponding to the images are created. Then one-hot vectors with class labels are created, and the arrays with pixel values and with labels are combined into a list, which is the return value. To speed things up, we used index creation in data.table tables and modification by reference; without these data.table "features" it is hard to imagine working efficiently with any significant amount of data in R.
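The one-hot step relies on R storing matrices in column-major order, which is easy to miss when reading the iterator. The snippet below repeats the helper from the iterator on a tiny input; the label values are made up, and labels are assumed to be zero-based, as label_int appears to be.

```r
# Column-major one-hot encoding, mirroring to_categorical() from the iterator.
# Labels are assumed to be zero-based integers in 0..(num - 1).
to_categorical <- function(x, num) {
  n <- length(x)
  m <- numeric(n * num)
  # Element (row i, column x[i] + 1) sits at linear index x[i] * n + i
  m[x * n + seq_len(n)] <- 1
  dim(m) <- c(n, num)
  m
}

labels <- c(2L, 0L, 1L)
onehot <- to_categorical(labels, num = 4L)
```

Filling a flat vector and setting `dim` afterwards avoids building the matrix row by row, which is presumably why this version is faster than a generic implementation.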

The results of speed measurements on a Core i5 laptop are as follows:

Iterator benchmark

library(Rcpp)
library(keras)
library(ggplot2)

source("utils/rcpp.R")
source("utils/keras_iterator.R")

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

ind <- seq_len(DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]])
num_classes <- DBI::dbGetQuery(con, "SELECT max(label_int) + 1 FROM doodles")[[1L]]

# Indices for the training sample
train_ind <- sample(ind, floor(length(ind) * 0.995))
# Indices for the validation sample
val_ind <- ind[-train_ind]
rm(ind)
# Scale factor
scale <- 0.5

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    it1 <- train_generator(
      db_connection = con,
      samples_index = train_ind,
      num_classes = num_classes,
      batch_size = batch_size,
      scale = scale
    )
    bench::mark(
      it1(),
      min_iterations = 50L
    )
  }
)
# Benchmark parameters
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16     25ms  64.36ms   92.2ms     15.9       3.09s    49
# 2         32   48.4ms 118.13ms 197.24ms     8.17       5.88s    48
# 3         64   69.3ms 117.93ms 181.14ms     8.57       5.83s    50
# 4        128  157.2ms 240.74ms 503.87ms     3.85      12.71s    49
# 5        256  359.3ms 613.52ms 988.73ms     1.54       30.5s    47
# 6        512  884.7ms    1.53s    2.07s     0.674      1.11m    45
# 7       1024     2.7s    3.83s    5.47s     0.261      2.81m    44

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
    geom_point() +
    geom_line() +
    ylab("median time, s") +
    theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)


If you have enough RAM, you can seriously speed up the database by moving it into that same RAM (32 GB is enough for our task). On Linux, the /dev/shm partition is mounted by default and occupies up to half of the RAM capacity. You can allocate more by editing /etc/fstab to get an entry like tmpfs /dev/shm tmpfs defaults,size=25g 0 0. Be sure to reboot and check the result by running df -h.

The iterator for test data looks much simpler, since the test dataset fits entirely in RAM:

Iterator for test data

test_generator <- function(dt,
                           batch_size = 32,
                           scale = 1,
                           color = FALSE,
                           imagenet_preproc = FALSE) {

  # Check arguments
  checkmate::assert_data_table(dt)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  data.table::setkey(dt, batch)
  i <- 1
  max_i <- dt[, max(batch)]

  # Closure
  function() {
    batch_x <- cpp_process_json_vector(dt[batch == i, drawing], 
                                       scale = scale, color = color)
    if (imagenet_preproc) {
      # Rescale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    result <- list(batch_x)
    i <<- i + 1
    return(result)
  }
}

4. Choosing the model architecture

The first architecture used was mobilenet v1, whose features are discussed in this post. It is included in the standard keras distribution and is therefore available in the package of the same name for R. But when we tried to use it with single-channel images, a strange thing turned up: the input tensor must always have the dimensions (batch, height, width, 3), that is, the number of channels cannot be changed. There is no such restriction in Python, so we rushed ahead and wrote our own implementation of this architecture, following the original paper (without the dropout that is present in the keras version):

Mobilenet v1 architecture

library(keras)

top_3_categorical_accuracy <- custom_metric(
    name = "top_3_categorical_accuracy",
    metric_fn = function(y_true, y_pred) {
         metric_top_k_categorical_accuracy(y_true, y_pred, k = 3)
    }
)

layer_sep_conv_bn <- function(object, 
                              filters,
                              alpha = 1,
                              depth_multiplier = 1,
                              strides = c(2, 2)) {

  # NB! depth_multiplier !=  resolution multiplier
  # https://github.com/keras-team/keras/issues/10349

  layer_depthwise_conv_2d(
    object = object,
    kernel_size = c(3, 3), 
    strides = strides,
    padding = "same",
    depth_multiplier = depth_multiplier
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() %>%
  layer_conv_2d(
    filters = filters * alpha,
    kernel_size = c(1, 1), 
    strides = c(1, 1)
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() 
}

get_mobilenet_v1 <- function(input_shape = c(224, 224, 1),
                             num_classes = 340,
                             alpha = 1,
                             depth_multiplier = 1,
                             optimizer = optimizer_adam(lr = 0.002),
                             loss = "categorical_crossentropy",
                             metrics = c("categorical_crossentropy",
                                         top_3_categorical_accuracy)) {

  inputs <- layer_input(shape = input_shape)

  outputs <- inputs %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), strides = c(2, 2), padding = "same") %>%
    layer_batch_normalization() %>% 
    layer_activation_relu() %>%
    layer_sep_conv_bn(filters = 64, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(1, 1)) %>%
    layer_global_average_pooling_2d() %>%
    layer_dense(units = num_classes) %>%
    layer_activation_softmax()

    model <- keras_model(
      inputs = inputs,
      outputs = outputs
    )

    model %>% compile(
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )

    return(model)
}

The disadvantages of this approach are obvious. We want to test many models, yet we do not want to rewrite each architecture by hand. We were also deprived of the opportunity to use the weights of models pre-trained on imagenet. As usual, studying the documentation helped. The get_config() function returns a description of the model in a form suitable for editing (base_model_conf$layers is a regular R list), and the from_config() function performs the reverse conversion into a model object:

base_model_conf <- get_config(base_model)
base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
base_model <- from_config(base_model_conf)
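Because the configuration is an ordinary nested R list, the edit itself can be tried without keras installed. The list below is a hand-made stand-in that mocks only the path touched above; a real get_config() result contains many more fields, and `mock_conf` is a hypothetical name.

```r
# A plain R list standing in for the result of get_config(); the real
# object has many more fields. Only the path edited in the text is mocked.
mock_conf <- list(
  layers = list(
    list(config = list(batch_input_shape = list(NULL, 224L, 224L, 3L)))
  )
)
# Switch the channel dimension (4th element) from 3 to 1
mock_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
shape <- mock_conf$layers[[1]]$config$batch_input_shape
```

The leading NULL stands for the unspecified batch dimension and is left in place by the assignment.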

Now it is not difficult to write a universal function that returns any of the models shipped with keras, with or without weights trained on imagenet:

Function for loading ready-made architectures

get_model <- function(name = "mobilenet_v2",
                      input_shape = NULL,
                      weights = "imagenet",
                      pooling = "avg",
                      num_classes = NULL,
                      optimizer = keras::optimizer_adam(lr = 0.002),
                      loss = "categorical_crossentropy",
                      metrics = NULL,
                      color = TRUE,
                      compile = FALSE) {
  # Check arguments
  checkmate::assert_string(name)
  checkmate::assert_integerish(input_shape, lower = 1, upper = 256, len = 3)
  checkmate::assert_count(num_classes)
  checkmate::assert_flag(color)
  checkmate::assert_flag(compile)

  # Get the object from the keras package
  model_fun <- get0(paste0("application_", name), envir = asNamespace("keras"))
  # Check that the object exists in the package
  if (is.null(model_fun)) {
    stop("Model ", shQuote(name), " not found.", call. = FALSE)
  }

  base_model <- model_fun(
    input_shape = input_shape,
    include_top = FALSE,
    weights = weights,
    pooling = pooling
  )

  # If the image is not in color, change the input dimension
  if (!color) {
    base_model_conf <- keras::get_config(base_model)
    base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
    base_model <- keras::from_config(base_model_conf)
  }

  predictions <- keras::get_layer(base_model, "global_average_pooling2d_1")$output
  predictions <- keras::layer_dense(predictions, units = num_classes, activation = "softmax")
  model <- keras::keras_model(
    inputs = base_model$input,
    outputs = predictions
  )

  if (compile) {
    keras::compile(
      object = model,
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )
  }

  return(model)
}

When using single-channel images, no pre-trained weights are used. This could be fixed: use the get_weights() function to get the model weights as a list of R arrays, change the dimension of the first element of this list (by taking one color channel or averaging all three), and then load the weights back into the model with the set_weights() function. We never added this functionality, because at this stage it was already clear that it was more productive to work with color images.
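The channel-averaging variant of that idea can be sketched on a plain R array. Since this functionality was never added to the project, everything here is an assumption: the kernel is random data, and its layout (height, width, input channels, filters) is the usual one for a channels-last convolution but should be checked against the actual get_weights() output.

```r
# Hypothetical first-layer kernel: 3x3 window, 3 input channels, 32 filters
set.seed(42)
kernel_rgb <- array(rnorm(3 * 3 * 3 * 32), dim = c(3, 3, 3, 32))

# Average over the input-channel axis (3rd dimension)
kernel_avg <- apply(kernel_rgb, c(1, 2, 4), mean)
# Restore the channel axis with length 1: (3, 3, 1, 32)
kernel_gray <- array(kernel_avg, dim = c(3, 3, 1, 32))
```

In a real model the reshaped array would replace the first element of the weight list before it is passed back with set_weights().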

We carried out most of the experiments using mobilenet versions 1 and 2, as well as resnet34. More modern architectures such as SE-ResNeXt performed well in this competition. Unfortunately, we did not have ready-made implementations at our disposal, and we did not write our own (but we definitely will).

5. Parameterization of scripts

For convenience, all the code for launching training was designed as a single script, parameterized with docopt as follows:

doc <- '
Usage:
  train_nn.R --help
  train_nn.R --list-models
  train_nn.R [options]

Options:
  -h --help                   Show this message.
  -l --list-models            List available models.
  -m --model=<model>          Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>      Batch size [default: 32].
  -s --scale-factor=<ratio>   Scale factor [default: 0.5].
  -c --color                  Use color lines [default: FALSE].
  -d --db-dir=<path>          Path to database directory [default: Sys.getenv("db_dir")].
  -r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
  -n --n-gpu=<number>         Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)

The docopt package is an implementation of http://docopt.org/ for R. With its help, scripts are launched with simple commands like Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db or ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db, if the file train_nn.R is executable (this command starts training the resnet50 model on three-color images of 128x128 pixels; the database must be located in the folder /home/andrey/doodle_db). You can add the learning rate, optimizer type and any other customizable parameters to the list. While preparing this publication, it turned out that the mobilenet_v2 architecture from the current version of keras cannot be used from R because of changes not yet accounted for in the R package; we are waiting for a fix.

This approach made it possible to significantly speed up experiments with different models compared to the more traditional launching of scripts in RStudio (we note the tfruns package as a possible alternative). But the main advantage is the ability to easily manage the launching of scripts in Docker or simply on a server, without installing RStudio for this.

6. Dockerization of scripts

We used Docker to ensure portability of the environment for training models between team members and for rapid deployment in the cloud. You can start getting acquainted with this tool, which is relatively unusual for an R programmer, with this series of publications or with a video course.

Docker allows you both to create your own images from scratch and to use other images as a basis for creating your own. When analyzing the available options, we came to the conclusion that installing NVIDIA drivers, CUDA+cuDNN and Python libraries is a fairly voluminous part of the image, so we decided to take the official image tensorflow/tensorflow:1.12.0-gpu as a basis, adding the necessary R packages to it.

The final Dockerfile looked like this:

Dockerfile

FROM tensorflow/tensorflow:1.12.0-gpu

MAINTAINER Artem Klevtsov <[email protected]>

SHELL ["/bin/bash", "-c"]

ARG LOCALE="en_US.UTF-8"
ARG APT_PKG="libopencv-dev r-base r-base-dev littler"
ARG R_BIN_PKG="futile.logger checkmate data.table rcpp rapidjsonr dbi keras jsonlite curl digest remotes"
ARG R_SRC_PKG="xtensor RcppThread docopt MonetDBLite"
ARG PY_PIP_PKG="keras"
ARG DIRS="/db /app /app/data /app/models /app/logs"

RUN source /etc/os-release && \
    echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${UBUNTU_CODENAME}-cran35/" > /etc/apt/sources.list.d/cran35.list && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 && \
    add-apt-repository -y ppa:marutter/c2d4u3.5 && \
    add-apt-repository -y ppa:timsc/opencv-3.4 && \
    apt-get update && \
    apt-get install -y locales && \
    locale-gen ${LOCALE} && \
    apt-get install -y --no-install-recommends ${APT_PKG} && \
    ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r && \
    ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
    ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r && \
    echo 'options(Ncpus = parallel::detectCores())' >> /etc/R/Rprofile.site && \
    echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site && \
    apt-get install -y $(printf "r-cran-%s " ${R_BIN_PKG}) && \
    install.r ${R_SRC_PKG} && \
    pip install ${PY_PIP_PKG} && \
    mkdir -p ${DIRS} && \
    chmod 777 ${DIRS} && \
    rm -rf /tmp/downloaded_packages/ /tmp/*.rds && \
    rm -rf /var/lib/apt/lists/*

COPY utils /app/utils
COPY src /app/src
COPY tests /app/tests
COPY bin/*.R /app/

ENV DBDIR="/db"
ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/app:${PATH}"

WORKDIR /app

VOLUME /db
VOLUME /app

CMD bash

For convenience, the packages used were put into variables; most of the written scripts are copied inside the container during the build. We also changed the command shell to /bin/bash for ease of using the contents of /etc/os-release. This avoided the need to specify the OS version in the code.

In addition, a small bash script was written that lets you launch a container with various commands. For example, these could be scripts for training neural networks that were previously placed inside the container, or a command shell for debugging and monitoring the container's operation:

Script for launching the container

#!/bin/sh

DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"

if [ -z "$1" ]; then
    CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
    ARGS="${ARGS} -ti"
else
    CMD="Rscript /app/train_nn.R $@"
fi

docker run ${ARGS} doodles-tf ${CMD}

If this bash script is run without parameters, the script train_nn.R will be called inside the container with default values; if the first positional argument is "bash", the container will start interactively with a command shell. In all other cases, the values of the positional arguments are substituted: CMD="Rscript /app/train_nn.R $@".

It is worth noting that the directories with the source data and the database, as well as the directory for saving trained models, are mounted inside the container from the host system, which lets you access the results of the scripts without unnecessary manipulations.

7. Using multiple GPUs on Google Cloud

One of the features of the competition was the very noisy data (see the title picture, borrowed from [email protected] from the ODS Slack). Large batches help to combat this, and after experiments on a PC with 1 GPU we decided to master training models on several GPUs in the cloud. We used Google Cloud (a good guide to the basics) because of the large selection of available configurations, reasonable prices and the $300 bonus. Out of greed, I ordered a 4xV100 instance with an SSD and a lot of RAM, and that was a big mistake. Such a machine eats money quickly; you can go broke experimenting without a proven pipeline. For educational purposes, it is better to take a K80. But the large amount of RAM came in handy: the cloud SSD did not impress with its performance, so the database was moved to /dev/shm.

Of greatest interest is the code fragment responsible for using multiple GPUs. First, the model is created on the CPU using a context manager, just like in Python:

with(tensorflow::tf$device("/cpu:0"), {
  model_cpu <- get_model(
    name = model_name,
    input_shape = input_shape,
    weights = weights,
    metrics = c(top_3_categorical_accuracy),
    compile = FALSE
  )
})

Then the uncompiled (this is important) model is copied to the given number of available GPUs, and only after that is it compiled:

model <- keras::multi_gpu_model(model_cpu, gpus = n_gpu)
keras::compile(
  object = model,
  optimizer = keras::optimizer_adam(lr = 0.0004),
  loss = "categorical_crossentropy",
  metrics = c(top_3_categorical_accuracy)
)

The classic technique of freezing all layers except the last one, training the last layer, then unfreezing and fine-tuning the entire model could not be implemented for multiple GPUs.

Training was monitored without using tensorboard; we limited ourselves to recording logs and saving models with informative names after each epoch:

Callbacks

# Log file name template
log_file_tmpl <- file.path("logs", sprintf(
  "%s_%d_%dch_%s.csv",
  model_name,
  dim_size,
  channels,
  format(Sys.time(), "%Y%m%d%H%M%OS")
))
# Model file name template
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name,
  dim_size,
  channels
))

callbacks_list <- list(
  keras::callback_csv_logger(
    filename = log_file_tmpl
  ),
  keras::callback_early_stopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 8,
    verbose = 1,
    mode = "min"
  ),
  keras::callback_reduce_lr_on_plateau(
    monitor = "val_loss",
    factor = 0.5, # halve the learning rate
    patience = 4,
    verbose = 1,
    min_delta = 1e-4,
    mode = "min"
  ),
  keras::callback_model_checkpoint(
    filepath = model_file_tmpl,
    monitor = "val_loss",
    save_best_only = FALSE,
    save_weights_only = FALSE,
    mode = "min"
  )
)
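To see what the model file template looks like before Keras fills it in, the sprintf() call can be run on its own. The values of model_name, dim_size and channels below are illustrative assumptions; in the script they would come from the parsed command-line arguments.

```r
# Illustrative values; in the script they come from the parsed arguments
model_name <- "mobilenet_v2"
dim_size <- 128L
channels <- 3L

# sprintf() fills only the %s/%d slots; the {...} parts are left untouched
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name, dim_size, channels
))
```

The result is "models/mobilenet_v2_128_3ch_{epoch:02d}_{val_loss:.2f}.h5". The placeholders in braces are substituted by the checkpoint callback at save time with the epoch number and the validation loss, which is what makes the saved file names informative.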

8. Instead of a conclusion

A number of problems we ran into have not yet been overcome:

  • keras has no ready-made function for automatically searching for the optimal learning rate (an analogue of lr_finder in the fast.ai library); with some effort, third-party implementations can be ported to R, for example, this one;
  • as a consequence of the previous point, we could not select the right learning rate when using multiple GPUs;
  • there is a lack of modern neural network architectures, especially ones pre-trained on imagenet;
  • there is no one cycle policy and no discriminative learning rates (cosine annealing was implemented at our request, thanks to skeydan).

What useful things were learned from this competition:

  • On relatively low-powered hardware, you can work with decent volumes of data (many times the size of RAM) without pain. The data.table package saves memory through in-place modification of tables, which avoids copying them, and when used correctly its capabilities almost always show the highest speed among all the tools we know of for scripting languages. Storing data in a database lets you, in many cases, not think at all about squeezing the entire dataset into RAM.
  • Slow functions in R can be replaced with fast ones in C++ using the Rcpp package. If you additionally use RcppThread or RcppParallel, you get cross-platform multi-threaded implementations, so there is no need to parallelize the code at the R level.
  • The Rcpp package can be used without serious knowledge of C++; the required minimum is outlined here. Header files for a number of excellent C libraries such as xtensor are available on CRAN, that is, an infrastructure is forming for projects that integrate ready-made high-performance C++ code into R. An additional convenience is syntax highlighting and a static C++ code analyzer in RStudio.
  • docopt lets you run self-contained scripts with parameters. This is convenient for use on a remote server, including under Docker. In RStudio it is inconvenient to run many hours of experiments with training neural networks, and installing the IDE on the server itself is not always justified.
  • Docker ensures code portability and reproducibility of results between developers with different versions of the OS and libraries, as well as ease of execution on servers. You can launch the entire training pipeline with just one command.
  • Google Cloud is a budget-friendly way to experiment on expensive hardware, but you need to choose configurations carefully.
  • Measuring the speed of individual code fragments is very useful, especially when combining R and C++, and with the bench package it is also very easy.

Overall, this experience was very rewarding, and we continue to work on resolving some of the issues raised.

Source: www.habr.com
