Quick Draw Doodle Recognition: how to make friends with R, C++ and neural networks

Hi, Habr!

Last fall Kaggle hosted Quick Draw Doodle Recognition, a competition on classifying hand-drawn pictures, in which, among others, a team of R users took part: Artem Klevtsov, Philipp Upravitelev and Andrey Ogurtsov. We will not describe the competition in detail; that has already been done in a recent publication.

Medal farming did not work out this time, but we gained a lot of valuable experience, so I would like to tell the community about some of the most interesting and useful things, both for Kaggle and for everyday work. The topics covered include: the hard life without OpenCV, JSON parsing (these examples look at integrating C++ code into R scripts or packages using Rcpp), parameterization of scripts, and dockerization of the final solution. All the code from the post, in a form suitable for execution, is available in the repository.

Contents:

  1. Efficiently loading data from CSV into MonetDB
  2. Preparing batches
  3. Iterators for unloading batches from the database
  4. Choosing a model architecture
  5. Script parameterization
  6. Dockerization of scripts
  7. Using multiple GPUs on Google Cloud
  8. Instead of a conclusion

1. Efficiently loading data from CSV into MonetDB

The data in this competition is provided not as ready-made images but as 340 CSV files (one file per class) containing JSONs with point coordinates. Connecting these points with lines gives the final image of 256x256 pixels. Each record also carries a flag indicating whether the picture was correctly recognized by the classifier used at the time the dataset was collected, a two-letter code of the country of residence of the picture's author, a unique identifier, a timestamp, and a class name that matches the file name. The simplified version of the source data weighs 7.4 GB in the archive and about 20 GB after unpacking; the full data takes up 240 GB after unpacking. The organizers guaranteed that both versions reproduce the same drawings, which means the full version was redundant. In any case, storing 50 million images as image files or as arrays was immediately judged impractical, and we decided to merge all the CSV files from the train_simplified.zip archive into a database, and then generate images of the required size on the fly for each batch.

The well-proven MonetDB was chosen as the DBMS, namely its implementation for R in the form of the MonetDBLite package. The package includes an embedded version of the database server and lets you start the server directly from an R session and work with it there. Creating the database and connecting to it takes a single command:

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

We will need to create two tables: one for all the data, the other for service information about the uploaded files (useful if something goes wrong and the process has to be resumed after several files have already been loaded):

Creating tables

if (!DBI::dbExistsTable(con, "doodles")) {
  DBI::dbCreateTable(
    con = con,
    name = "doodles",
    fields = c(
      "countrycode" = "char(2)",
      "drawing" = "text",
      "key_id" = "bigint",
      "recognized" = "bool",
      "timestamp" = "timestamp",
      "word" = "text"
    )
  )
}

if (!DBI::dbExistsTable(con, "upload_log")) {
  DBI::dbCreateTable(
    con = con,
    name = "upload_log",
    fields = c(
      "id" = "serial",
      "file_name" = "text UNIQUE",
      "uploaded" = "bool DEFAULT false"
    )
  )
}

The fastest way to load data into the database turned out to be copying the CSV files directly with the SQL command COPY OFFSET 2 INTO tablename FROM path USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT, where tablename is the table name and path is the path to the file. While working with the archive, we found that the built-in unzip implementation in R does not work correctly with a number of files from the archive, so we used the system unzip (via the getOption("unzip") parameter).

Function for writing to the database

#' @title Extracting and loading files
#'
#' @description
#' Extracts CSV files from a ZIP archive and loads them into the database
#'
#' @param con Database connection object (class `MonetDBEmbeddedConnection`).
#' @param tablename Name of the table in the database.
#' @param zipfile Path to the ZIP archive.
#' @param filename Name of the file inside the ZIP archive.
#' @param preprocess Preprocessing function to be applied to the extracted file.
#'   Must take one argument `data` (a `data.table` object).
#'
#' @return `TRUE`.
#'
upload_file <- function(con, tablename, zipfile, filename, preprocess = NULL) {
  # Check the arguments
  checkmate::assert_class(con, "MonetDBEmbeddedConnection")
  checkmate::assert_string(tablename)
  checkmate::assert_string(filename)
  checkmate::assert_true(DBI::dbExistsTable(con, tablename))
  checkmate::assert_file_exists(zipfile, access = "r", extension = "zip")
  checkmate::assert_function(preprocess, args = c("data"), null.ok = TRUE)

  # Extract the file
  path <- file.path(tempdir(), filename)
  unzip(zipfile, files = filename, exdir = tempdir(), 
        junkpaths = TRUE, unzip = getOption("unzip"))
  on.exit(unlink(path))

  # Apply the preprocessing function
  if (!is.null(preprocess)) {
    .data <- data.table::fread(file = path)
    .data <- preprocess(data = .data)
    data.table::fwrite(x = .data, file = path, append = FALSE)
    rm(.data)
  }

  # SQL query that imports the CSV into the database
  sql <- sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT",
    tablename, path
  )
  # Run the query
  DBI::dbExecute(con, sql)

  # Record the successful upload in the service table
  DBI::dbExecute(con, sprintf("INSERT INTO upload_log(file_name, uploaded) VALUES('%s', true)",
                              filename))

  return(invisible(TRUE))
}

If you need to transform the table before writing it to the database, it is enough to pass a function that transforms the data in the preprocess argument.
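As an illustration, a preprocessing function might look like the sketch below. The name keep_recognized and the toy table are hypothetical; the only requirement taken from the text is a single argument named data. The sketch also assumes the recognized column has already been read as a logical vector.

```r
# Hypothetical preprocessing function: keep only the drawings that the
# original classifier recognized. Works for data.frame and data.table alike.
keep_recognized <- function(data) {
  data[data$recognized, ]
}

# Toy stand-in for one extracted CSV file
toy <- data.frame(
  key_id = c(1, 2, 3),
  recognized = c(TRUE, FALSE, TRUE),
  word = c("cat", "cat", "cat")
)
filtered <- keep_recognized(toy)
nrow(filtered)
```

Such a function would then be passed as upload_file(..., preprocess = keep_recognized).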

Code for loading the data into the database sequentially:

Writing data to the database

# List of files to write
files <- unzip(zipfile, list = TRUE)$Name

# List of exclusions, in case some of the files have already been loaded
to_skip <- DBI::dbGetQuery(con, "SELECT file_name FROM upload_log")[[1L]]
files <- setdiff(files, to_skip)

if (length(files) > 0L) {
  # Start the timer
  tictoc::tic()
  # Progress bar
  pb <- txtProgressBar(min = 0L, max = length(files), style = 3)
  for (i in seq_along(files)) {
    upload_file(con = con, tablename = "doodles", 
                zipfile = zipfile, filename = files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Stop the timer
  tictoc::toc()
}

# 526.141 sec elapsed - copying SSD->SSD
# 558.879 sec elapsed - copying USB->SSD

The data loading time may vary depending on the speed of the drive used. In our case, reading and writing within one SSD, or from a flash drive (source file) to an SSD (database), takes less than 10 minutes.

It takes a few more seconds to create a column with an integer class label and an index column (ORDERED INDEX) with the row numbers that will be used to sample observations when forming batches:

Creating additional columns and an index

message("Generate labels")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD label_int int"))
invisible(DBI::dbExecute(con, "UPDATE doodles SET label_int = dense_rank() OVER (ORDER BY word) - 1"))

message("Generate row numbers")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD id serial"))
invisible(DBI::dbExecute(con, "CREATE ORDERED INDEX doodles_id_ord_idx ON doodles(id)"))
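What dense_rank() OVER (ORDER BY word) - 1 computes can be reproduced in plain R on a toy vector. This is only a sketch of the labelling logic, not part of the original pipeline; the toy words are made up, and sorting is assumed to be plain alphabetical, as it would be for lowercase class names.

```r
words <- c("cat", "apple", "cat", "bee")
# Rank of each word among the sorted distinct words, shifted to start at 0
label_int <- match(words, sort(unique(words))) - 1L
label_int
```

Each distinct class name gets one zero-based integer, so 340 classes would map to labels 0 to 339.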

To form batches on the fly, we needed the maximum possible speed of extracting random rows from the doodles table. For this we used 3 tricks. The first was to reduce the size of the type that stores the observation ID. In the original dataset the type needed to store the ID is bigint, but the number of observations makes it possible to fit their identifiers, equal to the ordinal number, into the int type. Lookups are much faster in this case. The second trick was to use ORDERED INDEX; we arrived at this decision empirically, after going through all the available options. The third was to use parameterized queries. The essence of the method is to run the PREPARE command once and then reuse the prepared expression for a series of queries of the same type, but in practice the gain over a plain SELECT turned out to be within the statistical error.
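The first trick rests on simple arithmetic, sketched below: with roughly 50 million observations (the figure quoted earlier in the text), sequential row numbers fit comfortably into a 32-bit signed integer. The variable names are illustrative.

```r
n_obs <- 50e6                       # approximate number of drawings
int_max <- .Machine$integer.max     # largest 32-bit signed integer
fits_in_int <- n_obs <= int_max
headroom <- int_max / n_obs         # how many times larger the int range is
c(fits = fits_in_int, headroom = round(headroom))
```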

The data loading process consumes no more than 450 MB of RAM. In other words, the described approach lets you handle datasets weighing tens of gigabytes on almost any budget hardware, including some single-board devices, which is pretty cool.

All that remains is to measure the speed of retrieving (random) data and to evaluate how it scales when sampling batches of different sizes:

Database benchmark

library(ggplot2)

set.seed(0)
# Connect to the database
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

# Number of rows in the table (not defined in the original snippet;
# assumed here to be the row count, since id runs from 1 to n)
n <- DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]]

# Function that prepares the query on the server side
prep_sql <- function(batch_size) {
  sql <- sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)",
                 paste(rep("?", batch_size), collapse = ","))
  res <- DBI::dbSendQuery(con, sql)
  return(res)
}

# Function that fetches the data
fetch_data <- function(rs, batch_size) {
  ids <- sample(seq_len(n), batch_size)
  res <- DBI::dbFetch(DBI::dbBind(rs, as.list(ids)))
  return(res)
}

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    rs <- prep_sql(batch_size)
    bench::mark(
      fetch_data(rs, batch_size),
      min_iterations = 50L
    )
  }
)
# Benchmark columns
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16   23.6ms  54.02ms  93.43ms     18.8        2.6s    49
# 2         32     38ms  84.83ms 151.55ms     11.4       4.29s    49
# 3         64   63.3ms 175.54ms 248.94ms     5.85       8.54s    50
# 4        128   83.2ms 341.52ms 496.24ms     3.00      16.69s    50
# 5        256  232.8ms 653.21ms 847.44ms     1.58      31.66s    50
# 6        512  784.6ms    1.41s    1.98s     0.740       1.1m    49
# 7       1024  681.7ms    2.72s    4.06s     0.377      2.16m    49

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)
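One way to read the benchmark table above is to divide the median time by the batch size and to fit a line on a log-log scale. The sketch below does this with the median values copied from the table; the data frame and column names are illustrative.

```r
bench_tbl <- data.frame(
  batch_size = 2^(4:10),
  median_ms = c(54.02, 84.83, 175.54, 341.52, 653.21, 1410, 2720)
)
# Time spent per fetched row
bench_tbl$ms_per_row <- bench_tbl$median_ms / bench_tbl$batch_size
# A slope of about 1 on the log-log scale means linear scaling
fit <- lm(log2(median_ms) ~ log2(batch_size), data = bench_tbl)
slope <- unname(coef(fit)[2])
bench_tbl
slope
```

The per-row cost stays within roughly 2.5 to 3.4 ms across batch sizes and the slope is close to 1, so fetch time grows about linearly with batch size.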

2. Preparing batches

The whole batch preparation process consists of the following steps:

  1. Parsing several JSONs containing vectors of strings with point coordinates.
  2. Drawing colored lines from the point coordinates on an image of the required size (for example, 256×256 or 128×128).
  3. Converting the resulting images into a tensor.
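The steps above can be sketched without any dependencies. This is not the authors' implementation: the JSON parsing step is skipped (the drawing is given as an already parsed list), lines are approximated by simple linear interpolation instead of a real line-drawing routine, and the helper names rasterize and to_batch are hypothetical.

```r
# A drawing as it might look after JSON parsing: a list of strokes,
# each stroke being list(x_coordinates, y_coordinates) in 0..255.
drawing <- list(
  list(c(10, 200), c(10, 10)),
  list(c(10, 10, 100), c(10, 120, 120))
)

# Rasterize one drawing into a size x size matrix (1 = background, 0 = ink)
rasterize <- function(strokes, size = 256L) {
  img <- matrix(1, nrow = size, ncol = size)
  for (s in strokes) {
    x <- s[[1]]
    y <- s[[2]]
    for (k in seq_len(length(x) - 1L)) {
      # Enough interpolation steps to touch every pixel on the segment
      steps <- max(abs(x[k + 1] - x[k]), abs(y[k + 1] - y[k]), 1)
      frac <- seq(0, 1, length.out = steps + 1)
      px <- round(x[k] + frac * (x[k + 1] - x[k])) + 1   # 1-based columns
      py <- round(y[k] + frac * (y[k + 1] - y[k])) + 1   # 1-based rows
      img[cbind(py, px)] <- 0
    }
  }
  img
}

# Stack several images into a 4-d tensor [batch, height, width, channels]
to_batch <- function(drawings, size = 256L) {
  res <- array(0, dim = c(length(drawings), size, size, 1L))
  for (i in seq_along(drawings)) {
    res[i, , , 1] <- rasterize(drawings[[i]], size)
  }
  res
}

batch <- to_batch(list(drawing, drawing))
dim(batch)
```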

Among the competition's Python kernels, the problem was mostly solved with OpenCV. One of the simplest and most obvious analogues in R would look like this:

Implementing JSON-to-tensor conversion in R

r_process_json_str <- function(json, line.width = 3, 
                               color = TRUE, scale = 1) {
  # Parse JSON
  coords <- jsonlite::fromJSON(json, simplifyMatrix = FALSE)
  tmp <- tempfile()
  # Remove the temporary file when the function exits
  on.exit(unlink(tmp))
  png(filename = tmp, width = 256 * scale, height = 256 * scale, pointsize = 1)
  # Empty plot
  plot.new()
  # Plot window size
  plot.window(xlim = c(256 * scale, 0), ylim = c(256 * scale, 0))
  # Line colors
  cols <- if (color) rainbow(length(coords)) else "#000000"
  for (i in seq_along(coords)) {
    lines(x = coords[[i]][[1]] * scale, y = coords[[i]][[2]] * scale, 
          col = cols[i], lwd = line.width)
  }
  dev.off()
  # Convert the image into a 3-dimensional array
  res <- png::readPNG(tmp)
  return(res)
}

r_process_json_vector <- function(x, ...) {
  res <- lapply(x, r_process_json_str, ...)
  # Combine the 3-dimensional image arrays into a 4-dimensional tensor
  res <- do.call(abind::abind, c(res, along = 0))
  return(res)
}

Drawing is done with standard R tools and saved to a temporary PNG stored in RAM (on Linux, temporary R directories live in /tmp, which is mounted in RAM). This file is then read as a three-dimensional array with numbers ranging from 0 to 1. This matters because the more conventional BMP would be read into a raw array with hex color codes.

Let's test the result:

zip_file <- file.path("data", "train_simplified.zip")
csv_file <- "cat.csv"
unzip(zip_file, files = csv_file, exdir = tempdir(), 
      junkpaths = TRUE, unzip = getOption("unzip"))
tmp_data <- data.table::fread(file.path(tempdir(), csv_file), sep = ",", 
                              select = "drawing", nrows = 10000)
arr <- r_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))

The batch itself will be formed as follows:

res <- r_process_json_vector(tmp_data[1:4, drawing], scale = 0.5)
str(res)
 # num [1:4, 1:128, 1:128, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
 # - attr(*, "dimnames")=List of 4
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL

This implementation seemed suboptimal to us, since forming large batches takes an indecently long time, and we decided to take advantage of our colleagues' experience by using the powerful OpenCV library. At that time there was no ready-made package for R (and there still is none), so a minimal implementation of the required functionality was written in C++ and integrated into the R code using Rcpp.

To solve the problem, the following packages and libraries were used:

  1. OpenCV for working with images and drawing lines. We used pre-installed system libraries and header files, as well as dynamic linking.

  2. xtensor for working with multidimensional arrays and tensors. We used the header files included in the R package of the same name. The library lets you work with multidimensional arrays in both row-major and column-major order.

  3. ndjson for parsing JSON. This library is used in xtensor automatically if it is present in the project.

  4. RcppThread for organizing multi-threaded processing of a vector of JSONs. We used the header files provided by this package. It differs from the more popular RcppParallel in, among other things, its built-in mechanism for interrupting loops.

It should be noted that xtensor turned out to be a godsend: in addition to its extensive functionality and high performance, its developers proved quite responsive and answered questions promptly and in detail. With their help we managed to implement the conversion of OpenCV matrices into xtensor tensors, as well as a way of combining 3-dimensional image tensors into a 4-dimensional tensor of the right dimension (the batch itself).
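Memory layout matters for such conversions because R stores arrays in column-major order, while OpenCV matrices are row-major. A small base R sketch of the two index formulas follows; the helper names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # R fills matrices column by column
dims <- dim(m)

# Linear (1-based) position of element [i, j] under each layout
col_major_index <- function(i, j, dims) (j - 1L) * dims[1] + i
row_major_index <- function(i, j, dims) (i - 1L) * dims[2] + j

col_major_storage <- as.vector(m)      # how R lays the matrix out
row_major_storage <- as.vector(t(m))   # the same matrix, row by row

col_major_storage[col_major_index(1, 2, dims)]
row_major_storage[row_major_index(1, 2, dims)]
```

Both expressions pick out the same element, m[1, 2], from two different memory layouts.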

Materials for learning Rcpp, xtensor and RcppThread

https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation

https://docs.opencv.org/4.0.1/d7/dbd/group__imgproc.html

https://xtensor.readthedocs.io/en/latest/

https://xtensor.readthedocs.io/en/latest/file_loading.html#loading-json-data-into-xtensor

https://cran.r-project.org/web/packages/RcppThread/vignettes/RcppThread-vignette.pdf

To compile files that use system headers and dynamic linking with libraries installed on the system, we used the plugin mechanism implemented in the Rcpp package. To find paths and flags automatically, we used the popular Linux utility pkg-config.

Implementation of an Rcpp plugin for using the OpenCV library

Rcpp::registerPlugin("opencv", function() {
  # Possible package names
  pkg_config_name <- c("opencv", "opencv4")
  # pkg-config binary
  pkg_config_bin <- Sys.which("pkg-config")
  # Check that the utility is available on the system
  checkmate::assert_file_exists(pkg_config_bin, access = "x")
  # Check that an OpenCV configuration file for pkg-config exists
  check <- sapply(pkg_config_name, 
                  function(pkg) system(paste(pkg_config_bin, pkg)))
  if (all(check != 0)) {
    stop("OpenCV config for the pkg-config not found", call. = FALSE)
  }

  # Use the first name that pkg-config recognizes
  pkg_config_name <- pkg_config_name[check == 0][1L]
  list(env = list(
    PKG_CXXFLAGS = system(paste(pkg_config_bin, "--cflags", pkg_config_name), 
                          intern = TRUE),
    PKG_LIBS = system(paste(pkg_config_bin, "--libs", pkg_config_name), 
                      intern = TRUE)
  ))
})

As a result of the plugin's work, the following values will be substituted during compilation:

Rcpp:::.plugins$opencv()$env

# $PKG_CXXFLAGS
# [1] "-I/usr/include/opencv"
#
# $PKG_LIBS
# [1] "-lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core"

The implementation code for parsing JSON and forming a batch to pass to the model is given under the spoiler. First, add the local project directory to the search path for header files (needed for ndjson):

Sys.setenv("PKG_CXXFLAGS" = paste0("-I", normalizePath(file.path("src"))))

Implementation of JSON-to-tensor conversion in C++

// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::plugins(opencv)]]
// [[Rcpp::depends(xtensor)]]
// [[Rcpp::depends(RcppThread)]]

#include <xtensor/xjson.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-r/rtensor.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <Rcpp.h>
#include <RcppThread.h>

// Type aliases
using RcppThread::parallelFor;
using json = nlohmann::json;
using points = xt::xtensor<double,2>;     // Point coordinates extracted from JSON
using strokes = std::vector<points>;      // Point coordinates extracted from JSON
using xtensor3d = xt::xtensor<double, 3>; // Tensor for storing an image matrix
using xtensor4d = xt::xtensor<double, 4>; // Tensor for storing a set of images
using rtensor3d = xt::rtensor<double, 3>; // Wrapper for exporting to R
using rtensor4d = xt::rtensor<double, 4>; // Wrapper for exporting to R

// Static constants
// Image size in pixels
const static int SIZE = 256;
// Line type
// See https://en.wikipedia.org/wiki/Pixel_connectivity#2-dimensional
const static int LINE_TYPE = cv::LINE_4;
// Line width in pixels
const static int LINE_WIDTH = 3;
// Resize algorithm
// https://docs.opencv.org/3.1.0/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
const static int RESIZE_TYPE = cv::INTER_LINEAR;

// Template for converting an OpenCV matrix into a tensor
template <typename T, int NCH, typename XT=xt::xtensor<T,3,xt::layout_type::column_major>>
XT to_xt(const cv::Mat_<cv::Vec<T, NCH>>& src) {
  // Shape of the target tensor
  std::vector<int> shape = {src.rows, src.cols, NCH};
  // Total number of elements in the array
  size_t size = src.total() * NCH;
  // Convert cv::Mat to xt::xtensor
  XT res = xt::adapt((T*) src.data, size, xt::no_ownership(), shape);
  return res;
}

// Convert JSON into a list of point coordinates
strokes parse_json(const std::string& x) {
  auto j = json::parse(x);
  // The parsing result must be an array
  if (!j.is_array()) {
    throw std::runtime_error("'x' must be JSON array.");
  }
  strokes res;
  res.reserve(j.size());
  for (const auto& a: j) {
    // Each element of the array must be a 2-dimensional array
    if (!a.is_array() || a.size() != 2) {
      throw std::runtime_error("'x' must include only 2d arrays.");
    }
    // Extract the vector of points
    auto p = a.get<points>();
    res.push_back(p);
  }
  return res;
}

// Drawing lines
// HSV colors
cv::Mat ocv_draw_lines(const strokes& x, bool color = true) {
  // Source matrix type
  auto stype = color ? CV_8UC3 : CV_8UC1;
  // Resulting matrix type
  auto dtype = color ? CV_32FC3 : CV_32FC1;
  auto bg = color ? cv::Scalar(0, 0, 255) : cv::Scalar(255);
  auto col = color ? cv::Scalar(0, 255, 220) : cv::Scalar(0);
  cv::Mat img = cv::Mat(SIZE, SIZE, stype, bg);
  // Number of lines
  size_t n = x.size();
  for (const auto& s: x) {
    // Number of points in the line
    size_t n_points = s.shape()[1];
    // "i + 1 < n_points" avoids unsigned underflow for an empty stroke
    for (size_t i = 0; i + 1 < n_points; ++i) {
      // Start point of the segment
      cv::Point from(s(0, i), s(1, i));
      // End point of the segment
      cv::Point to(s(0, i + 1), s(1, i + 1));
      // Draw the line
      cv::line(img, from, to, col, LINE_WIDTH, LINE_TYPE);
    }
    if (color) {
      // Change the line color
      col[0] += 180 / n;
    }
  }
  if (color) {
    // Convert the color representation to RGB
    cv::cvtColor(img, img, cv::COLOR_HSV2RGB);
  }
  // Convert to float32 in the range [0, 1]
  img.convertTo(img, dtype, 1 / 255.0);
  return img;
}

// Process JSON and get a tensor with the image data
xtensor3d process(const std::string& x, double scale = 1.0, bool color = true) {
  auto p = parse_json(x);
  auto img = ocv_draw_lines(p, color);
  if (scale != 1) {
    cv::Mat out;
    cv::resize(img, out, cv::Size(), scale, scale, RESIZE_TYPE);
    cv::swap(img, out);
    out.release();
  }
  xtensor3d arr = color ? to_xt<double,3>(img) : to_xt<double,1>(img);
  return arr;
}

// [[Rcpp::export]]
rtensor3d cpp_process_json_str(const std::string& x, 
                               double scale = 1.0, 
                               bool color = true) {
  xtensor3d res = process(x, scale, color);
  return res;
}

// [[Rcpp::export]]
rtensor4d cpp_process_json_vector(const std::vector<std::string>& x, 
                                  double scale = 1.0, 
                                  bool color = false) {
  size_t n = x.size();
  size_t dim = floor(SIZE * scale);
  size_t channels = color ? 3 : 1;
  xtensor4d res({n, dim, dim, channels});
  parallelFor(0, n, [&x, &res, scale, color](int i) {
    xtensor3d tmp = process(x[i], scale, color);
    auto view = xt::view(res, i, xt::all(), xt::all(), xt::all());
    view = tmp;
  });
  return res;
}

This code should be placed in the file src/cv_xt.cpp and compiled with the command Rcpp::sourceCpp(file = "src/cv_xt.cpp", env = .GlobalEnv); it also requires nlohmann/json.hpp from the repository. The code is divided into several functions:

  • to_xt - a templated function for converting an image matrix (cv::Mat) into an xt::xtensor tensor;

  • parse_json - parses a JSON string, extracts the point coordinates and packs them into a vector;

  • ocv_draw_lines - draws multi-colored lines from the resulting vector of points;

  • process - combines the functions above and also adds the ability to scale the resulting image;

  • cpp_process_json_str - a wrapper over the process function that exports the result to an R object (a multidimensional array);

  • cpp_process_json_vector - a wrapper over the cpp_process_json_str function that lets you process a vector of strings in multi-threaded mode.

To draw multi-colored lines, the HSV color model was used, followed by conversion to RGB. Let's test the result:

arr <- cpp_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))
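The colouring scheme used in ocv_draw_lines can be mimicked in base R: the hue starts at 0 and advances by the integer step 180 %/% n for each of the n strokes (OpenCV stores 8-bit hue in the range 0 to 179), with full saturation and a value of 220 out of 255. This is only a sketch, and stroke_colors is a hypothetical helper name.

```r
stroke_colors <- function(n) {
  step <- 180 %/% n                 # integer hue step, as in the C++ code
  hue <- (seq_len(n) - 1L) * step   # hue in OpenCV units
  grDevices::hsv(h = hue / 180, s = 1, v = 220 / 255)
}
cols <- stroke_colors(4)
cols
```

For a drawing with four strokes this yields four distinct hues, a quarter of the colour wheel apart.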

Comparing the speed of the R and C++ implementations

res_bench <- bench::mark(
  r_process_json_str(tmp_data[4, drawing], scale = 0.5),
  cpp_process_json_str(tmp_data[4, drawing], scale = 0.5),
  check = FALSE,
  min_iterations = 100
)
# Benchmark columns
cols <- c("expression", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   expression                min     median       max `itr/sec` total_time  n_itr
#   <chr>                <bch:tm>   <bch:tm>  <bch:tm>     <dbl>   <bch:tm>  <int>
# 1 r_process_json_str     3.49ms     3.55ms    4.47ms      273.      490ms    134
# 2 cpp_process_json_str   1.94ms     2.02ms    5.32ms      489.      497ms    243

library(ggplot2)
# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    .data <- tmp_data[sample(seq_len(.N), batch_size), drawing]
    bench::mark(
      r_process_json_vector(.data, scale = 0.5),
      cpp_process_json_vector(.data,  scale = 0.5),
      min_iterations = 50,
      check = FALSE
    )
  }
)

res_bench[, cols]

#    expression   batch_size      min   median      max `itr/sec` total_time n_itr
#    <chr>             <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
#  1 r                   16   50.61ms  53.34ms  54.82ms    19.1     471.13ms     9
#  2 cpp                 16    4.46ms   5.39ms   7.78ms   192.      474.09ms    91
#  3 r                   32   105.7ms 109.74ms 212.26ms     7.69        6.5s    50
#  4 cpp                 32    7.76ms  10.97ms  15.23ms    95.6     522.78ms    50
#  5 r                   64  211.41ms 226.18ms 332.65ms     3.85      12.99s    50
#  6 cpp                 64   25.09ms  27.34ms  32.04ms    36.0        1.39s    50
#  7 r                  128   534.5ms 627.92ms 659.08ms     1.61      31.03s    50
#  8 cpp                128   56.37ms  58.46ms  66.03ms    16.9        2.95s    50
#  9 r                  256     1.15s    1.18s    1.29s     0.851     58.78s    50
# 10 cpp                256  114.97ms 117.39ms 130.09ms     8.45       5.92s    50
# 11 r                  512     2.09s    2.15s    2.32s     0.463       1.8m    50
# 12 cpp                512  230.81ms  235.6ms 261.99ms     4.18      11.97s    50
# 13 r                 1024        4s    4.22s     4.4s     0.238       3.5m    50
# 14 cpp               1024  410.48ms 431.43ms 462.44ms     2.33      21.45s    50

ggplot(res_bench, aes(x = factor(batch_size), y = median, 
                      group =  expression, color = expression)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal() +
  scale_color_discrete(name = "", labels = c("cpp", "r")) +
  theme(legend.position = "bottom") 

As you can see, the speed gain turned out to be very significant, and it is not possible to catch up with the C++ code by parallelizing the R code.
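The size of the gain can be read off the table above by dividing the median times of the two implementations. The sketch below uses the median values copied from that table; the variable names are illustrative.

```r
batch_size <- 2^(4:10)
# Median times in milliseconds, copied from the benchmark table
r_ms   <- c(53.34, 109.74, 226.18, 627.92, 1180, 2150, 4220)
cpp_ms <- c(5.39, 10.97, 27.34, 58.46, 117.39, 235.6, 431.43)
speedup <- r_ms / cpp_ms
round(setNames(speedup, batch_size), 1)
```

The ratio stays roughly between 8 and 11 for every batch size.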

3. Iterators for unloading batches from the database

R has a well-deserved reputation as a language for processing data that fits in RAM, while Python is more associated with iterative data processing, which makes it easy and natural to implement out-of-core computations (computations that use external memory). A classic example, and one relevant to the problem described here, is deep neural networks trained by gradient descent, where the gradient is approximated at each step from a small portion of the observations, or mini-batch.
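As a toy illustration of the mini-batch idea (a linear regression, not the competition model), the sketch below estimates two parameters by gradient descent, where each step sees only a small random subset of the data. All names and hyperparameters are made up for the example.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.1)   # true intercept 2, true slope 3

w <- c(0, 0)          # parameters: intercept and slope
lr <- 0.1             # learning rate
batch_size <- 32

for (step in 1:500) {
  idx <- sample.int(n, batch_size)           # draw a mini-batch
  err <- (w[1] + w[2] * x[idx]) - y[idx]     # prediction error on the batch
  grad <- c(mean(err), mean(err * x[idx]))   # gradient of half the mean squared error
  w <- w - lr * grad
}
w
```

Only batch_size observations are needed in memory at each step, which is what makes the approach suitable for data that does not fit in RAM.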

Deep learning frameworks written in Python have special classes that implement iterators over data: tables, pictures in folders, binary formats and so on. You can use ready-made options or write your own for specific tasks. In R we can take advantage of all the features of the Python library keras with its various backends through the package of the same name, which in turn works on top of the reticulate package. The latter deserves a separate long article; it not only lets you run Python code from R, but also lets you pass objects between R and Python sessions, automatically performing all the necessary type conversions.

We got rid of the need to keep all the data in RAM by using MonetDBLite, and all the "neural network" work will be done by the original code in Python, so we only have to write an iterator over the data, since nothing ready-made exists for this situation in either R or Python. There are essentially only two requirements for it: it must return batches in an endless loop and keep its state between iterations (the latter is implemented in R in the simplest way, with closures). Previously it was necessary to explicitly convert R arrays into numpy arrays inside the iterator, but the current version of the keras package does this itself.
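The two requirements, an endless loop and state kept between calls, can be shown with a stripped-down closure. This sketch returns only ids rather than images and labels, and make_batch_iterator is a hypothetical name.

```r
make_batch_iterator <- function(ids, batch_size) {
  n_batches <- length(ids) %/% batch_size
  stopifnot(n_batches >= 1L)
  shuffled <- ids[sample.int(length(ids))]
  i <- 0L   # state that survives between calls
  function() {
    if (i >= n_batches) {
      # Epoch finished: reshuffle and start over
      shuffled <<- ids[sample.int(length(ids))]
      i <<- 0L
    }
    i <<- i + 1L
    shuffled[((i - 1L) * batch_size + 1L):(i * batch_size)]
  }
}

next_batch <- make_batch_iterator(ids = 1:10, batch_size = 4)
b1 <- next_batch()
b2 <- next_batch()
b3 <- next_batch()   # the third call starts a new epoch
```

The counter i and the shuffled ids live in the enclosing environment, so every call continues where the previous one stopped.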

The iterator for training and validation data turned out as follows:

Iterator for training and validation data

train_generator <- function(db_connection = con,
                            samples_index,
                            num_classes = 340,
                            batch_size = 32,
                            scale = 1,
                            color = FALSE,
                            imagenet_preproc = FALSE) {
  # Check the arguments
  checkmate::assert_class(db_connection, "DBIConnection")
  checkmate::assert_integerish(samples_index)
  checkmate::assert_count(num_classes)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Shuffle so that used batch indices can be taken and dropped in order
  dt <- data.table::data.table(id = sample(samples_index))
  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  # Keep only complete batches and index them
  dt <- dt[, if (.N == batch_size) .SD, keyby = batch]
  # Set the counter
  i <- 1
  # Number of batches
  max_i <- dt[, max(batch)]

  # Prepare the expression for unloading the data
  sql <- sprintf(
    "PREPARE SELECT drawing, label_int FROM doodles WHERE id IN (%s)",
    paste(rep("?", batch_size), collapse = ",")
  )
  res <- DBI::dbSendQuery(db_connection, sql)

  # Analogue of keras::to_categorical
  to_categorical <- function(x, num) {
    n <- length(x)
    m <- numeric(n * num)
    m[x * n + seq_len(n)] <- 1
    dim(m) <- c(n, num)
    return(m)
  }

  # Closure
  function() {
    # НачинаСм Π½ΠΎΠ²ΡƒΡŽ эпоху
    if (i > max_i) {
      dt[, id := sample(id)]
      data.table::setkey(dt, batch)
      # БбрасываСм счётчик
      i <<- 1
      max_i <<- dt[, max(batch)]
    }

    # ID для Π²Ρ‹Π³Ρ€ΡƒΠ·ΠΊΠΈ Π΄Π°Π½Π½Ρ‹Ρ…
    batch_ind <- dt[batch == i, id]
    # Π’Ρ‹Π³Ρ€ΡƒΠ·ΠΊΠ° Π΄Π°Π½Π½Ρ‹Ρ…
    batch <- DBI::dbFetch(DBI::dbBind(res, as.list(batch_ind)), n = -1)

    # Π£Π²Π΅Π»ΠΈΡ‡ΠΈΠ²Π°Π΅ΠΌ счётчик
    i <<- i + 1

    # ΠŸΠ°Ρ€ΡΠΈΠ½Π³ JSON ΠΈ ΠΏΠΎΠ΄Π³ΠΎΡ‚ΠΎΠ²ΠΊΠ° массива
    batch_x <- cpp_process_json_vector(batch$drawing, scale = scale, color = color)
    if (imagenet_preproc) {
      # Π¨ΠΊΠ°Π»ΠΈΡ€ΠΎΠ²Π°Π½ΠΈΠ΅ c ΠΈΠ½Ρ‚Π΅Ρ€Π²Π°Π»Π° [0, 1] Π½Π° ΠΈΠ½Ρ‚Π΅Ρ€Π²Π°Π» [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }

    batch_y <- to_categorical(batch$label_int, num_classes)
    result <- list(batch_x, batch_y)
    return(result)
  }
}

The function takes as input a variable holding the database connection, the indices of the rows to use, the number of classes, the batch size, the scale (scale = 1 corresponds to rendering 256x256 pixel images, scale = 0.5 to 128x128 pixels), a color flag (color = FALSE renders in grayscale, while with color = TRUE each stroke is drawn in a new color), and a preprocessing flag for networks pretrained on imagenet. The latter is needed to rescale pixel values from the interval [0, 1] to the interval [-1, 1], which was used when training the models supplied with keras.

The outer function contains argument type checks, a data.table with randomly shuffled row numbers from samples_index and batch numbers, a counter and the maximum number of batches, as well as an SQL statement for fetching data from the database. In addition, we defined a fast analogue of the keras::to_categorical() function inside it. We used almost all the data for training, leaving half a percent for validation, so the epoch size was limited by the steps_per_epoch parameter when calling keras::fit_generator(), and the condition if (i > max_i) only ever fired for the validation iterator.
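The index arithmetic in the to_categorical() analogue relies on R storing matrices in column-major order: in a matrix with n rows, the element in row r and zero-based class column k sits at linear position k * n + r. A small worked example with made-up labels (the function body is copied from the iterator above, the labels are illustrative):

```r
# Fast one-hot encoding for zero-based integer class labels
to_categorical <- function(x, num) {
  n <- length(x)
  m <- numeric(n * num)
  # Linear index of (row r, class column x[r]) in a column-major n x num matrix
  m[x * n + seq_len(n)] <- 1
  dim(m) <- c(n, num)
  m
}

labels <- c(0L, 2L, 1L)            # three samples, classes 0, 2 and 1
one_hot <- to_categorical(labels, num = 3L)
# Row 1 has its 1 in column 1, row 2 in column 3, row 3 in column 2
```

Filling a preallocated vector through a single vectorised index avoids any per-row loop, which is why this is cheaper than building the matrix row by row.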

In the inner function, the row indices for the next batch are retrieved, records are fetched from the database while the batch counter is incremented, JSON is parsed (by the cpp_process_json_vector() function, written in C++), and arrays corresponding to the images are created. Then one-hot vectors with class labels are created, and the arrays of pixel values and labels are combined into a list, which is the return value. To speed things up, we used indexes on data.table tables and modification by reference; without these data.table "features" it is quite hard to imagine working effectively with any significant amount of data in R.
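The batch numbering and the "full batches only" filter from the outer function can be tried on a toy table. This is a sketch with ten made-up IDs and a batch size of four; only the data.table expressions are taken from the iterator itself.

```r
library(data.table)

batch_size <- 4L
dt <- data.table(id = 101:110)           # ten illustrative row IDs

# Assign batch numbers by reference: rows 1-4 -> 1, 5-8 -> 2, 9-10 -> 3
dt[, batch := (.I - 1L) %/% batch_size + 1L]

# Keep only complete batches; keyby sorts by batch and sets it as the key
full <- dt[, if (.N == batch_size) .SD, keyby = batch]

# Keyed subset: IDs that make up the second batch
second_batch <- full[batch == 2L, id]
```

The incomplete third batch (IDs 109 and 110) is dropped, so every call of the iterator can bind exactly batch_size parameters to the prepared SQL statement.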

Speed measurements on a Core i5 laptop are as follows:

Iterator benchmark

library(Rcpp)
library(keras)
library(ggplot2)

source("utils/rcpp.R")
source("utils/keras_iterator.R")

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

ind <- seq_len(DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]])
num_classes <- DBI::dbGetQuery(con, "SELECT max(label_int) + 1 FROM doodles")[[1L]]

# Indices for the training set
train_ind <- sample(ind, floor(length(ind) * 0.995))
# Indices for the validation set
val_ind <- ind[-train_ind]
rm(ind)
# Scale factor
scale <- 0.5

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    it1 <- train_generator(
      db_connection = con,
      samples_index = train_ind,
      num_classes = num_classes,
      batch_size = batch_size,
      scale = scale
    )
    bench::mark(
      it1(),
      min_iterations = 50L
    )
  }
)
# Benchmark columns to show
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16     25ms  64.36ms   92.2ms     15.9       3.09s    49
# 2         32   48.4ms 118.13ms 197.24ms     8.17       5.88s    48
# 3         64   69.3ms 117.93ms 181.14ms     8.57       5.83s    50
# 4        128  157.2ms 240.74ms 503.87ms     3.85      12.71s    49
# 5        256  359.3ms 613.52ms 988.73ms     1.54       30.5s    47
# 6        512  884.7ms    1.53s    2.07s     0.674      1.11m    45
# 7       1024     2.7s    3.83s    5.47s     0.261      2.81m    44

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
    geom_point() +
    geom_line() +
    ylab("median time, s") +
    theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)


If you have enough RAM, you can seriously speed up the database by moving it into that same RAM (32 GB is enough for our task). In Linux, the /dev/shm partition is mounted by default and takes up to half of the RAM capacity. You can allocate more by editing /etc/fstab to get an entry like tmpfs /dev/shm tmpfs defaults,size=25g 0 0. Be sure to reboot and check the result by running df -h.

The iterator for test data looks much simpler, since the test dataset fits entirely in RAM:

Iterator for test data

test_generator <- function(dt,
                           batch_size = 32,
                           scale = 1,
                           color = FALSE,
                           imagenet_preproc = FALSE) {

  # Check arguments
  checkmate::assert_data_table(dt)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  data.table::setkey(dt, batch)
  i <- 1
  max_i <- dt[, max(batch)]

  # Closure
  function() {
    batch_x <- cpp_process_json_vector(dt[batch == i, drawing],
                                       scale = scale, color = color)
    if (imagenet_preproc) {
      # Rescale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    result <- list(batch_x)
    i <<- i + 1
    return(result)
  }
}

4. Choosing a Model Architecture

The first architecture we used was mobilenet v1, whose features are discussed in this post. It is included as standard in keras and, accordingly, is available in the R package of the same name. But when we tried to use it with single-channel images, a strange thing turned up: the input tensor must always have the dimension (batch, height, width, 3), that is, the number of channels cannot be changed. There is no such limitation in Python, so we rushed in and wrote our own implementation of this architecture, following the original paper (without the dropout that is in the keras version):

Mobilenet v1 architecture

library(keras)

top_3_categorical_accuracy <- custom_metric(
    name = "top_3_categorical_accuracy",
    metric_fn = function(y_true, y_pred) {
         metric_top_k_categorical_accuracy(y_true, y_pred, k = 3)
    }
)

layer_sep_conv_bn <- function(object, 
                              filters,
                              alpha = 1,
                              depth_multiplier = 1,
                              strides = c(2, 2)) {

  # NB! depth_multiplier !=  resolution multiplier
  # https://github.com/keras-team/keras/issues/10349

  layer_depthwise_conv_2d(
    object = object,
    kernel_size = c(3, 3), 
    strides = strides,
    padding = "same",
    depth_multiplier = depth_multiplier
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() %>%
  layer_conv_2d(
    filters = filters * alpha,
    kernel_size = c(1, 1), 
    strides = c(1, 1)
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() 
}

get_mobilenet_v1 <- function(input_shape = c(224, 224, 1),
                             num_classes = 340,
                             alpha = 1,
                             depth_multiplier = 1,
                             optimizer = optimizer_adam(lr = 0.002),
                             loss = "categorical_crossentropy",
                             metrics = c("categorical_crossentropy",
                                         top_3_categorical_accuracy)) {

  inputs <- layer_input(shape = input_shape)

  outputs <- inputs %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), strides = c(2, 2), padding = "same") %>%
    layer_batch_normalization() %>% 
    layer_activation_relu() %>%
    layer_sep_conv_bn(filters = 64, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(1, 1)) %>%
    layer_global_average_pooling_2d() %>%
    layer_dense(units = num_classes) %>%
    layer_activation_softmax()

    model <- keras_model(
      inputs = inputs,
      outputs = outputs
    )

    model %>% compile(
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )

    return(model)
}

The disadvantages of this approach are obvious. We wanted to test a lot of models, but did not want to rewrite each architecture by hand. We were also deprived of the opportunity to use the weights of models pretrained on imagenet. As usual, studying the documentation helped. The get_config() function lets you obtain a description of the model in a form suitable for editing (base_model_conf$layers is a regular R list), and the from_config() function performs the reverse conversion to a model object:

base_model_conf <- get_config(base_model)
base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
base_model <- from_config(base_model_conf)

Now it is not difficult to write a universal function to obtain any of the models supplied with keras, with or without weights trained on imagenet:

Function for loading ready-made architectures

get_model <- function(name = "mobilenet_v2",
                      input_shape = NULL,
                      weights = "imagenet",
                      pooling = "avg",
                      num_classes = NULL,
                      optimizer = keras::optimizer_adam(lr = 0.002),
                      loss = "categorical_crossentropy",
                      metrics = NULL,
                      color = TRUE,
                      compile = FALSE) {
  # Check arguments
  checkmate::assert_string(name)
  checkmate::assert_integerish(input_shape, lower = 1, upper = 256, len = 3)
  checkmate::assert_count(num_classes)
  checkmate::assert_flag(color)
  checkmate::assert_flag(compile)

  # Get the object from the keras package
  model_fun <- get0(paste0("application_", name), envir = asNamespace("keras"))
  # Check that the object exists in the package
  if (is.null(model_fun)) {
    stop("Model ", shQuote(name), " not found.", call. = FALSE)
  }

  base_model <- model_fun(
    input_shape = input_shape,
    include_top = FALSE,
    weights = weights,
    pooling = pooling
  )

  # If the image is not in color, change the input dimension
  if (!color) {
    base_model_conf <- keras::get_config(base_model)
    base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
    base_model <- keras::from_config(base_model_conf)
  }

  predictions <- keras::get_layer(base_model, "global_average_pooling2d_1")$output
  predictions <- keras::layer_dense(predictions, units = num_classes, activation = "softmax")
  model <- keras::keras_model(
    inputs = base_model$input,
    outputs = predictions
  )

  if (compile) {
    keras::compile(
      object = model,
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )
  }

  return(model)
}

When using single-channel images, no pretrained weights are used. This could be fixed: use the get_weights() function to get the model weights as a list of R arrays, change the dimension of the first element of this list (by taking one color channel or averaging all three), and then load the weights back into the model with the set_weights() function. We never added this functionality, because at this stage it was already clear that it was more productive to work with color images.
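The array manipulation described above (which we did not implement) might look roughly like this. The sketch uses a random array as a stand-in for the first convolution kernel and assumes the usual height x width x input channels x filters layout; kernel_rgb and kernel_gray are illustrative names, and no keras calls are made.

```r
set.seed(42)

# Stand-in for the first element of get_weights(): a 3x3 kernel with
# 3 input channels and 32 filters (assumed layout: h x w x channels x filters)
kernel_rgb <- array(rnorm(3 * 3 * 3 * 32), dim = c(3, 3, 3, 32))

# Average over the channel axis (3rd dimension); the result is 3 x 3 x 32
kernel_gray <- apply(kernel_rgb, c(1, 2, 4), mean)

# Restore the channel axis with length 1, giving 3 x 3 x 1 x 32
dim(kernel_gray) <- c(3, 3, 1, 32)
```

Taking a single channel instead of the average would be kernel_rgb[, , 1, , drop = FALSE]. Either way, the modified array would replace the first element of the weight list before it is passed to set_weights().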

We carried out most of the experiments using mobilenet versions 1 and 2, as well as resnet34. More modern architectures such as SE-ResNeXt performed well in this competition. Unfortunately, we did not have ready-made implementations at our disposal, and we did not write our own (but we definitely will).

5. Script parameterization

For convenience, all the code for launching training was designed as a single script, parameterized using docopt as follows:

doc <- '
Usage:
  train_nn.R --help
  train_nn.R --list-models
  train_nn.R [options]

Options:
  -h --help                   Show this message.
  -l --list-models            List available models.
  -m --model=<model>          Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>      Batch size [default: 32].
  -s --scale-factor=<ratio>   Scale factor [default: 0.5].
  -c --color                  Use color lines [default: FALSE].
  -d --db-dir=<path>          Path to database directory [default: Sys.getenv("db_dir")].
  -r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
  -n --n-gpu=<number>         Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)

The docopt package is an implementation of http://docopt.org/ for R. With its help, scripts are launched with simple commands such as Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db or ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db, if the file train_nn.R is executable (this command starts training the resnet50 model on three-color images of 128x128 pixels; the database must be located in the folder /home/andrey/doodle_db). You can add the learning rate, optimizer type, and any other customizable parameters to the list. While preparing this publication, it turned out that the mobilenet_v2 architecture from the current version of keras cannot be used from R because of changes not accounted for in the R package; we are waiting for them to fix it.
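One detail worth keeping in mind is that docopt returns option values as character strings, so numeric parameters need an explicit conversion. A minimal sketch with a shortened usage text and arguments passed explicitly instead of read from the command line; the variable names model_name, batch_size and color are illustrative, not taken from the actual training script:

```r
library(docopt)

# Shortened version of the usage text, for illustration only
doc <- '
Usage:
  train_nn.R [options]

Options:
  -m --model=<model>      Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>  Batch size [default: 32].
  -c --color              Use color lines.
'

# Simulate `train_nn.R -m resnet50 -c` without touching commandArgs()
args <- docopt(doc, args = c("-m", "resnet50", "-c"))

model_name <- args[["--model"]]                  # character
batch_size <- as.integer(args[["--batch-size"]]) # default arrives as "32"
color      <- isTRUE(args[["--color"]])          # flags are logical
```

Passing args explicitly also makes it easy to check the parsing logic from an interactive session before running the script on a server.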

This approach made it possible to significantly speed up experiments with different models compared to the more traditional launching of scripts in RStudio (we note the tfruns package as a possible alternative). But the main advantage is the ability to easily manage the launching of scripts in Docker or simply on a server, without installing RStudio for this.

6. Dockerization of scripts

We used Docker to ensure portability of the model training environment between team members and for rapid deployment in the cloud. You can start getting acquainted with this tool, which is relatively unusual for an R programmer, with this series of publications or video course.

Docker lets you both create your own images from scratch and use other images as a base for creating your own. Analyzing the available options, we came to the conclusion that installing the NVIDIA drivers, the CUDA + cuDNN libraries, and the Python libraries makes up a fairly large part of the image, so we decided to take the official image tensorflow/tensorflow:1.12.0-gpu as a base and add the necessary R packages to it.

The final Dockerfile looked like this:

Dockerfile

FROM tensorflow/tensorflow:1.12.0-gpu

MAINTAINER Artem Klevtsov <[email protected]>

SHELL ["/bin/bash", "-c"]

ARG LOCALE="en_US.UTF-8"
ARG APT_PKG="libopencv-dev r-base r-base-dev littler"
ARG R_BIN_PKG="futile.logger checkmate data.table rcpp rapidjsonr dbi keras jsonlite curl digest remotes"
ARG R_SRC_PKG="xtensor RcppThread docopt MonetDBLite"
ARG PY_PIP_PKG="keras"
ARG DIRS="/db /app /app/data /app/models /app/logs"

RUN source /etc/os-release && \
    echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${UBUNTU_CODENAME}-cran35/" > /etc/apt/sources.list.d/cran35.list && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 && \
    add-apt-repository -y ppa:marutter/c2d4u3.5 && \
    add-apt-repository -y ppa:timsc/opencv-3.4 && \
    apt-get update && \
    apt-get install -y locales && \
    locale-gen ${LOCALE} && \
    apt-get install -y --no-install-recommends ${APT_PKG} && \
    ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r && \
    ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
    ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r && \
    echo 'options(Ncpus = parallel::detectCores())' >> /etc/R/Rprofile.site && \
    echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site && \
    apt-get install -y $(printf "r-cran-%s " ${R_BIN_PKG}) && \
    install.r ${R_SRC_PKG} && \
    pip install ${PY_PIP_PKG} && \
    mkdir -p ${DIRS} && \
    chmod 777 ${DIRS} && \
    rm -rf /tmp/downloaded_packages/ /tmp/*.rds && \
    rm -rf /var/lib/apt/lists/*

COPY utils /app/utils
COPY src /app/src
COPY tests /app/tests
COPY bin/*.R /app/

ENV DBDIR="/db"
ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/app:${PATH}"

WORKDIR /app

VOLUME /db
VOLUME /app

CMD bash

For convenience, the packages used were put into variables; the bulk of the written scripts is copied into the container during the build. We also changed the command shell to /bin/bash for convenient use of the contents of /etc/os-release. This avoided the need to specify the OS version in the code.

In addition, a small bash script was written that lets you launch a container with various commands. For example, these could be scripts for training neural networks that were previously placed inside the container, or a command shell for debugging and monitoring the container:

Script for launching the container

#!/bin/sh

DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"

if [ -z "$1" ]; then
    CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
    ARGS="${ARGS} -ti"
else
    CMD="Rscript /app/train_nn.R $@"
fi

docker run ${ARGS} doodles-tf ${CMD}

If this bash script is run without parameters, the script train_nn.R is called inside the container with default values; if the first positional argument is "bash", the container starts interactively with a command shell. In all other cases, the values of the positional arguments are substituted: CMD="Rscript /app/train_nn.R $@".

It is worth noting that the directories with the source data and the database, as well as the directory for saving trained models, are mounted inside the container from the host system, which lets you access the results of the scripts without unnecessary manipulation.

7. Using multiple GPUs on Google Cloud

One of the features of the competition was very noisy data (see the title image, borrowed from @Leigh.plt from the ODS slack). Large batches help combat this, and after experiments on a PC with 1 GPU, we decided to master training models on several GPUs in the cloud. We used GoogleCloud (a good guide to the basics) because of the large selection of available configurations, reasonable prices, and the $300 bonus. Out of greed, I ordered a 4xV100 instance with an SSD and a ton of RAM, and that was a big mistake. Such a machine eats money quickly; you can go broke experimenting without a proven pipeline. For learning purposes, it is better to take a K80. But the large amount of RAM came in handy: the cloud SSD did not impress with its performance, so the database was moved to /dev/shm.

Of greatest interest is the code fragment responsible for using multiple GPUs. First, the model is created on the CPU using a context manager, just like in Python:

with(tensorflow::tf$device("/cpu:0"), {
  model_cpu <- get_model(
    name = model_name,
    input_shape = input_shape,
    weights = weights,
    metrics = c(top_3_categorical_accuracy),
    compile = FALSE
  )
})

Then the uncompiled (this is important) model is copied to the given number of available GPUs, and only after that is it compiled:

model <- keras::multi_gpu_model(model_cpu, gpus = n_gpu)
keras::compile(
  object = model,
  optimizer = keras::optimizer_adam(lr = 0.0004),
  loss = "categorical_crossentropy",
  metrics = c(top_3_categorical_accuracy)
)

The classic technique of freezing all layers except the last one, training the last layer, then unfreezing and retraining the entire model on several GPUs could not be implemented.

Training was monitored without using tensorboard; we limited ourselves to recording logs and saving models with informative names after each epoch:

Callbacks

# Log file name template
log_file_tmpl <- file.path("logs", sprintf(
  "%s_%d_%dch_%s.csv",
  model_name,
  dim_size,
  channels,
  format(Sys.time(), "%Y%m%d%H%M%OS")
))
# Model file name template
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name,
  dim_size,
  channels
))

callbacks_list <- list(
  keras::callback_csv_logger(
    filename = log_file_tmpl
  ),
  keras::callback_early_stopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 8,
    verbose = 1,
    mode = "min"
  ),
  keras::callback_reduce_lr_on_plateau(
    monitor = "val_loss",
    factor = 0.5, # halve the learning rate
    patience = 4,
    verbose = 1,
    min_delta = 1e-4,
    mode = "min"
  ),
  keras::callback_model_checkpoint(
    filepath = model_file_tmpl,
    monitor = "val_loss",
    save_best_only = FALSE,
    save_weights_only = FALSE,
    mode = "min"
  )
)
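How patience, factor and min_delta interact in the plateau callback can be illustrated with a simplified pure-R simulation. This is a sketch of the general idea (no cooldown, no lower bound on the rate), not the keras implementation; simulate_reduce_lr() and the validation losses are made up for the example.

```r
# Hypothetical helper: replay a sequence of validation losses and record the
# learning rate in effect after each epoch.
simulate_reduce_lr <- function(val_loss, lr = 0.0004, factor = 0.5,
                               patience = 4, min_delta = 1e-4) {
  best <- Inf
  wait <- 0L
  lr_history <- numeric(length(val_loss))

  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best - min_delta) {
      # Improvement larger than min_delta: remember it and reset the counter
      best <- val_loss[epoch]
      wait <- 0L
    } else {
      wait <- wait + 1L
      if (wait >= patience) {
        # No improvement for `patience` epochs: shrink the learning rate
        lr <- lr * factor
        wait <- 0L
      }
    }
    lr_history[epoch] <- lr
  }
  lr_history
}

val_loss <- c(1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8)
lr_history <- simulate_reduce_lr(val_loss)
# The rate is halved after epoch 6, the fourth epoch in a row without improvement
```

With patience = 4 for the rate reduction and patience = 8 for early stopping, as in the callbacks above, the rate gets a chance to drop before training is stopped.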

8. Instead of a conclusion

A number of problems we encountered have not yet been overcome:

  • in keras there is no ready-made function for automatically searching for the optimal learning rate (an analogue of lr_finder in the fast.ai library); with some effort, it is possible to port third-party implementations to R, for example, this one;
  • as a consequence of the previous point, it was not possible to select the right learning rate when using several GPUs;
  • there is a lack of modern neural network architectures, especially those pretrained on imagenet;
  • there are no one-cycle policy and discriminative learning rates (cosine annealing was implemented at our request, thanks skydan).
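For reference, the cosine annealing schedule mentioned in the last point is a one-line formula. The sketch below is a generic textbook version with hypothetical parameter names (lr_max, lr_min, total_epochs), not the implementation that was added to keras.

```r
# Cosine annealing: the rate decays from lr_max to lr_min along half a cosine wave
cosine_annealing <- function(epoch, total_epochs, lr_max = 0.002, lr_min = 0) {
  lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total_epochs))
}

# Learning rates for epochs 0..10 of a 10-epoch cycle
schedule <- cosine_annealing(epoch = 0:10, total_epochs = 10)
```

The schedule starts at lr_max, passes through the midpoint halfway through the cycle, and reaches lr_min at the end.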

What useful things were learned from this competition:

  • On relatively low-power hardware, you can work with decent (many times the size of RAM) volumes of data without pain. The data.table package saves memory through in-place modification of tables, which avoids copying them, and when used correctly, its capabilities almost always demonstrate the highest speed among all the tools we know of for scripting languages. Storing data in a database lets you, in many cases, not think at all about the need to squeeze the entire dataset into RAM.
  • Slow functions in R can be replaced with fast ones in C++ using the Rcpp package. If you additionally use RcppThread or RcppParallel, you get cross-platform multi-threaded implementations, so there is no need to parallelize the code at the R level.
  • The Rcpp package can be used without serious knowledge of C++; the required minimum is outlined here. Header files for a number of cool C++ libraries such as xtensor are available on CRAN, that is, an infrastructure is taking shape for projects that integrate ready-made high-performance C++ code into R. An additional convenience is syntax highlighting and a static C++ code analyzer in RStudio.
  • docopt lets you run self-contained scripts with parameters. This is convenient for use on a remote server, including under docker. In RStudio, it is inconvenient to conduct many hours of experiments with training neural networks, and installing the IDE on the server itself is not always justified.
  • Docker ensures code portability and reproducibility of results between developers with different versions of the OS and libraries, as well as ease of execution on servers. You can launch the entire training pipeline with just one command.
  • Google Cloud is a budget-friendly way to experiment on expensive hardware, but you need to choose configurations carefully.
  • Measuring the speed of individual code fragments is very useful, especially when combining R and C++, and with the bench package it is also very easy.

Overall, this experience was very rewarding, and we continue to work on resolving some of the issues raised.

Source: www.habr.com
