Quick Draw Doodle Recognition: how to make R, C++ and neural networks get along

Hi, Habr!

Last fall, Kaggle hosted a competition on classifying hand-drawn pictures, Quick Draw Doodle Recognition, in which, among others, a team of R users took part: Artem Klevtsov, Philipp Upravitelev and Andrey Ogurtsov. We will not describe the competition in detail; that has already been done in a recent publication.

Medal farming did not work out this time, but we gained a lot of valuable experience, so I would like to tell the community about the most interesting and useful things for Kaggle and for everyday work. Among the topics covered: a hard life without OpenCV, JSON parsing (these examples look at integrating C++ code into R scripts or packages using Rcpp), script parameterization and dockerization of the final solution. All the code from this post, in a form suitable for running, is available in the repository.

Contents:

  1. Efficiently loading data from CSV into MonetDB
  2. Preparing batches
  3. Iterators for unloading batches from the database
  4. Choosing a model architecture
  5. Script parameterization
  6. Dockerizing the scripts
  7. Using multiple GPUs on Google Cloud
  8. Instead of a conclusion

1. Efficiently loading data from CSV into the MonetDB database

The data in this competition is provided not as ready-made images but as 340 CSV files (one file per class) containing JSONs with point coordinates. Connecting these points with lines gives a final image of 256x256 pixels. Each record also has a flag indicating whether the picture was correctly recognized by the classifier used when the dataset was collected, the two-letter code of the author's country of residence, a unique identifier, a timestamp, and a class name that matches the file name. The simplified version of the original data weighs 7.4 GB in the archive and about 20 GB after unpacking; the full data takes 240 GB after unpacking. The organizers ensured that both versions reproduce the same drawings, which makes the full version redundant. In any case, storing 50 million images as image files or as arrays was immediately deemed impractical, and we decided to merge all the CSV files from the train_simplified.zip archive into a database and then generate images of the required size "on the fly" for each batch.

The well-proven MonetDB was chosen as the DBMS, namely its implementation for R in the form of the MonetDBLite package. The package includes an embedded version of the database server and lets you start the server directly from an R session and work with it there. Creating a database and connecting to it is done with a single command:

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

We will need to create two tables: one for all the data, the other for service information about the uploaded files (useful if something goes wrong and the process has to be resumed after several files have been loaded):

Creating tables

if (!DBI::dbExistsTable(con, "doodles")) {
  DBI::dbCreateTable(
    con = con,
    name = "doodles",
    fields = c(
      "countrycode" = "char(2)",
      "drawing" = "text",
      "key_id" = "bigint",
      "recognized" = "bool",
      "timestamp" = "timestamp",
      "word" = "text"
    )
  )
}

if (!DBI::dbExistsTable(con, "upload_log")) {
  DBI::dbCreateTable(
    con = con,
    name = "upload_log",
    fields = c(
      "id" = "serial",
      "file_name" = "text UNIQUE",
      "uploaded" = "bool DEFAULT false"
    )
  )
}

The fastest way to load data into the database turned out to be copying the CSV files directly with the SQL command COPY OFFSET 2 INTO tablename FROM path USING DELIMITERS ',','\n','"' NULL AS '' BEST EFFORT, where tablename is the table name and path is the path to the file. While working with the archive, it turned out that the built-in unzip implementation in R does not work correctly with a number of files from the archive, so we used the system unzip (via the getOption("unzip") parameter).

Function for writing to the database

#' @title Extract and load files
#'
#' @description
#' Extract CSV files from a ZIP archive and load them into the database
#'
#' @param con Database connection object (class `MonetDBEmbeddedConnection`).
#' @param tablename Name of the table in the database.
#' @param zipfile Path to the ZIP archive.
#' @param filename Name of the file inside the ZIP archive.
#' @param preprocess Preprocessing function applied to the extracted file.
#'   Must take a single argument `data` (a `data.table` object).
#'
#' @return `TRUE`.
#'
upload_file <- function(con, tablename, zipfile, filename, preprocess = NULL) {
  # Check arguments
  checkmate::assert_class(con, "MonetDBEmbeddedConnection")
  checkmate::assert_string(tablename)
  checkmate::assert_string(filename)
  checkmate::assert_true(DBI::dbExistsTable(con, tablename))
  checkmate::assert_file_exists(zipfile, access = "r", extension = "zip")
  checkmate::assert_function(preprocess, args = c("data"), null.ok = TRUE)

  # Extract the file
  path <- file.path(tempdir(), filename)
  unzip(zipfile, files = filename, exdir = tempdir(), 
        junkpaths = TRUE, unzip = getOption("unzip"))
  on.exit(unlink(file.path(path)))

  # Apply the preprocessing function
  if (!is.null(preprocess)) {
    .data <- data.table::fread(file = path)
    .data <- preprocess(data = .data)
    data.table::fwrite(x = .data, file = path, append = FALSE)
    rm(.data)
  }

  # SQL query that imports the CSV (note the escaped newline and double quote)
  sql <- sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','\\n','\"' NULL AS '' BEST EFFORT",
    tablename, path
  )
  # Execute the query
  DBI::dbExecute(con, sql)

  # Record the successful upload in the service table
  DBI::dbExecute(con, sprintf("INSERT INTO upload_log(file_name, uploaded) VALUES('%s', true)",
                              filename))

  return(invisible(TRUE))
}

If you need to transform the table before writing it to the database, it is enough to pass a function that transforms the data in the preprocess argument.

Code for sequentially loading data into the database:

Writing data to the database

# List of files to load
files <- unzip(zipfile, list = TRUE)$Name

# Files to skip if some of them have already been loaded
to_skip <- DBI::dbGetQuery(con, "SELECT file_name FROM upload_log")[[1L]]
files <- setdiff(files, to_skip)

if (length(files) > 0L) {
  # Start the timer
  tictoc::tic()
  # Progress bar
  pb <- txtProgressBar(min = 0L, max = length(files), style = 3)
  for (i in seq_along(files)) {
    upload_file(con = con, tablename = "doodles", 
                zipfile = zipfile, filename = files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Stop the timer
  tictoc::toc()
}

# 526.141 sec elapsed - copying SSD->SSD
# 558.879 sec elapsed - copying USB->SSD

Data loading time may vary depending on the speed of the drive used. In our case, reading and writing within a single SSD, or from a flash drive (source file) to an SSD (database), takes less than 10 minutes.

It takes a few more seconds to create a column with an integer class label and an index column (ORDERED INDEX) with the row numbers by which observations will be sampled when creating batches:

Creating additional columns and an index

message("Generate labels")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD label_int int"))
invisible(DBI::dbExecute(con, "UPDATE doodles SET label_int = dense_rank() OVER (ORDER BY word) - 1"))

message("Generate row numbers")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD id serial"))
invisible(DBI::dbExecute(con, "CREATE ORDERED INDEX doodles_id_ord_idx ON doodles(id)"))
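
As an illustration of what the label column ends up holding, here is a minimal base R sketch that mimics dense_rank() OVER (ORDER BY word) - 1 on a toy vector. The words are made up for the example, and the names classes and label_int are only stand-ins; the actual labels are computed inside the database by the SQL above.

```r
# Toy stand-in for the `word` column (values are made up for illustration)
word <- c("cat", "apple", "cat", "zebra", "apple")

# dense_rank() numbers the distinct values 1, 2, 3, ... in sorted order
# without gaps; subtracting 1 makes the class labels zero-based
classes <- sort(unique(word), method = "radix")
label_int <- match(word, classes) - 1L

label_int
```

Zero-based labels are convenient later on, because the one-hot encoding in the batch generator uses the label directly as a column offset.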

To generate batches on the fly, we needed to achieve maximum speed when extracting random rows from the doodles table. For this we used three tricks. The first was to reduce the size of the type that stores the observation ID. In the original dataset, the type required to store the ID is bigint, but the number of observations makes it possible to fit their identifiers, equal to the ordinal number, into the int type. Search is much faster in this case. The second trick was to use ORDERED INDEX; we arrived at this decision empirically, after going through all the available options. The third was to use parameterized queries. The essence of the method is to execute the PREPARE command once and then reuse the prepared expression for a series of queries of the same type, but in practice the gain compared to a plain SELECT turned out to be within the margin of statistical error.
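
The first trick rests on simple arithmetic, sketched below in base R. The figure of 50 million observations is the approximate count mentioned earlier, and the 4-byte and 8-byte sizes are the usual widths of 32-bit and 64-bit integers, assumed here to apply to the int and bigint column types.

```r
# Approximate number of drawings in the dataset
n_obs <- 50e6

# Sequential row numbers fit into a 32-bit signed integer
fits_int <- n_obs <= .Machine$integer.max

# Raw size of the id column for each type, in bytes
bytes_int <- n_obs * 4
bytes_bigint <- n_obs * 8

c(int_mb = bytes_int / 1e6, bigint_mb = bytes_bigint / 1e6)
```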

The data loading process consumes no more than 450 MB of RAM. In other words, the described approach lets you handle datasets weighing tens of gigabytes on almost any budget hardware, including some single-board computers, which is pretty cool.

All that remains is to measure the speed of retrieving (random) data and to evaluate how it scales when sampling batches of different sizes:

Database benchmark

library(ggplot2)

set.seed(0)
# Connect to the database
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
# Total number of rows; `n` is used below but was not defined in the original
# snippet, so this query is an assumption about how it was obtained
n <- DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]]

# Function that prepares the query on the server side
prep_sql <- function(batch_size) {
  sql <- sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)",
                 paste(rep("?", batch_size), collapse = ","))
  res <- DBI::dbSendQuery(con, sql)
  return(res)
}

# Function that fetches the data
fetch_data <- function(rs, batch_size) {
  ids <- sample(seq_len(n), batch_size)
  res <- DBI::dbFetch(DBI::dbBind(rs, as.list(ids)))
  return(res)
}

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    rs <- prep_sql(batch_size)
    bench::mark(
      fetch_data(rs, batch_size),
      min_iterations = 50L
    )
  }
)
# Benchmark columns to display
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16   23.6ms  54.02ms  93.43ms     18.8        2.6s    49
# 2         32     38ms  84.83ms 151.55ms     11.4       4.29s    49
# 3         64   63.3ms 175.54ms 248.94ms     5.85       8.54s    50
# 4        128   83.2ms 341.52ms 496.24ms     3.00      16.69s    50
# 5        256  232.8ms 653.21ms 847.44ms     1.58      31.66s    50
# 6        512  784.6ms    1.41s    1.98s     0.740       1.1m    49
# 7       1024  681.7ms    2.72s    4.06s     0.377      2.16m    49

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)

2. Preparing batches

The whole batch preparation process consists of the following steps:

  1. Parsing several JSONs containing vectors of strings with point coordinates.
  2. Drawing colored lines from the point coordinates on an image of the required size (for example, 256×256 or 128×128).
  3. Converting the resulting images into a tensor.

In the competition's Python kernels, the problem was solved mainly with OpenCV. One of the simplest and most obvious analogues in R could look like this:

Implementing JSON to tensor conversion in R

r_process_json_str <- function(json, line.width = 3, 
                               color = TRUE, scale = 1) {
  # Parse JSON
  coords <- jsonlite::fromJSON(json, simplifyMatrix = FALSE)
  tmp <- tempfile()
  # Remove the temporary file when the function exits
  on.exit(unlink(tmp))
  png(filename = tmp, width = 256 * scale, height = 256 * scale, pointsize = 1)
  # Empty plot
  plot.new()
  # Plot window size
  plot.window(xlim = c(256 * scale, 0), ylim = c(256 * scale, 0))
  # Line colors
  cols <- if (color) rainbow(length(coords)) else "#000000"
  for (i in seq_along(coords)) {
    lines(x = coords[[i]][[1]] * scale, y = coords[[i]][[2]] * scale, 
          col = cols[i], lwd = line.width)
  }
  dev.off()
  # Convert the image to a 3-dimensional array
  res <- png::readPNG(tmp)
  return(res)
}

r_process_json_vector <- function(x, ...) {
  res <- lapply(x, r_process_json_str, ...)
  # Combine the 3-dimensional image arrays into a 4-dimensional tensor
  res <- do.call(abind::abind, c(res, along = 0))
  return(res)
}

Drawing is done with standard R tools and saved to a temporary PNG stored in RAM (on Linux, R's temporary directories are located in /tmp, which is mounted in RAM). The file is then read as a three-dimensional array with values ranging from 0 to 1. This matters because a more conventional BMP would be read into a raw array with hex color codes.

Let's test the result:

zip_file <- file.path("data", "train_simplified.zip")
csv_file <- "cat.csv"
unzip(zip_file, files = csv_file, exdir = tempdir(), 
      junkpaths = TRUE, unzip = getOption("unzip"))
tmp_data <- data.table::fread(file.path(tempdir(), csv_file), sep = ",", 
                              select = "drawing", nrows = 10000)
arr <- r_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))

The batch itself will be formed as follows:

res <- r_process_json_vector(tmp_data[1:4, drawing], scale = 0.5)
str(res)
 # num [1:4, 1:128, 1:128, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
 # - attr(*, "dimnames")=List of 4
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL

This implementation seemed suboptimal to us, since forming large batches takes an indecently long time, so we decided to take advantage of our colleagues' experience and use the powerful OpenCV library. At that time there was no ready-made package for R (and there is still none at the time of writing), so a minimal implementation of the required functionality was written in C++ and integrated into the R code using Rcpp.

To solve the problem, the following packages and libraries were used:

  1. OpenCV for working with images and drawing lines. We used pre-installed system libraries and header files, as well as dynamic linking.

  2. xtensor for working with multidimensional arrays and tensors. We used the header files included in the R package of the same name. The library lets you work with multidimensional arrays in both row-major and column-major order.

  3. ndjson for parsing JSON. This library is used in xtensor automatically if it is present in the project.

  4. RcppThread for organizing multithreaded processing of a vector of JSON strings. We used the header files provided by this package. Unlike the more popular RcppParallel, this package has, among other things, a built-in loop interruption mechanism.

It is worth noting that xtensor turned out to be a godsend: besides offering extensive functionality and high performance, its developers proved to be quite responsive and answered questions promptly and in detail. With their help we managed to implement the conversion of OpenCV matrices into xtensor tensors, as well as a way to combine 3-dimensional image tensors into a 4-dimensional tensor of the correct shape (the batch itself).
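
Why the layout matters can be shown with a small base R sketch. It assumes a hypothetical 2x3 image with 3 channels stored as a row-major, channel-interleaved buffer (channel varies fastest, then column, then row), which is how image matrices are commonly laid out, and turns it into an R array, which is column-major. This is only an illustration of the indexing problem, not the conversion used in the C++ code.

```r
# Hypothetical image size: 2 rows, 3 columns, 3 channels
h <- 2; w <- 3; ch <- 3

# Row-major, channel-interleaved buffer: offset = (row * w + col) * ch + channel
buf <- seq(0, h * w * ch - 1)

# R fills arrays column-major (first index varies fastest), so the buffer is
# read as (channel, column, row) and then permuted to (row, column, channel)
img <- aperm(array(buf, dim = c(ch, w, h)), perm = c(3, 2, 1))

dim(img)
# Row 2, column 3, first channel: zero-based offset (1 * 3 + 2) * 3 + 0
img[2, 3, 1]
```

Filling array(buf, dim = c(h, w, ch)) directly would not raise an error, but the pixels would end up scrambled, which is exactly the kind of mistake an explicit layout in the tensor type helps to avoid.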

Materials for learning Rcpp, xtensor and RcppThread

https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation

https://docs.opencv.org/4.0.1/d7/dbd/group__imgproc.html

https://xtensor.readthedocs.io/en/latest/

https://xtensor.readthedocs.io/en/latest/file_loading.html#loading-json-data-into-xtensor

https://cran.r-project.org/web/packages/RcppThread/vignettes/RcppThread-vignette.pdf

To compile files that use system header files and link dynamically against libraries installed on the system, we used the plugin mechanism implemented in the Rcpp package. To find paths and flags automatically, we used the popular Linux utility pkg-config.

Implementation of the Rcpp plugin for using the OpenCV library

Rcpp::registerPlugin("opencv", function() {
  # Possible package names
  pkg_config_name <- c("opencv", "opencv4")
  # Path to the pkg-config binary
  pkg_config_bin <- Sys.which("pkg-config")
  # Check that the utility is available on the system
  checkmate::assert_file_exists(pkg_config_bin, access = "x")
  # Check that an OpenCV configuration file for pkg-config exists
  check <- sapply(pkg_config_name, 
                  function(pkg) system(paste(pkg_config_bin, pkg)))
  if (all(check != 0)) {
    stop("OpenCV config for the pkg-config not found", call. = FALSE)
  }

  pkg_config_name <- pkg_config_name[check == 0]
  list(env = list(
    PKG_CXXFLAGS = system(paste(pkg_config_bin, "--cflags", pkg_config_name), 
                          intern = TRUE),
    PKG_LIBS = system(paste(pkg_config_bin, "--libs", pkg_config_name), 
                      intern = TRUE)
  ))
})

As a result of the plugin's work, the following values will be substituted during compilation:

Rcpp:::.plugins$opencv()$env

# $PKG_CXXFLAGS
# [1] "-I/usr/include/opencv"
#
# $PKG_LIBS
# [1] "-lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core"

The implementation code for parsing JSON and generating a batch to pass to the model is given under the spoiler. First, add the local project directory to the header search path (needed for ndjson):

Sys.setenv("PKG_CXXFLAGS" = paste0("-I", normalizePath(file.path("src"))))

Implementation of JSON to tensor conversion in C++

// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::plugins(opencv)]]
// [[Rcpp::depends(xtensor)]]
// [[Rcpp::depends(RcppThread)]]

#include <xtensor/xjson.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-r/rtensor.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <Rcpp.h>
#include <RcppThread.h>

// Type aliases
using RcppThread::parallelFor;
using json = nlohmann::json;
using points = xt::xtensor<double,2>;     // Point coordinates extracted from JSON
using strokes = std::vector<points>;      // Strokes: point coordinates extracted from JSON
using xtensor3d = xt::xtensor<double, 3>; // Tensor for storing an image matrix
using xtensor4d = xt::xtensor<double, 4>; // Tensor for storing a set of images
using rtensor3d = xt::rtensor<double, 3>; // Wrapper for export to R
using rtensor4d = xt::rtensor<double, 4>; // Wrapper for export to R

// Static constants
// Image size in pixels
const static int SIZE = 256;
// Line type
// See https://en.wikipedia.org/wiki/Pixel_connectivity#2-dimensional
const static int LINE_TYPE = cv::LINE_4;
// Line width in pixels
const static int LINE_WIDTH = 3;
// Resize algorithm
// https://docs.opencv.org/3.1.0/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
const static int RESIZE_TYPE = cv::INTER_LINEAR;

// Template for converting an OpenCV matrix to a tensor
template <typename T, int NCH, typename XT=xt::xtensor<T,3,xt::layout_type::column_major>>
XT to_xt(const cv::Mat_<cv::Vec<T, NCH>>& src) {
  // Shape of the target tensor
  std::vector<int> shape = {src.rows, src.cols, NCH};
  // Total number of elements in the array
  size_t size = src.total() * NCH;
  // Convert cv::Mat to xt::xtensor
  XT res = xt::adapt((T*) src.data, size, xt::no_ownership(), shape);
  return res;
}

// Convert JSON to a list of point coordinates
strokes parse_json(const std::string& x) {
  auto j = json::parse(x);
  // The parse result must be an array
  if (!j.is_array()) {
    throw std::runtime_error("'x' must be JSON array.");
  }
  strokes res;
  res.reserve(j.size());
  for (const auto& a: j) {
    // Each array element must be a 2-dimensional array
    if (!a.is_array() || a.size() != 2) {
      throw std::runtime_error("'x' must include only 2d arrays.");
    }
    // Extract the vector of points
    auto p = a.get<points>();
    res.push_back(p);
  }
  return res;
}

// Draw lines
// HSV colors
cv::Mat ocv_draw_lines(const strokes& x, bool color = true) {
  // Source matrix type
  auto stype = color ? CV_8UC3 : CV_8UC1;
  // Final matrix type
  auto dtype = color ? CV_32FC3 : CV_32FC1;
  auto bg = color ? cv::Scalar(0, 0, 255) : cv::Scalar(255);
  auto col = color ? cv::Scalar(0, 255, 220) : cv::Scalar(0);
  cv::Mat img = cv::Mat(SIZE, SIZE, stype, bg);
  // Number of lines
  size_t n = x.size();
  for (const auto& s: x) {
    // Number of points in the line
    size_t n_points = s.shape()[1];
    for (size_t i = 0; i < n_points - 1; ++i) {
      // Stroke start point
      cv::Point from(s(0, i), s(1, i));
      // Stroke end point
      cv::Point to(s(0, i + 1), s(1, i + 1));
      // Draw the line
      cv::line(img, from, to, col, LINE_WIDTH, LINE_TYPE);
    }
    if (color) {
      // Change the line color
      col[0] += 180 / n;
    }
  }
  if (color) {
    // Convert the color representation to RGB
    cv::cvtColor(img, img, cv::COLOR_HSV2RGB);
  }
  // Convert to float32 in the range [0, 1]
  img.convertTo(img, dtype, 1 / 255.0);
  return img;
}

// Process JSON and obtain a tensor with the image data
xtensor3d process(const std::string& x, double scale = 1.0, bool color = true) {
  auto p = parse_json(x);
  auto img = ocv_draw_lines(p, color);
  if (scale != 1) {
    cv::Mat out;
    cv::resize(img, out, cv::Size(), scale, scale, RESIZE_TYPE);
    cv::swap(img, out);
    out.release();
  }
  xtensor3d arr = color ? to_xt<double,3>(img) : to_xt<double,1>(img);
  return arr;
}

// [[Rcpp::export]]
rtensor3d cpp_process_json_str(const std::string& x, 
                               double scale = 1.0, 
                               bool color = true) {
  xtensor3d res = process(x, scale, color);
  return res;
}

// [[Rcpp::export]]
rtensor4d cpp_process_json_vector(const std::vector<std::string>& x, 
                                  double scale = 1.0, 
                                  bool color = false) {
  size_t n = x.size();
  size_t dim = floor(SIZE * scale);
  size_t channels = color ? 3 : 1;
  xtensor4d res({n, dim, dim, channels});
  parallelFor(0, n, [&x, &res, scale, color](int i) {
    xtensor3d tmp = process(x[i], scale, color);
    auto view = xt::view(res, i, xt::all(), xt::all(), xt::all());
    view = tmp;
  });
  return res;
}

This code should be placed in the file src/cv_xt.cpp and compiled with the command Rcpp::sourceCpp(file = "src/cv_xt.cpp", env = .GlobalEnv); nlohmann/json.hpp from the repository is also required for it to work. The code is divided into several functions:

  • to_xt - a template function for converting an image matrix (cv::Mat) into an xt::xtensor tensor;

  • parse_json - parses a JSON string, extracts the point coordinates and packs them into a vector;

  • ocv_draw_lines - draws multi-colored lines from the resulting vector of points;

  • process - combines the functions above and also adds the ability to scale the resulting image;

  • cpp_process_json_str - a wrapper over the process function that exports the result to an R object (a multidimensional array);

  • cpp_process_json_vector - a wrapper over the process function that handles a vector of strings in multithreaded mode.

To draw multi-colored lines, the HSV color model was used, followed by conversion to RGB. Let's test the result:

arr <- cpp_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))
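
The hue stepping behind the multi-colored strokes can be sketched in base R. The sketch assumes a hypothetical drawing with 4 strokes and mirrors the constants from ocv_draw_lines: saturation 255, value 220, and a hue step of 180 / n computed with integer division on OpenCV's 8-bit hue scale, where 0..179 covers the full circle. grDevices::hsv() expects all three components in [0, 1], so the values are rescaled; exact colors produced by OpenCV may differ slightly because of rounding.

```r
# Number of strokes in a hypothetical drawing
n <- 4

# Integer hue step on the 8-bit hue scale, as in `col[0] += 180 / n`
step <- 180 %/% n
hue_cv <- step * (seq_len(n) - 1)

# Convert to the [0, 1] ranges expected by grDevices::hsv()
stroke_cols <- grDevices::hsv(h = hue_cv / 180, s = 1, v = 220 / 255)
stroke_cols
```

Because the hue only depends on the stroke's position in the drawing, the color effectively encodes the order in which the strokes were made.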

Comparison of the speed of the R and C++ implementations

res_bench <- bench::mark(
  r_process_json_str(tmp_data[4, drawing], scale = 0.5),
  cpp_process_json_str(tmp_data[4, drawing], scale = 0.5),
  check = FALSE,
  min_iterations = 100
)
# Benchmark columns to display
cols <- c("expression", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   expression                min     median       max `itr/sec` total_time  n_itr
#   <chr>                <bch:tm>   <bch:tm>  <bch:tm>     <dbl>   <bch:tm>  <int>
# 1 r_process_json_str     3.49ms     3.55ms    4.47ms      273.      490ms    134
# 2 cpp_process_json_str   1.94ms     2.02ms    5.32ms      489.      497ms    243

library(ggplot2)
# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    .data <- tmp_data[sample(seq_len(.N), batch_size), drawing]
    bench::mark(
      r_process_json_vector(.data, scale = 0.5),
      cpp_process_json_vector(.data,  scale = 0.5),
      min_iterations = 50,
      check = FALSE
    )
  }
)

res_bench[, cols]

#    expression   batch_size      min   median      max `itr/sec` total_time n_itr
#    <chr>             <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
#  1 r                   16   50.61ms  53.34ms  54.82ms    19.1     471.13ms     9
#  2 cpp                 16    4.46ms   5.39ms   7.78ms   192.      474.09ms    91
#  3 r                   32   105.7ms 109.74ms 212.26ms     7.69        6.5s    50
#  4 cpp                 32    7.76ms  10.97ms  15.23ms    95.6     522.78ms    50
#  5 r                   64  211.41ms 226.18ms 332.65ms     3.85      12.99s    50
#  6 cpp                 64   25.09ms  27.34ms  32.04ms    36.0        1.39s    50
#  7 r                  128   534.5ms 627.92ms 659.08ms     1.61      31.03s    50
#  8 cpp                128   56.37ms  58.46ms  66.03ms    16.9        2.95s    50
#  9 r                  256     1.15s    1.18s    1.29s     0.851     58.78s    50
# 10 cpp                256  114.97ms 117.39ms 130.09ms     8.45       5.92s    50
# 11 r                  512     2.09s    2.15s    2.32s     0.463       1.8m    50
# 12 cpp                512  230.81ms  235.6ms 261.99ms     4.18      11.97s    50
# 13 r                 1024        4s    4.22s     4.4s     0.238       3.5m    50
# 14 cpp               1024  410.48ms 431.43ms 462.44ms     2.33      21.45s    50

ggplot(res_bench, aes(x = factor(batch_size), y = median, 
                      group =  expression, color = expression)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal() +
  scale_color_discrete(name = "", labels = c("cpp", "r")) +
  theme(legend.position = "bottom") 

As you can see, the speedup turned out to be very significant, and it is not possible to catch up with the C++ code by parallelizing the R code.

3. Iterators for unloading batches from the database

R has a well-deserved reputation as a language for processing data that fits in RAM, while Python is more associated with iterative data processing, which makes it easy and natural to implement out-of-core computations (computations using external memory). A classic example, and one relevant to the problem described here, is deep neural networks trained by gradient descent, with the gradient approximated at each step from a small portion of the observations, or mini-batch.

Deep learning frameworks written in Python have special classes that implement iterators over data: tables, images in folders, binary formats, and so on. In R we can take advantage of all the features of the Python library keras with its various backends using the package of the same name, which in turn works on top of the reticulate package. The latter deserves a separate long article; it not only lets you run Python code from R, but also lets you transfer objects between R and Python sessions, performing all the necessary type conversions automatically.

We got rid of the need to keep all the data in RAM by using MonetDBLite, and all the "neural network" work will be done by the original Python code, so we only have to write an iterator over the data, since there is nothing ready-made for this situation in either R or Python. There are essentially only two requirements for it: it must return batches in an endless loop and keep its state between iterations (the latter is implemented in R in the simplest way, using closures). Previously it was necessary to convert R arrays into numpy arrays explicitly inside the iterator, but the current version of the keras package does this itself.
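
The two requirements can be illustrated with a stripped-down closure in base R. The function make_batch_iterator and its arguments are hypothetical names chosen for this sketch; it only returns row ids and does not touch the database or render images.

```r
# Hypothetical minimal generator: returns batches of ids endlessly and keeps
# its position between calls in the enclosing environment
make_batch_iterator <- function(ids, batch_size) {
  # Number of complete batches; an incomplete tail is dropped
  n_batches <- length(ids) %/% batch_size
  stopifnot(n_batches >= 1)
  shuffled <- ids[sample.int(length(ids))]
  i <- 1
  function() {
    # Start a new epoch: reshuffle and reset the counter
    if (i > n_batches) {
      shuffled <<- ids[sample.int(length(ids))]
      i <<- 1
    }
    idx <- ((i - 1) * batch_size + 1):(i * batch_size)
    # Advance the state stored in the closure
    i <<- i + 1
    shuffled[idx]
  }
}

set.seed(1)
next_batch <- make_batch_iterator(ids = 1:10, batch_size = 4)
b1 <- next_batch()
b2 <- next_batch()
b3 <- next_batch()  # the third call starts a new epoch
```

The state (i and shuffled) lives in the environment of make_batch_iterator, so the returned function can be called with no arguments as many times as the training loop needs.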

The iterator for training and validation data turned out as follows:

Iterator for training and validation data

train_generator <- function(db_connection = con,
                            samples_index,
                            num_classes = 340,
                            batch_size = 32,
                            scale = 1,
                            color = FALSE,
                            imagenet_preproc = FALSE) {
  # Check arguments
  checkmate::assert_class(db_connection, "DBIConnection")
  checkmate::assert_integerish(samples_index)
  checkmate::assert_count(num_classes)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Shuffle so that used batch indices can be taken and removed in order
  dt <- data.table::data.table(id = sample(samples_index))
  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  # Keep only complete batches and set the key
  dt <- dt[, if (.N == batch_size) .SD, keyby = batch]
  # Initialize the counter
  i <- 1
  # Number of batches
  max_i <- dt[, max(batch)]

  # Prepare the statement used to fetch the data
  sql <- sprintf(
    "PREPARE SELECT drawing, label_int FROM doodles WHERE id IN (%s)",
    paste(rep("?", batch_size), collapse = ",")
  )
  res <- DBI::dbSendQuery(db_connection, sql)

  # Analogue of keras::to_categorical
  to_categorical <- function(x, num) {
    n <- length(x)
    m <- numeric(n * num)
    m[x * n + seq_len(n)] <- 1
    dim(m) <- c(n, num)
    return(m)
  }

  # Π—Π°ΠΌΡ‹ΠΊΠ°Π½ΠΈΠ΅
  function() {
    # НачинаСм Π½ΠΎΠ²ΡƒΡŽ эпоху
    if (i > max_i) {
      dt[, id := sample(id)]
      data.table::setkey(dt, batch)
      # БбрасываСм счётчик
      i <<- 1
      max_i <<- dt[, max(batch)]
    }

    # ID для Π²Ρ‹Π³Ρ€ΡƒΠ·ΠΊΠΈ Π΄Π°Π½Π½Ρ‹Ρ…
    batch_ind <- dt[batch == i, id]
    # Π’Ρ‹Π³Ρ€ΡƒΠ·ΠΊΠ° Π΄Π°Π½Π½Ρ‹Ρ…
    batch <- DBI::dbFetch(DBI::dbBind(res, as.list(batch_ind)), n = -1)

    # Π£Π²Π΅Π»ΠΈΡ‡ΠΈΠ²Π°Π΅ΠΌ счётчик
    i <<- i + 1

    # ΠŸΠ°Ρ€ΡΠΈΠ½Π³ JSON ΠΈ ΠΏΠΎΠ΄Π³ΠΎΡ‚ΠΎΠ²ΠΊΠ° массива
    batch_x <- cpp_process_json_vector(batch$drawing, scale = scale, color = color)
    if (imagenet_preproc) {
      # Π¨ΠΊΠ°Π»ΠΈΡ€ΠΎΠ²Π°Π½ΠΈΠ΅ c ΠΈΠ½Ρ‚Π΅Ρ€Π²Π°Π»Π° [0, 1] Π½Π° ΠΈΠ½Ρ‚Π΅Ρ€Π²Π°Π» [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }

    batch_y <- to_categorical(batch$label_int, num_classes)
    result <- list(batch_x, batch_y)
    return(result)
  }
}

The function takes as input a variable holding the database connection, the numbers of the rows to use, the number of classes, the batch size, the scale (scale = 1 corresponds to rendering images of 256x256 pixels, scale = 0.5 to 128x128 pixels), a color flag (color = FALSE renders in grayscale, while with color = TRUE each stroke is drawn in a new color) and a preprocessing flag for networks pre-trained on imagenet. The latter is needed to rescale pixel values from the interval [0, 1] to the interval [-1, 1], which was used when training the models supplied with keras.

The outer function contains the argument type checks, a data.table with randomly shuffled row numbers from samples_index and the batch numbers, the counter and the maximum number of batches, as well as the SQL expression for fetching data from the database. In addition, we defined a fast analogue of keras::to_categorical() inside it. We used almost all the data for training, leaving half a percent for validation, so the epoch size was limited by the steps_per_epoch parameter when calling keras::fit_generator(), and the condition if (i > max_i) only ever fired for the validation iterator.

In the inner function, the row indices for the next batch are retrieved, the records are fetched from the database and the batch counter is incremented, the JSON is parsed (by the function cpp_process_json_vector(), written in C++) and the arrays corresponding to the images are created. Then one-hot vectors with the class labels are created, and the arrays with pixel values and labels are combined into a list, which is the return value. To speed things up, we used index creation in data.table tables and modification by reference; without these data.table "features" it is hard to imagine working efficiently with any significant amount of data in R.
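The one-hot helper from the iterator is easier to follow on a tiny input. The sketch below repeats the same function and applies it to three labels; it assumes, as the benchmark query max(label_int) + 1 suggests, that labels are zero-based integers.

```r
# One-hot encoding via column-major linear indexing.
# Assumption: labels are zero-based integers in [0, num - 1].
to_categorical <- function(x, num) {
  n <- length(x)
  m <- numeric(n * num)
  # In an n-by-num matrix stored column by column, the element in
  # row r and column x[r] + 1 sits at linear position x[r] * n + r
  m[x * n + seq_len(n)] <- 1
  dim(m) <- c(n, num)
  m
}

labels <- c(2L, 0L, 1L)
onehot <- to_categorical(labels, num = 4)
print(onehot)
```

Filling a preallocated vector by linear index and only then setting dim avoids building the matrix row by row, which is where the speed-up over a naive loop comes from.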

The results of the speed measurements on a Core i5 laptop are as follows:

Iterator benchmark

library(Rcpp)
library(keras)
library(ggplot2)

source("utils/rcpp.R")
source("utils/keras_iterator.R")

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

ind <- seq_len(DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]])
num_classes <- DBI::dbGetQuery(con, "SELECT max(label_int) + 1 FROM doodles")[[1L]]

# Indices for the training sample
train_ind <- sample(ind, floor(length(ind) * 0.995))
# Indices for the validation sample
val_ind <- ind[-train_ind]
rm(ind)
# Scale factor
scale <- 0.5

# Run the benchmark
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    it1 <- train_generator(
      db_connection = con,
      samples_index = train_ind,
      num_classes = num_classes,
      batch_size = batch_size,
      scale = scale
    )
    bench::mark(
      it1(),
      min_iterations = 50L
    )
  }
)
# Benchmark parameters
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16     25ms  64.36ms   92.2ms     15.9       3.09s    49
# 2         32   48.4ms 118.13ms 197.24ms     8.17       5.88s    48
# 3         64   69.3ms 117.93ms 181.14ms     8.57       5.83s    50
# 4        128  157.2ms 240.74ms 503.87ms     3.85      12.71s    49
# 5        256  359.3ms 613.52ms 988.73ms     1.54       30.5s    47
# 6        512  884.7ms    1.53s    2.07s     0.674      1.11m    45
# 7       1024     2.7s    3.83s    5.47s     0.261      2.81m    44

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
    geom_point() +
    geom_line() +
    ylab("median time, s") +
    theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)

If you have enough RAM, you can seriously speed up the database by moving it into that same RAM (32 GB is enough for our task). In Linux, the /dev/shm partition is mounted by default and takes up to half of the RAM. You can allocate more by editing /etc/fstab to get an entry like tmpfs /dev/shm tmpfs defaults,size=25g 0 0. Be sure to reboot and check the result by running the command df -h.

The iterator for the test data looks much simpler, since the test dataset fits entirely in RAM:

Iterator for test data

test_generator <- function(dt,
                           batch_size = 32,
                           scale = 1,
                           color = FALSE,
                           imagenet_preproc = FALSE) {

  # Argument checks
  checkmate::assert_data_table(dt)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Assign batch numbers
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  data.table::setkey(dt, batch)
  i <- 1
  max_i <- dt[, max(batch)]

  # Closure
  function() {
    batch_x <- cpp_process_json_vector(dt[batch == i, drawing], 
                                       scale = scale, color = color)
    if (imagenet_preproc) {
      # Rescale from the interval [0, 1] to the interval [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    result <- list(batch_x)
    i <<- i + 1
    return(result)
  }
}

4. Choosing a model architecture

The first architecture we used was mobilenet v1, whose features are discussed in this post. It is included in keras as standard and, accordingly, is available in the R package of the same name. But when we tried to use it with single-channel images, a strange thing turned up: the input tensor must always have the dimension (batch, height, width, 3), that is, the number of channels cannot be changed. There is no such limitation in Python, so we rushed ahead and wrote our own implementation of this architecture, following the original article (without the dropout that is present in the keras version):

Mobilenet v1 architecture

library(keras)

top_3_categorical_accuracy <- custom_metric(
    name = "top_3_categorical_accuracy",
    metric_fn = function(y_true, y_pred) {
         metric_top_k_categorical_accuracy(y_true, y_pred, k = 3)
    }
)

layer_sep_conv_bn <- function(object, 
                              filters,
                              alpha = 1,
                              depth_multiplier = 1,
                              strides = c(2, 2)) {

  # NB! depth_multiplier !=  resolution multiplier
  # https://github.com/keras-team/keras/issues/10349

  layer_depthwise_conv_2d(
    object = object,
    kernel_size = c(3, 3), 
    strides = strides,
    padding = "same",
    depth_multiplier = depth_multiplier
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() %>%
  layer_conv_2d(
    filters = filters * alpha,
    kernel_size = c(1, 1), 
    strides = c(1, 1)
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() 
}

get_mobilenet_v1 <- function(input_shape = c(224, 224, 1),
                             num_classes = 340,
                             alpha = 1,
                             depth_multiplier = 1,
                             optimizer = optimizer_adam(lr = 0.002),
                             loss = "categorical_crossentropy",
                             metrics = c("categorical_crossentropy",
                                         top_3_categorical_accuracy)) {

  inputs <- layer_input(shape = input_shape)

  outputs <- inputs %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), strides = c(2, 2), padding = "same") %>%
    layer_batch_normalization() %>% 
    layer_activation_relu() %>%
    layer_sep_conv_bn(filters = 64, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(1, 1)) %>%
    layer_global_average_pooling_2d() %>%
    layer_dense(units = num_classes) %>%
    layer_activation_softmax()

  model <- keras_model(
    inputs = inputs,
    outputs = outputs
  )

  model %>% compile(
    optimizer = optimizer,
    loss = loss,
    metrics = metrics
  )

  return(model)
}

The disadvantages of this approach are obvious. We want to test many models, and we do not want to rewrite every architecture by hand. We were also deprived of the opportunity to use the weights of models pre-trained on imagenet. As usual, studying the documentation helped. The function get_config() returns a description of the model in a form suitable for editing (base_model_conf$layers is an ordinary R list), and the function from_config() performs the reverse conversion into a model object:

base_model_conf <- get_config(base_model)
base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
base_model <- from_config(base_model_conf)

Now it is not difficult to write a universal function for obtaining any of the models supplied with keras, with or without weights trained on imagenet:

Function for loading ready-made architectures

get_model <- function(name = "mobilenet_v2",
                      input_shape = NULL,
                      weights = "imagenet",
                      pooling = "avg",
                      num_classes = NULL,
                      optimizer = keras::optimizer_adam(lr = 0.002),
                      loss = "categorical_crossentropy",
                      metrics = NULL,
                      color = TRUE,
                      compile = FALSE) {
  # Argument checks
  checkmate::assert_string(name)
  checkmate::assert_integerish(input_shape, lower = 1, upper = 256, len = 3)
  checkmate::assert_count(num_classes)
  checkmate::assert_flag(color)
  checkmate::assert_flag(compile)

  # Get the object from the keras package
  model_fun <- get0(paste0("application_", name), envir = asNamespace("keras"))
  # Check that the object exists in the package
  if (is.null(model_fun)) {
    stop("Model ", shQuote(name), " not found.", call. = FALSE)
  }

  base_model <- model_fun(
    input_shape = input_shape,
    include_top = FALSE,
    weights = weights,
    pooling = pooling
  )

  # If the image is not in color, change the input dimension
  if (!color) {
    base_model_conf <- keras::get_config(base_model)
    base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
    base_model <- keras::from_config(base_model_conf)
  }

  predictions <- keras::get_layer(base_model, "global_average_pooling2d_1")$output
  predictions <- keras::layer_dense(predictions, units = num_classes, activation = "softmax")
  model <- keras::keras_model(
    inputs = base_model$input,
    outputs = predictions
  )

  if (compile) {
    keras::compile(
      object = model,
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )
  }

  return(model)
}

When single-channel images are used, no pre-trained weights are used. This could be fixed: use the function get_weights() to obtain the model weights as a list of R arrays, change the dimension of the first element of this list (by taking one color channel or by averaging all three), and then load the weights back into the model with the function set_weights(). We never added this functionality, because by that stage it was already clear that working with color images was more productive.
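A rough sketch of the array manipulation this would involve is shown below. It is not code we ran in the competition. The random array is a hypothetical stand-in for the first element returned by get_weights(), and the kernel layout (height, width, input channels, filters) is an assumption based on the TensorFlow backend.

```r
# Hypothetical stand-in for the first element of get_weights(model):
# a convolution kernel laid out as (height, width, in_channels, filters).
set.seed(1)
kernel_rgb <- array(rnorm(3 * 3 * 3 * 32), dim = c(3, 3, 3, 32))

# Average over the channel axis (the third one); apply() drops that axis
kernel_gray <- apply(kernel_rgb, c(1, 2, 4), mean)
# Restore the channel axis with length 1 so the kernel fits a 1-channel input
dim(kernel_gray) <- c(3, 3, 1, 32)

# Alternative mentioned in the text: keep a single color channel instead
kernel_one <- kernel_rgb[, , 1, , drop = FALSE]

# With a real model the modified list would then be loaded back, e.g.
# weights[[1]] <- kernel_gray; keras::set_weights(model, weights)
```

Both variants produce an array of shape (3, 3, 1, 32); which of them preserves more of the pre-trained behaviour would have to be checked experimentally.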

We carried out most of the experiments with mobilenet versions 1 and 2, as well as resnet34. More modern architectures such as SE-ResNeXt performed well in this competition. Unfortunately, we did not have ready-made implementations at our disposal, and we did not write our own (but we definitely will).

5. Script parameterization

For convenience, all the code for launching training was designed as a single script, parameterized with docopt as follows:

doc <- '
Usage:
  train_nn.R --help
  train_nn.R --list-models
  train_nn.R [options]

Options:
  -h --help                   Show this message.
  -l --list-models            List available models.
  -m --model=<model>          Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>      Batch size [default: 32].
  -s --scale-factor=<ratio>   Scale factor [default: 0.5].
  -c --color                  Use color lines [default: FALSE].
  -d --db-dir=<path>          Path to database directory [default: Sys.getenv("db_dir")].
  -r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
  -n --n-gpu=<number>         Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)

The docopt package is an implementation of http://docopt.org/ for R. With its help, scripts are launched with simple commands such as Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db or ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db, if the file train_nn.R is executable (this command starts training the resnet50 model on three-color images of 128x128 pixels; the database must be located in the folder /home/andrey/doodle_db). You can add the learning rate, the optimizer type and any other configurable parameters to the list. While preparing this publication, it turned out that the mobilenet_v2 architecture from the current version of keras cannot be used from R because of changes not yet accounted for in the R package; we are waiting for a fix.
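Option values parsed from the command line arrive as character strings, so they have to be converted before use. The sketch below shows one way this could look. The list is a hypothetical stand-in for the result of docopt::docopt(doc), its element names are assumptions, and this is not the actual content of train_nn.R.

```r
# Hypothetical stand-in for the list returned by docopt::docopt(doc):
# option values are character strings, flags are logicals.
args <- list(
  model = "resnet50",
  batch_size = "32",
  scale_factor = "0.5",
  color = TRUE,
  n_gpu = "1"
)

# Convert strings to the types expected by the training code
model_name <- args$model
batch_size <- as.integer(args$batch_size)
scale <- as.numeric(args$scale_factor)
n_gpu <- as.integer(args$n_gpu)

# Derived values: 3 channels for color strokes, 1 for grayscale;
# scale = 1 corresponds to 256x256 pixels
channels <- if (isTRUE(args$color)) 3L else 1L
dim_size <- as.integer(256 * scale)
input_shape <- c(dim_size, dim_size, channels)

# Fail early on nonsensical values
stopifnot(batch_size > 0L, scale > 0, n_gpu >= 1L)
```

Doing the conversion and validation once, right after parsing, keeps the rest of the script free of as.integer() calls scattered around.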

This approach made it possible to significantly speed up experiments with different models compared with the more traditional launching of scripts in RStudio (we note the tfruns package as a possible alternative). But the main advantage is the ability to easily manage script launches in Docker or simply on a server, without installing RStudio for this.

6. Dockerization of scripts

We used Docker to ensure portability of the model training environment between team members and for rapid deployment in the cloud. You can start getting acquainted with this tool, which is relatively unusual for an R programmer, with this series of publications or this video course.

Docker lets you both create your own images from scratch and use other images as a base for your own. When analyzing the available options, we came to the conclusion that installing the NVIDIA drivers, CUDA+cuDNN and the Python libraries makes up a fairly large part of the image, so we decided to take the official image tensorflow/tensorflow:1.12.0-gpu as a base and add the necessary R packages to it.

The final Dockerfile looked like this:

Dockerfile

FROM tensorflow/tensorflow:1.12.0-gpu

MAINTAINER Artem Klevtsov <[email protected]>

SHELL ["/bin/bash", "-c"]

ARG LOCALE="en_US.UTF-8"
ARG APT_PKG="libopencv-dev r-base r-base-dev littler"
ARG R_BIN_PKG="futile.logger checkmate data.table rcpp rapidjsonr dbi keras jsonlite curl digest remotes"
ARG R_SRC_PKG="xtensor RcppThread docopt MonetDBLite"
ARG PY_PIP_PKG="keras"
ARG DIRS="/db /app /app/data /app/models /app/logs"

RUN source /etc/os-release && \
    echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${UBUNTU_CODENAME}-cran35/" > /etc/apt/sources.list.d/cran35.list && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 && \
    add-apt-repository -y ppa:marutter/c2d4u3.5 && \
    add-apt-repository -y ppa:timsc/opencv-3.4 && \
    apt-get update && \
    apt-get install -y locales && \
    locale-gen ${LOCALE} && \
    apt-get install -y --no-install-recommends ${APT_PKG} && \
    ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r && \
    ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
    ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r && \
    echo 'options(Ncpus = parallel::detectCores())' >> /etc/R/Rprofile.site && \
    echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site && \
    apt-get install -y $(printf "r-cran-%s " ${R_BIN_PKG}) && \
    install.r ${R_SRC_PKG} && \
    pip install ${PY_PIP_PKG} && \
    mkdir -p ${DIRS} && \
    chmod 777 ${DIRS} && \
    rm -rf /tmp/downloaded_packages/ /tmp/*.rds && \
    rm -rf /var/lib/apt/lists/*

COPY utils /app/utils
COPY src /app/src
COPY tests /app/tests
COPY bin/*.R /app/

ENV DBDIR="/db"
ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/app:${PATH}"

WORKDIR /app

VOLUME /db
VOLUME /app

CMD bash

For convenience, the packages used were put into variables; the bulk of the written scripts is copied into the container during the build. We also changed the command shell to /bin/bash so that the contents of /etc/os-release are easy to use. This avoided the need to specify the OS version in the code.

In addition, a small bash script was written that lets you launch a container with various commands. For example, these could be the scripts for training neural networks that were previously placed inside the container, or a command shell for debugging and monitoring the container:

Script to launch the container

#!/bin/sh

DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"

if [ -z "$1" ]; then
    CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
    ARGS="${ARGS} -ti"
else
    CMD="Rscript /app/train_nn.R $@"
fi

docker run ${ARGS} doodles-tf ${CMD}

If this bash script is run without parameters, the script train_nn.R is called inside the container with default values; if the first positional argument is "bash", the container starts interactively with a command shell. In all other cases the values of the positional arguments are substituted: CMD="Rscript /app/train_nn.R $@".

It is worth noting that the directories with the source data and the database, as well as the directory for saving trained models, are mounted inside the container from the host system, which gives you access to the results of the scripts without any extra manipulation.

7. Using multiple GPUs on Google Cloud

One of the features of the competition was very noisy data (see the title picture, borrowed from @Leigh.plt from ODS slack). Large batches help to combat this, and after experiments on a PC with 1 GPU we decided to master training models on several GPUs in the cloud. We used GoogleCloud (good guide to the basics) because of the large selection of available configurations, reasonable prices and the $300 bonus. Out of greed, I ordered a 4xV100 instance with an SSD and a ton of RAM, and that was a big mistake. Such a machine eats money quickly; you can go broke experimenting without a proven pipeline. For learning purposes it is better to take a K80. But the large amount of RAM came in handy: the cloud SSD did not impress with its performance, so the database was moved to /dev/shm.

Of greatest interest is the code fragment responsible for using multiple GPUs. First, the model is created on the CPU using a context manager, just like in Python:

with(tensorflow::tf$device("/cpu:0"), {
  model_cpu <- get_model(
    name = model_name,
    input_shape = input_shape,
    weights = weights,
    metrics = c(top_3_categorical_accuracy),
    compile = FALSE
  )
})

Then the uncompiled (this is important) model is copied to the given number of available GPUs, and only after that is it compiled:

model <- keras::multi_gpu_model(model_cpu, gpus = n_gpu)
keras::compile(
  object = model,
  optimizer = keras::optimizer_adam(lr = 0.0004),
  loss = "categorical_crossentropy",
  metrics = c(top_3_categorical_accuracy)
)

The classic technique of freezing all layers except the last one, training the last layer, then unfreezing and fine-tuning the whole model could not be implemented for several GPUs.

Training was monitored without tensorboard; we limited ourselves to recording logs and saving models with informative names after each epoch:

Callbacks

# Log file name template
log_file_tmpl <- file.path("logs", sprintf(
  "%s_%d_%dch_%s.csv",
  model_name,
  dim_size,
  channels,
  format(Sys.time(), "%Y%m%d%H%M%OS")
))
# Model file name template
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name,
  dim_size,
  channels
))

callbacks_list <- list(
  keras::callback_csv_logger(
    filename = log_file_tmpl
  ),
  keras::callback_early_stopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 8,
    verbose = 1,
    mode = "min"
  ),
  keras::callback_reduce_lr_on_plateau(
    monitor = "val_loss",
    factor = 0.5, # halve the learning rate
    patience = 4,
    verbose = 1,
    min_delta = 1e-4,
    mode = "min"
  ),
  keras::callback_model_checkpoint(
    filepath = model_file_tmpl,
    monitor = "val_loss",
    save_best_only = FALSE,
    save_weights_only = FALSE,
    mode = "min"
  )
)

8. Instead of a conclusion

A number of problems we encountered remain unsolved:

  • keras has no ready-made function for automatically searching for the optimal learning rate (an analogue of lr_finder in the fast.ai library); with some effort it is possible to port third-party implementations to R, for example, this one;
  • as a consequence of the previous point, it was not possible to select the correct learning rate when using several GPUs;
  • there is a lack of modern neural network architectures, especially ones pre-trained on imagenet;
  • there is no one cycle policy and no discriminative learning rates (cosine annealing was implemented at our request, thanks skydan).

What useful things were learned from this competition:

  • On relatively low-powered hardware you can work with decent (many times the size of RAM) volumes of data without pain. The data.table package saves memory through in-place modification of tables, which avoids copying them, and when used correctly its capabilities almost always show the highest speed among all the tools for scripting languages known to us. Storing data in a database lets you, in many cases, not think at all about squeezing the entire dataset into RAM.
  • Slow functions in R can be replaced with fast ones in C++ using the Rcpp package. If you additionally use RcppThread or RcppParallel, you get cross-platform multi-threaded implementations, so there is no need to parallelize the code at the R level.
  • The Rcpp package can be used without serious knowledge of C++; the required minimum is outlined here. Header files for a number of great C++ libraries such as xtensor are available on CRAN, that is, an infrastructure is taking shape for projects that integrate ready-made high-performance C++ code into R. Additional conveniences are syntax highlighting and a static C++ code analyzer in RStudio.
  • docopt lets you run self-contained scripts with parameters. This is convenient on a remote server, including under Docker. In RStudio it is inconvenient to run many hours of experiments with training neural networks, and installing the IDE on the server itself is not always justified.
  • Docker ensures code portability and reproducibility of results between developers with different versions of the OS and libraries, as well as ease of execution on servers. You can launch the entire training pipeline with just one command.
  • Google Cloud is a budget-friendly way to experiment on expensive hardware, but you need to choose configurations carefully.
  • Measuring the speed of individual code fragments is very useful, especially when combining R and C++, and with the bench package it is also very easy.

Overall, this experience was very rewarding, and we continue to work on resolving some of the issues raised.

Source: www.habr.com
