Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

Ndewo, Habr!

N'oge mgbụsị akwụkwọ gara aga, Kaggle kwadoro asọmpi iji kewaa foto ndị ejiri aka, Ndenye Draw Doodle, nke, n'etiti ndị ọzọ, otu ndị ọkà mmụta sayensị R sonyere: Artem Klevtsova, Philippa Manager и Andrey Ogurtsov. Anyị agaghị akọwa asọmpi ahụ n'ụzọ zuru ezu; nke emelarị na mbipụta na-adịbeghị anya.

N'oge a, ọ naghị arụ ọrụ na ọrụ ugbo nrite, ma enwere ọtụtụ ahụmahụ bara uru, n'ihi ya, m ga-achọ ịgwa ndị obodo banyere ọtụtụ ihe na-adọrọ mmasị ma bara uru na Kagle na ọrụ kwa ụbọchị. N'ime isiokwu ndị a tụlere: ndụ siri ike na-enweghị Mepee, JSON parsing (ihe atụ ndị a na-enyocha ntinye nke koodu C ++ n'ime edemede ma ọ bụ ngwugwu na R na-eji Rcpp), parameterization nke scripts na dockerization nke ikpeazụ ngwọta. Koodu niile sitere na ozi n'ụdị dabara adaba maka ogbugbu dị na ebe nchekwa.

Ọdịnaya:

  1. Budata data sitere na CSV nke ọma n'ime MonetDB
  2. Na-akwado batches
  3. Ndị na-emegharị ihe maka nbudata batches na nchekwa data
  4. Ịhọrọ ihe owuwu ihe nlereanya
  5. Nhazi nke edemede
  6. Dockerization nke scripts
  7. Iji ọtụtụ GPU na Google Cloud
  8. Kama nkwubi okwu

1. Budata data sitere na CSV nke ọma n'ime nchekwa data MonetDB

A na-enye data dị na asọmpi a ọ bụghị n'ụdị onyonyo emebere, kama n'ụdị faịlụ 340 CSV (otu faịlụ maka klaasị ọ bụla) nwere JSON nwere nhazi isi. Site na iji ahịrị jikọọ isi ihe ndị a, anyị ga-enweta onyonyo ikpeazụ na-atụ 256x256 pikselụ. Ọzọkwa maka ndekọ nke ọ bụla, e nwere akara na-egosi ma ọ̀ ghọtara foto ahụ nke ọma site n'aka onye nhazi oge a na-anakọta dataset, koodu akwụkwọ ozi abụọ nke obodo obibi nke onye dere foto a, ihe nchọpụta pụrụ iche, akara timestamp. na aha klaasị dabara na aha faịlụ. Ụdị dị mfe nke data mbụ na-atụ 7.4 GB na ebe nchekwa yana ihe dịka 20 GB ka ewepụsịrị ya, data zuru ezu mgbe ebupụchara na-ewe 240 GB. Ndị nhazi ahụ hụkwara na nsụgharị abụọ ahụ megharịrị otu eserese ahụ, nke pụtara na ụdị zuru ezu adịghị arụ ọrụ. N'ọnọdụ ọ bụla, ịchekwa ihe oyiyi nde 50 na faịlụ eserese ma ọ bụ n'ụdị nhazi ka a na-ewere ozugbo enweghị uru, anyị kpebiri ijikọ faịlụ CSV niile na ebe nchekwa. train_simplified.zip banye na nchekwa data nwere ọgbọ na-esote onyonyo nke nha achọrọ "na ofufe" maka ogbe ọ bụla.

A họọrọ usoro egosipụtara nke ọma dị ka DBMS MonetDB, ya bụ mmejuputa iwu maka R dị ka ngwugwu MonetDLite. Ngwungwu ahụ gụnyere ụdị agbakwunyere nke ihe nkesa nchekwa data na-enye gị ohere iburu ihe nkesa ozugbo site na nnọkọ R wee rụọ ọrụ na ya n'ebe ahụ. Ịmepụta nchekwa data na ijikọ ya na-eji otu iwu:

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

Anyị ga-achọ ịmepụta tebụl abụọ: otu maka data niile, nke ọzọ maka ozi ọrụ gbasara faịlụ ebudatara (ọ bara uru ma ọ bụrụ na ihe na-ezighị ezi na usoro ahụ ga-amaliteghachi mgbe nbudata ọtụtụ faịlụ):

Ịmepụta tebụl

if (!DBI::dbExistsTable(con, "doodles")) {
  DBI::dbCreateTable(
    con = con,
    name = "doodles",
    fields = c(
      "countrycode" = "char(2)",
      "drawing" = "text",
      "key_id" = "bigint",
      "recognized" = "bool",
      "timestamp" = "timestamp",
      "word" = "text"
    )
  )
}

if (!DBI::dbExistsTable(con, "upload_log")) {
  DBI::dbCreateTable(
    con = con,
    name = "upload_log",
    fields = c(
      "id" = "serial",
      "file_name" = "text UNIQUE",
      "uploaded" = "bool DEFAULT false"
    )
  )
}

Ụzọ kachasị ọsọ iji tinye data n'ime nchekwa data bụ iji SQL-iwu detuo faịlụ CSV ozugbo COPY OFFSET 2 INTO tablename FROM path USING DELIMITERS ',','n','"' NULL AS '' BEST EFFORTebe tablename - tebụl aha na path - ụzọ na faịlụ. Mgbe a na-arụ ọrụ na ebe nchekwa ahụ, a chọpụtara na mmejuputa a rụrụ arụ unzip na R anaghị arụ ọrụ n'ụzọ ziri ezi na ọtụtụ faịlụ sitere na ebe a na-edebe ihe ochie, yabụ anyị jiri usoro ahụ unzip (iji parameter getOption("unzip")).

Ọrụ maka ide na nchekwa data

#' @title Извлечение и загрузка файлов
#'
#' @description
#' Извлечение CSV-файлов из ZIP-архива и загрузка их в базу данных
#'
#' @param con Объект подключения к базе данных (класс `MonetDBEmbeddedConnection`).
#' @param tablename Название таблицы в базе данных.
#' @oaram zipfile Путь к ZIP-архиву.
#' @oaram filename Имя файла внури ZIP-архива.
#' @param preprocess Функция предобработки, которая будет применена извлечённому файлу.
#'   Должна принимать один аргумент `data` (объект `data.table`).
#'
#' @return `TRUE`.
#'
upload_file <- function(con, tablename, zipfile, filename, preprocess = NULL) {
  # Проверка аргументов
  checkmate::assert_class(con, "MonetDBEmbeddedConnection")
  checkmate::assert_string(tablename)
  checkmate::assert_string(filename)
  checkmate::assert_true(DBI::dbExistsTable(con, tablename))
  checkmate::assert_file_exists(zipfile, access = "r", extension = "zip")
  checkmate::assert_function(preprocess, args = c("data"), null.ok = TRUE)

  # Извлечение файла
  path <- file.path(tempdir(), filename)
  unzip(zipfile, files = filename, exdir = tempdir(), 
        junkpaths = TRUE, unzip = getOption("unzip"))
  on.exit(unlink(file.path(path)))

  # Применяем функция предобработки
  if (!is.null(preprocess)) {
    .data <- data.table::fread(file = path)
    .data <- preprocess(data = .data)
    data.table::fwrite(x = .data, file = path, append = FALSE)
    rm(.data)
  }

  # Запрос к БД на импорт CSV
  sql <- sprintf(
    "COPY OFFSET 2 INTO %s FROM '%s' USING DELIMITERS ',','n','"' NULL AS '' BEST EFFORT",
    tablename, path
  )
  # Выполнение запроса к БД
  DBI::dbExecute(con, sql)

  # Добавление записи об успешной загрузке в служебную таблицу
  DBI::dbExecute(con, sprintf("INSERT INTO upload_log(file_name, uploaded) VALUES('%s', true)",
                              filename))

  return(invisible(TRUE))
}

Ọ bụrụ na ịchọrọ ịgbanwe tebụl tupu ịde ya na nchekwa data, ọ ga-ezuru ịgafe na arụmụka ahụ preprocess ọrụ nke ga-agbanwe data.

Koodu maka ibunye data n'usoro n'ime nchekwa data:

Na-ede data na nchekwa data

# Список файлов для записи
files <- unzip(zipfile, list = TRUE)$Name

# Список исключений, если часть файлов уже была загружена
to_skip <- DBI::dbGetQuery(con, "SELECT file_name FROM upload_log")[[1L]]
files <- setdiff(files, to_skip)

if (length(files) > 0L) {
  # Запускаем таймер
  tictoc::tic()
  # Прогресс бар
  pb <- txtProgressBar(min = 0L, max = length(files), style = 3)
  for (i in seq_along(files)) {
    upload_file(con = con, tablename = "doodles", 
                zipfile = zipfile, filename = files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Останавливаем таймер
  tictoc::toc()
}

# 526.141 sec elapsed - копирование SSD->SSD
# 558.879 sec elapsed - копирование USB->SSD

Oge ntinye data nwere ike ịdịgasị iche dabere na njirimara ọsọ nke draịva eji. N'ọnọdụ anyị, ịgụ na ide ihe n'ime otu SSD ma ọ bụ site na draịva flash (faịlụ isi mmalite) gaa na SSD (DB) na-ewe ihe na-erughị nkeji iri.

Ọ na-ewe sekọnd ole na ole ọzọ iji mepụta kọlụm nwere akara klaasị integer na kọlụm ndeksi (ORDERED INDEX) nwere nọmba ahịrị nke a ga-esi na-enyocha nleba anya mgbe ị na-eke batches:

Ịmepụta kọlụm ndị ọzọ na ndenye aha

message("Generate lables")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD label_int int"))
invisible(DBI::dbExecute(con, "UPDATE doodles SET label_int = dense_rank() OVER (ORDER BY word) - 1"))

message("Generate row numbers")
invisible(DBI::dbExecute(con, "ALTER TABLE doodles ADD id serial"))
invisible(DBI::dbExecute(con, "CREATE ORDERED INDEX doodles_id_ord_idx ON doodles(id)"))

Iji dozie nsogbu nke ịmepụta batch na ofufe, anyị kwesịrị iji nweta oke ọsọ nke iwepụta ahịrị ndị na-enweghị usoro na tebụl. doodles. Maka nke a, anyị na-eji aghụghọ atọ. Nke mbụ bụ ibelata akụkụ nke ụdị na-echekwa ID nleba anya. Na ntọala data izizi, ụdị achọrọ iji chekwaa ID bụ bigint, ma ọnụ ọgụgụ nke nleba anya na-eme ka o kwe omume ịkwado ihe nchọpụta ha, hà nhata nọmba ordinal, n'ime ụdị. int. Nchọta dị ngwa ngwa na nke a. Aghụghọ nke abụọ bụ iji ORDERED INDEX - anyị bịara mkpebi a n'ụzọ doro anya, ebe anyị gafeworo ihe niile dịnụ nhọrọ. Nke atọ bụ iji ajụjụ ndị a na-atụgharị anya. Isi ihe nke usoro a bụ ime iwu otu ugboro PREPARE site n'iji okwu akwadoro eme ihe mgbe ị na-emepụta ụyọkọ ajụjụ nke otu ụdị, ma n'ezie enwere uru ma e jiri ya tụnyere nke dị mfe. SELECT tụgharịrị bụrụ n'ime oke njehie ndekọ ọnụ ọgụgụ.

Usoro nke bulite data anaghị eri ihe karịrị 450 MB nke RAM. Ya bụ, ụzọ akọwara na-enye gị ohere ịkwaga datasets na-atụ iri gigabytes na ihe fọrọ nke nta ka ọ bụrụ ngwaike mmefu ego ọ bụla, gụnyere ụfọdụ ngwaọrụ otu bọọdụ, nke mara mma.

Naanị ihe fọdụrụ bụ ịlele ọsọ nke iweghachite data (random) wee nyochaa nha nha mgbe ị na-elele batches nke nha dị iche iche:

Akara nchekwa data

library(ggplot2)

set.seed(0)
# Подключение к базе данных
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

# Функция для подготовки запроса на стороне сервера
prep_sql <- function(batch_size) {
  sql <- sprintf("PREPARE SELECT id FROM doodles WHERE id IN (%s)",
                 paste(rep("?", batch_size), collapse = ","))
  res <- DBI::dbSendQuery(con, sql)
  return(res)
}

# Функция для извлечения данных
fetch_data <- function(rs, batch_size) {
  ids <- sample(seq_len(n), batch_size)
  res <- DBI::dbFetch(DBI::dbBind(rs, as.list(ids)))
  return(res)
}

# Проведение замера
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    rs <- prep_sql(batch_size)
    bench::mark(
      fetch_data(rs, batch_size),
      min_iterations = 50L
    )
  }
)
# Параметры бенчмарка
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16   23.6ms  54.02ms  93.43ms     18.8        2.6s    49
# 2         32     38ms  84.83ms 151.55ms     11.4       4.29s    49
# 3         64   63.3ms 175.54ms 248.94ms     5.85       8.54s    50
# 4        128   83.2ms 341.52ms 496.24ms     3.00      16.69s    50
# 5        256  232.8ms 653.21ms 847.44ms     1.58      31.66s    50
# 6        512  784.6ms    1.41s    1.98s     0.740       1.1m    49
# 7       1024  681.7ms    2.72s    4.06s     0.377      2.16m    49

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

2. Na-akwadebe batches

Usoro nkwadebe batch dum nwere usoro ndị a:

  1. Ịtụle ọtụtụ JSON nwere vectors nke eriri nwere nhazi isi ihe.
  2. Ịdọrọ ahịrị agba dabere na nhazi isi ihe na onyonyo nke nha achọrọ (dịka ọmụmaatụ, 256 × 256 ma ọ bụ 128 × 128).
  3. Na-atụgharị onyonyo ndị a ka ọ bụrụ tensor.

Dịka akụkụ nke asọmpi n'etiti kernel Python, a na-edozi nsogbu ahụ site na iji Mepee. Otu n'ime analogues kachasị mfe na nke doro anya na R ga-adị ka nke a:

Na-eme mgbanwe JSON ka ọ bụrụ Tensor na R

r_process_json_str <- function(json, line.width = 3, 
                               color = TRUE, scale = 1) {
  # Парсинг JSON
  coords <- jsonlite::fromJSON(json, simplifyMatrix = FALSE)
  tmp <- tempfile()
  # Удаляем временный файл по завершению функции
  on.exit(unlink(tmp))
  png(filename = tmp, width = 256 * scale, height = 256 * scale, pointsize = 1)
  # Пустой график
  plot.new()
  # Размер окна графика
  plot.window(xlim = c(256 * scale, 0), ylim = c(256 * scale, 0))
  # Цвета линий
  cols <- if (color) rainbow(length(coords)) else "#000000"
  for (i in seq_along(coords)) {
    lines(x = coords[[i]][[1]] * scale, y = coords[[i]][[2]] * scale, 
          col = cols[i], lwd = line.width)
  }
  dev.off()
  # Преобразование изображения в 3-х мерный массив
  res <- png::readPNG(tmp)
  return(res)
}

r_process_json_vector <- function(x, ...) {
  res <- lapply(x, r_process_json_str, ...)
  # Объединение 3-х мерных массивов картинок в 4-х мерный в тензор
  res <- do.call(abind::abind, c(res, along = 0))
  return(res)
}

A na-eme eserese site na iji ngwaọrụ R ọkọlọtọ wee chekwaa ya na PNG nwa oge echekwara na RAM (na Linux, akwụkwọ ndekọ R nwa oge dị na ndekọ. /tmp, etinyere na RAM). A na-agụ faịlụ a dị ka nhazi akụkụ atọ nwere ọnụọgụ sitere na 0 ruo 1. Nke a dị mkpa n'ihi na a ga-agụ BMP nke a na-emekarị ka ọ bụrụ n'usoro ihe ọkụkụ nwere koodu agba hex.

Ka anyị nwalee nsonaazụ:

zip_file <- file.path("data", "train_simplified.zip")
csv_file <- "cat.csv"
unzip(zip_file, files = csv_file, exdir = tempdir(), 
      junkpaths = TRUE, unzip = getOption("unzip"))
tmp_data <- data.table::fread(file.path(tempdir(), csv_file), sep = ",", 
                              select = "drawing", nrows = 10000)
arr <- r_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

A ga-emepụta ogbe ahụ n'onwe ya dị ka ndị a:

res <- r_process_json_vector(tmp_data[1:4, drawing], scale = 0.5)
str(res)
 # num [1:4, 1:128, 1:128, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
 # - attr(*, "dimnames")=List of 4
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL
 #  ..$ : NULL

Nke a mmejuputa atumatu yiri suboptimal n'anya anyị, ebe ọ bụ na guzobe nke nnukwu batches na-ewe ogologo oge na-ekwesịghị ekwesị, na anyị kpebiri iri uru nke ahụmahụ nke ndị ọrụ ibe anyị site na iji a dị ike n'ọbá akwụkwọ. Mepee. N'oge ahụ, ọ dịghị ngwugwu a kwadebere maka R (ọ dịghị ugbu a), ya mere e dere ntakịrị mmejuputa ọrụ achọrọ na C ++ na ntinye n'ime koodu R site na iji. Rcpp.

Iji dozie nsogbu ahụ, ejiri ngwugwu na ọba akwụkwọ ndị a:

  1. Mepee maka ịrụ ọrụ na onyonyo na eserese eserese. A na-eji ọba akwụkwọ sistemụ arụnyere mbụ na faịlụ nkụnye eji isi mee, yana njikọ dị ike.

  2. xtensor maka ịrụ ọrụ na multidimensional arrays na tenors. Anyị ji faịlụ nkụnye eji isi mee gụnyere na ngwugwu R nke otu aha ahụ. Ọbá akwụkwọ ahụ na-enye gị ohere ịrụ ọrụ n'usoro multidimensional, ma n'usoro n'usoro na kọlụm bụ isi.

  3. ndjson maka ịkọwapụta JSON. A na-eji ọba akwụkwọ a xtensor na-akpaghị aka ma ọ bụrụ na ọ dị na oru ngo.

  4. eriri Rcpp maka ịhazi nhazi multi-threaded nke vector sitere na JSON. Jiri faịlụ nkụnye eji isi mee nke ngwugwu a wetara. Site na ewu ewu RcppParallel Ngwungwu, n'etiti ihe ndị ọzọ, nwere usoro nkwụsịtụ aka arụnyere n'ime ya.

Ekwesiri ighota na xtensor tụgharịrị bụrụ ihe na-atọ ụtọ: na mgbakwunye na eziokwu na ọ nwere nnukwu ọrụ yana arụmọrụ dị elu, ndị mmepe ya tụgharịrị na-anabata nke ọma ma zaa ajụjụ ozugbo na n'ụzọ zuru ezu. Site n'enyemaka ha, ọ ga-ekwe omume iji mejuputa mgbanwe nke OpenCV matrices n'ime xtensor tenors, yana ụzọ iji jikọta ihe oyiyi 3 akụkụ n'ime tensor 4-akụkụ nke ziri ezi (ogbe ahụ n'onwe ya).

Ngwa maka mmụta Rcpp, xtensor na RcppThread

https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation

https://docs.opencv.org/4.0.1/d7/dbd/group__imgproc.html

https://xtensor.readthedocs.io/en/latest/

https://xtensor.readthedocs.io/en/latest/file_loading.html#loading-json-data-into-xtensor

https://cran.r-project.org/web/packages/RcppThread/vignettes/RcppThread-vignette.pdf

Iji chịkọta faịlụ ndị na-eji faịlụ sistemu yana njikọ siri ike na ọba akwụkwọ arụnyere na sistemụ, anyị jiri usoro ngwa mgbakwunye etinyere na ngwugwu ahụ. Rcpp. Iji chọta ụzọ na ọkọlọtọ na-akpaghị aka, anyị jiri ọrụ Linux na-ewu ewu pkg-nhazi.

Mmejuputa ngwa mgbakwunye Rcpp maka iji ọbá akwụkwọ OpenCV

Rcpp::registerPlugin("opencv", function() {
  # Возможные названия пакета
  pkg_config_name <- c("opencv", "opencv4")
  # Бинарный файл утилиты pkg-config
  pkg_config_bin <- Sys.which("pkg-config")
  # Проврека наличия утилиты в системе
  checkmate::assert_file_exists(pkg_config_bin, access = "x")
  # Проверка наличия файла настроек OpenCV для pkg-config
  check <- sapply(pkg_config_name, 
                  function(pkg) system(paste(pkg_config_bin, pkg)))
  if (all(check != 0)) {
    stop("OpenCV config for the pkg-config not found", call. = FALSE)
  }

  pkg_config_name <- pkg_config_name[check == 0]
  list(env = list(
    PKG_CXXFLAGS = system(paste(pkg_config_bin, "--cflags", pkg_config_name), 
                          intern = TRUE),
    PKG_LIBS = system(paste(pkg_config_bin, "--libs", pkg_config_name), 
                      intern = TRUE)
  ))
})

N'ihi ọrụ ngwa mgbakwunye ahụ, a ga-edochi ụkpụrụ ndị a n'oge usoro nchịkọta:

Rcpp:::.plugins$opencv()$env

# $PKG_CXXFLAGS
# [1] "-I/usr/include/opencv"
#
# $PKG_LIBS
# [1] "-lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core"

A na-enye koodu mmejuputa iwu maka ịkọwa JSON na ịmepụta ogbe maka nnyefe na ihe nlereanya n'okpuru onye na-emebi ihe. Nke mbụ, tinye ndekọ ndekọ ọrụ mpaghara iji chọọ faịlụ nkụnye eji isi mee (chọrọ maka ndjson):

Sys.setenv("PKG_CXXFLAGS" = paste0("-I", normalizePath(file.path("src"))))

Mmejuputa JSON ka ọ bụrụ tensor na C++

// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::plugins(opencv)]]
// [[Rcpp::depends(xtensor)]]
// [[Rcpp::depends(RcppThread)]]

#include <xtensor/xjson.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-r/rtensor.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <Rcpp.h>
#include <RcppThread.h>

// Синонимы для типов
using RcppThread::parallelFor;
using json = nlohmann::json;
using points = xt::xtensor<double,2>;     // Извлечённые из JSON координаты точек
using strokes = std::vector<points>;      // Извлечённые из JSON координаты точек
using xtensor3d = xt::xtensor<double, 3>; // Тензор для хранения матрицы изоображения
using xtensor4d = xt::xtensor<double, 4>; // Тензор для хранения множества изображений
using rtensor3d = xt::rtensor<double, 3>; // Обёртка для экспорта в R
using rtensor4d = xt::rtensor<double, 4>; // Обёртка для экспорта в R

// Статические константы
// Размер изображения в пикселях
const static int SIZE = 256;
// Тип линии
// См. https://en.wikipedia.org/wiki/Pixel_connectivity#2-dimensional
const static int LINE_TYPE = cv::LINE_4;
// Толщина линии в пикселях
const static int LINE_WIDTH = 3;
// Алгоритм ресайза
// https://docs.opencv.org/3.1.0/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
const static int RESIZE_TYPE = cv::INTER_LINEAR;

// Шаблон для конвертирования OpenCV-матрицы в тензор
template <typename T, int NCH, typename XT=xt::xtensor<T,3,xt::layout_type::column_major>>
XT to_xt(const cv::Mat_<cv::Vec<T, NCH>>& src) {
  // Размерность целевого тензора
  std::vector<int> shape = {src.rows, src.cols, NCH};
  // Общее количество элементов в массиве
  size_t size = src.total() * NCH;
  // Преобразование cv::Mat в xt::xtensor
  XT res = xt::adapt((T*) src.data, size, xt::no_ownership(), shape);
  return res;
}

// Преобразование JSON в список координат точек
strokes parse_json(const std::string& x) {
  auto j = json::parse(x);
  // Результат парсинга должен быть массивом
  if (!j.is_array()) {
    throw std::runtime_error("'x' must be JSON array.");
  }
  strokes res;
  res.reserve(j.size());
  for (const auto& a: j) {
    // Каждый элемент массива должен быть 2-мерным массивом
    if (!a.is_array() || a.size() != 2) {
      throw std::runtime_error("'x' must include only 2d arrays.");
    }
    // Извлечение вектора точек
    auto p = a.get<points>();
    res.push_back(p);
  }
  return res;
}

// Отрисовка линий
// Цвета HSV
cv::Mat ocv_draw_lines(const strokes& x, bool color = true) {
  // Исходный тип матрицы
  auto stype = color ? CV_8UC3 : CV_8UC1;
  // Итоговый тип матрицы
  auto dtype = color ? CV_32FC3 : CV_32FC1;
  auto bg = color ? cv::Scalar(0, 0, 255) : cv::Scalar(255);
  auto col = color ? cv::Scalar(0, 255, 220) : cv::Scalar(0);
  cv::Mat img = cv::Mat(SIZE, SIZE, stype, bg);
  // Количество линий
  size_t n = x.size();
  for (const auto& s: x) {
    // Количество точек в линии
    size_t n_points = s.shape()[1];
    for (size_t i = 0; i < n_points - 1; ++i) {
      // Точка начала штриха
      cv::Point from(s(0, i), s(1, i));
      // Точка окончания штриха
      cv::Point to(s(0, i + 1), s(1, i + 1));
      // Отрисовка линии
      cv::line(img, from, to, col, LINE_WIDTH, LINE_TYPE);
    }
    if (color) {
      // Меняем цвет линии
      col[0] += 180 / n;
    }
  }
  if (color) {
    // Меняем цветовое представление на RGB
    cv::cvtColor(img, img, cv::COLOR_HSV2RGB);
  }
  // Меняем формат представления на float32 с диапазоном [0, 1]
  img.convertTo(img, dtype, 1 / 255.0);
  return img;
}

// Обработка JSON и получение тензора с данными изображения
xtensor3d process(const std::string& x, double scale = 1.0, bool color = true) {
  auto p = parse_json(x);
  auto img = ocv_draw_lines(p, color);
  if (scale != 1) {
    cv::Mat out;
    cv::resize(img, out, cv::Size(), scale, scale, RESIZE_TYPE);
    cv::swap(img, out);
    out.release();
  }
  xtensor3d arr = color ? to_xt<double,3>(img) : to_xt<double,1>(img);
  return arr;
}

// [[Rcpp::export]]
rtensor3d cpp_process_json_str(const std::string& x, 
                               double scale = 1.0, 
                               bool color = true) {
  xtensor3d res = process(x, scale, color);
  return res;
}

// [[Rcpp::export]]
rtensor4d cpp_process_json_vector(const std::vector<std::string>& x, 
                                  double scale = 1.0, 
                                  bool color = false) {
  size_t n = x.size();
  size_t dim = floor(SIZE * scale);
  size_t channels = color ? 3 : 1;
  xtensor4d res({n, dim, dim, channels});
  parallelFor(0, n, [&x, &res, scale, color](int i) {
    xtensor3d tmp = process(x[i], scale, color);
    auto view = xt::view(res, i, xt::all(), xt::all(), xt::all());
    view = tmp;
  });
  return res;
}

Ekwesịrị idowe koodu a na faịlụ src/cv_xt.cpp ma chịkọta ya na iwu ahụ Rcpp::sourceCpp(file = "src/cv_xt.cpp", env = .GlobalEnv); achọrọkwa maka ọrụ nlohmann/json.hpp si ebe nchekwa. E kewara koodu ahụ n'ọtụtụ ọrụ:

  • to_xt - arụrụ arụ ọrụ maka ịgbanwe matrix onyonyo (cv::Mat) na tensor xt::xtensor;

  • parse_json - ọrụ ahụ na-atụgharị eriri JSON, na-ewepụta ihe nhazi nke isi, na-ebukọta ha na vector;

  • ocv_draw_lines - si n'ihi vector nke ihe, na-adọta multi-acha ahịrị;

  • process - na-ejikọta ọrụ ndị a dị n'elu ma na-agbakwụnyekwa ike ịmepụta ihe oyiyi ahụ;

  • cpp_process_json_str - wrapper n'elu ọrụ process, nke na-ebupụ nsonaazụ na ihe R-ihe (multidimensional array);

  • cpp_process_json_vector - wrapper n'elu ọrụ cpp_process_json_str, nke na-enye gị ohere ịhazi vector eriri n'ụdị multi-threaded.

Iji see ahịrị nwere ọtụtụ agba, a na-eji ụdị agba HSV mee ihe, ntụgharị na RGB sochiri ya. Ka anyị nwalee nsonaazụ:

arr <- cpp_process_json_str(tmp_data[4, drawing])
dim(arr)
# [1] 256 256   3
plot(magick::image_read(arr))

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara
Tụnyere ọsọ nke mmejuputa iwu na R na C++

res_bench <- bench::mark(
  r_process_json_str(tmp_data[4, drawing], scale = 0.5),
  cpp_process_json_str(tmp_data[4, drawing], scale = 0.5),
  check = FALSE,
  min_iterations = 100
)
# Параметры бенчмарка
cols <- c("expression", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   expression                min     median       max `itr/sec` total_time  n_itr
#   <chr>                <bch:tm>   <bch:tm>  <bch:tm>     <dbl>   <bch:tm>  <int>
# 1 r_process_json_str     3.49ms     3.55ms    4.47ms      273.      490ms    134
# 2 cpp_process_json_str   1.94ms     2.02ms    5.32ms      489.      497ms    243

library(ggplot2)
# Проведение замера
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    .data <- tmp_data[sample(seq_len(.N), batch_size), drawing]
    bench::mark(
      r_process_json_vector(.data, scale = 0.5),
      cpp_process_json_vector(.data,  scale = 0.5),
      min_iterations = 50,
      check = FALSE
    )
  }
)

res_bench[, cols]

#    expression   batch_size      min   median      max `itr/sec` total_time n_itr
#    <chr>             <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
#  1 r                   16   50.61ms  53.34ms  54.82ms    19.1     471.13ms     9
#  2 cpp                 16    4.46ms   5.39ms   7.78ms   192.      474.09ms    91
#  3 r                   32   105.7ms 109.74ms 212.26ms     7.69        6.5s    50
#  4 cpp                 32    7.76ms  10.97ms  15.23ms    95.6     522.78ms    50
#  5 r                   64  211.41ms 226.18ms 332.65ms     3.85      12.99s    50
#  6 cpp                 64   25.09ms  27.34ms  32.04ms    36.0        1.39s    50
#  7 r                  128   534.5ms 627.92ms 659.08ms     1.61      31.03s    50
#  8 cpp                128   56.37ms  58.46ms  66.03ms    16.9        2.95s    50
#  9 r                  256     1.15s    1.18s    1.29s     0.851     58.78s    50
# 10 cpp                256  114.97ms 117.39ms 130.09ms     8.45       5.92s    50
# 11 r                  512     2.09s    2.15s    2.32s     0.463       1.8m    50
# 12 cpp                512  230.81ms  235.6ms 261.99ms     4.18      11.97s    50
# 13 r                 1024        4s    4.22s     4.4s     0.238       3.5m    50
# 14 cpp               1024  410.48ms 431.43ms 462.44ms     2.33      21.45s    50

ggplot(res_bench, aes(x = factor(batch_size), y = median, 
                      group =  expression, color = expression)) +
  geom_point() +
  geom_line() +
  ylab("median time, s") +
  theme_minimal() +
  scale_color_discrete(name = "", labels = c("cpp", "r")) +
  theme(legend.position = "bottom") 

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

Dị ka ị na-ahụ, mmụba ọsọ ahụ bịara bụrụ ihe dị ịrịba ama, ma ọ gaghị ekwe omume ịnweta koodu C ++ site na iji koodu R.

3. Iterators maka nbudata batches na nchekwa data

R nwere aha kwesịrị ekwesị maka nhazi data dabara na RAM, ebe Eke Ọgba na-eji nhazi data eme ihe, na-enye gị ohere ịme ngwa ngwa na n'ụzọ nkịtị mejuputa mgbako na-enweghị isi (mgbakọ na-eji ebe nchekwa mpụga). Ihe atụ ama ama na nke dabara adaba maka anyị n'ọnọdụ nke nsogbu a kọwara bụ netwọkụ akwara miri emi nke a zụrụ site na usoro mgbada gradient na mkpokọta gradient na nzọụkwụ ọ bụla site na iji obere akụkụ nlele, ma ọ bụ obere ogbe.

Usoro mmụta miri emi nke edere na Python nwere klaasị pụrụ iche na-emejuputa atumatu dabere na data: tebụl, foto dị na nchekwa, usoro ọnụọgụ abụọ, wdg. Ị nwere ike iji nhọrọ emebere ma ọ bụ dee nke gị maka ọrụ ụfọdụ. Na R anyị nwere ike iji atụmatụ niile nke ọba akwụkwọ Python mee ihe keras ya na azụ azụ dị iche iche na-eji ngwugwu nke otu aha, nke na-arụ ọrụ n'elu ngwugwu ahụ edeghachi. Nke ikpeazụ kwesịrị a iche ogologo isiokwu; ọ bụghị naanị na-enye gị ohere ịme koodu Python site na R, kamakwa na-enye gị ohere ịnyefe ihe n'etiti oge R na Python, na-eme mgbanwe niile dị mkpa na-akpaghị aka.

Anyị kpochapụrụ mkpa ọ dị ịchekwa data niile na RAM site na iji MonetDLite, ọrụ niile "netwọk neural" ga-arụ ọrụ site na koodu mbụ na Python, anyị ga-edepụta iterator n'elu data, ebe ọ bụ na ọ dịghị ihe dị njikere. maka ọnọdụ dị otú ahụ na R ma ọ bụ Python. Enwere naanị ihe abụọ achọrọ maka ya: ọ ga-eweghachite batches na akaghị ngwụcha wee chekwaa ọnọdụ ya n'etiti iterations (nke ikpeazụ na R na-emejuputa n'ụzọ kachasị mfe site na iji mmechi). Na mbụ, a chọrọ ka ịgbanwee usoro R n'ụzọ doro anya ka ọ bụrụ nhazi ọnụọgụ dị n'ime iterator, mana ụdị ngwugwu dị ugbu a. keras na-eme ya n'onwe ya.

The iterator maka ọzụzụ na nkwado data tụgharịrị dị ka ndị a:

Iterator maka ọzụzụ na nkwado data

train_generator <- function(db_connection = con,
                            samples_index,
                            num_classes = 340,
                            batch_size = 32,
                            scale = 1,
                            color = FALSE,
                            imagenet_preproc = FALSE) {
  # Проверка аргументов
  checkmate::assert_class(con, "DBIConnection")
  checkmate::assert_integerish(samples_index)
  checkmate::assert_count(num_classes)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Перемешиваем, чтобы брать и удалять использованные индексы батчей по порядку
  dt <- data.table::data.table(id = sample(samples_index))
  # Проставляем номера батчей
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  # Оставляем только полные батчи и индексируем
  dt <- dt[, if (.N == batch_size) .SD, keyby = batch]
  # Устанавливаем счётчик
  i <- 1
  # Количество батчей
  max_i <- dt[, max(batch)]

  # Подготовка выражения для выгрузки
  sql <- sprintf(
    "PREPARE SELECT drawing, label_int FROM doodles WHERE id IN (%s)",
    paste(rep("?", batch_size), collapse = ",")
  )
  res <- DBI::dbSendQuery(con, sql)

  # Аналог keras::to_categorical
  to_categorical <- function(x, num) {
    n <- length(x)
    m <- numeric(n * num)
    m[x * n + seq_len(n)] <- 1
    dim(m) <- c(n, num)
    return(m)
  }

  # Замыкание
  function() {
    # Начинаем новую эпоху
    if (i > max_i) {
      dt[, id := sample(id)]
      data.table::setkey(dt, batch)
      # Сбрасываем счётчик
      i <<- 1
      max_i <<- dt[, max(batch)]
    }

    # ID для выгрузки данных
    batch_ind <- dt[batch == i, id]
    # Выгрузка данных
    batch <- DBI::dbFetch(DBI::dbBind(res, as.list(batch_ind)), n = -1)

    # Увеличиваем счётчик
    i <<- i + 1

    # Парсинг JSON и подготовка массива
    batch_x <- cpp_process_json_vector(batch$drawing, scale = scale, color = color)
    if (imagenet_preproc) {
      # Шкалирование c интервала [0, 1] на интервал [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }

    batch_y <- to_categorical(batch$label_int, num_classes)
    result <- list(batch_x, batch_y)
    return(result)
  }
}

Ọrụ ahụ na-ewe dị ka ntinye mgbanwe na njikọ na nchekwa data, ọnụọgụ nke ahịrị ejiri, ọnụ ọgụgụ klaasị, nha batch, ọnụ ọgụgụ (scale = 1 kwekọrọ na ịse foto nke 256x256 pikselụ, scale = 0.5 - 128x128 pikselụ), egosi agba (color = FALSE na-akọwapụta nsụgharị n'ụdị isi awọ mgbe ejiri ya color = TRUE A na-adọta ọrịa strok ọ bụla na agba ọhụrụ) na ihe ngosi nhazi maka netwọk ndị a zụrụ azụ na imagenet. A chọrọ nke ikpeazụ iji tụọ ụkpụrụ pikselụ site na etiti oge [0, 1] ruo na etiti [-1, 1], nke ejiri mee ihe mgbe a na-azụ ihe ndị enyere. keras ụdị.

Ọrụ mpụga nwere ụdị nlele arụmụka, tebụl data.table ya na ọnụọgụ ahịrị agwakọtara enweghị usoro si samples_index na ọnụ ọgụgụ batch, counter na ọnụ ọgụgụ kachasị elu, yana okwu SQL maka nbudata data na nchekwa data. Na mgbakwunye, anyị kọwapụtara analọgụ ngwa ngwa nke ọrụ dị n'ime keras::to_categorical(). Anyị na-eji ihe fọrọ nke nta ka ọ bụrụ data niile maka ọzụzụ, na-ahapụ ọkara pasent maka nkwado, ya mere, oke oge ahụ bụ oke site na oke. steps_per_epoch mgbe a na-akpọ ya keras::fit_generator(), na ọnọdụ if (i > max_i) Na-arụ ọrụ naanị maka onye nrụpụta nkwado.

N'ime ọrụ dị n'ime, a na-eweghachite ahịrị ahịrị maka ogbe na-esote, a na-ebupụ ihe ndekọ site na nchekwa data na ọnụ ọgụgụ batch na-abawanye, JSON parsing (ọrụ cpp_process_json_vector(), nke edere na C++) na ịmepụta usoro kwekọrọ na foto. Mgbe ahụ, a na-emepụta vectors na-ekpo ọkụ nke nwere akara klaasị, a na-ejikọta usoro nwere ụkpụrụ pixel na akara n'ime ndepụta, nke bụ uru nloghachi. Iji mee ka ọrụ dị ngwa, anyị na-eji ihe okike nke index na tebụl data.table na mgbanwe site na njikọ - na-enweghị ngwugwu "chips" ndị a. data O siri ike iche n'echiche na-arụ ọrụ nke ọma na oke data ọ bụla dị na R.

Nsonaazụ nke nha ọsọ na laptọọpụ Core i5 bụ ndị a:

Iterator benchmark

library(Rcpp)
library(keras)
library(ggplot2)

source("utils/rcpp.R")
source("utils/keras_iterator.R")

con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))

ind <- seq_len(DBI::dbGetQuery(con, "SELECT count(*) FROM doodles")[[1L]])
num_classes <- DBI::dbGetQuery(con, "SELECT max(label_int) + 1 FROM doodles")[[1L]]

# Индексы для обучающей выборки
train_ind <- sample(ind, floor(length(ind) * 0.995))
# Индексы для проверочной выборки
val_ind <- ind[-train_ind]
rm(ind)
# Коэффициент масштаба
scale <- 0.5

# Проведение замера
res_bench <- bench::press(
  batch_size = 2^(4:10),
  {
    it1 <- train_generator(
      db_connection = con,
      samples_index = train_ind,
      num_classes = num_classes,
      batch_size = batch_size,
      scale = scale
    )
    bench::mark(
      it1(),
      min_iterations = 50L
    )
  }
)
# Параметры бенчмарка
cols <- c("batch_size", "min", "median", "max", "itr/sec", "total_time", "n_itr")
res_bench[, cols]

#   batch_size      min   median      max `itr/sec` total_time n_itr
#        <dbl> <bch:tm> <bch:tm> <bch:tm>     <dbl>   <bch:tm> <int>
# 1         16     25ms  64.36ms   92.2ms     15.9       3.09s    49
# 2         32   48.4ms 118.13ms 197.24ms     8.17       5.88s    48
# 3         64   69.3ms 117.93ms 181.14ms     8.57       5.83s    50
# 4        128  157.2ms 240.74ms 503.87ms     3.85      12.71s    49
# 5        256  359.3ms 613.52ms 988.73ms     1.54       30.5s    47
# 6        512  884.7ms    1.53s    2.07s     0.674      1.11m    45
# 7       1024     2.7s    3.83s    5.47s     0.261      2.81m    44

ggplot(res_bench, aes(x = factor(batch_size), y = median, group = 1)) +
    geom_point() +
    geom_line() +
    ylab("median time, s") +
    theme_minimal()

DBI::dbDisconnect(con, shutdown = TRUE)

Nchọpụta Doodle Draw ngwa ngwa: otu esi eme enyi na netwọkụ R, C++ na akwara

Ọ bụrụ na ị nwere oke RAM zuru oke, ị nwere ike ịme ngwa ngwa nke nchekwa data site na ịnyefe ya na RAM a (32 GB ezuru maka ọrụ anyị). Na Linux, a na-etinye nkebi ahụ na ndabara /dev/shm, na-eburu ihe ruru ọkara ikike RAM. Ị nwere ike ime ka ndị ọzọ pụta ìhè site na dezie /etc/fstabiji nweta ndekọ dị ka tmpfs /dev/shm tmpfs defaults,size=25g 0 0. Jide n'aka na ịmalitegharịa wee lelee nsonaazụ site na ịme iwu ahụ df -h.

The iterator maka ule data dị nnọọ mfe, ebe ọ bụ na ule dataset dabara kpamkpam na RAM:

Iterator maka data nnwale

test_generator <- function(dt,
                           batch_size = 32,
                           scale = 1,
                           color = FALSE,
                           imagenet_preproc = FALSE) {

  # Проверка аргументов
  checkmate::assert_data_table(dt)
  checkmate::assert_count(batch_size)
  checkmate::assert_number(scale, lower = 0.001, upper = 5)
  checkmate::assert_flag(color)
  checkmate::assert_flag(imagenet_preproc)

  # Проставляем номера батчей
  dt[, batch := (.I - 1L) %/% batch_size + 1L]
  data.table::setkey(dt, batch)
  i <- 1
  max_i <- dt[, max(batch)]

  # Замыкание
  function() {
    batch_x <- cpp_process_json_vector(dt[batch == i, drawing], 
                                       scale = scale, color = color)
    if (imagenet_preproc) {
      # Шкалирование c интервала [0, 1] на интервал [-1, 1]
      batch_x <- (batch_x - 0.5) * 2
    }
    result <- list(batch_x)
    i <<- i + 1
    return(result)
  }
}

4. Nhọrọ nke ụkpụrụ ụlọ

Ihe owuwu mbụ e ji mee ihe bụ ekwentị mkpanaaka v1, atụmatụ nke a na-atụle na nke a ozi. Agụnyere ya dị ka ọkọlọtọ keras na, ya mere, dị na ngwugwu nke otu aha maka R. Ma mgbe ị na-agbalị iji ya na otu ihe oyiyi ọwa, ihe dị ịtụnanya tụgharịrị: tensor ntinye ga-enwerịrị akụkụ ahụ mgbe niile. (batch, height, width, 3), ya bụ, ọnụ ọgụgụ nke ọwa enweghị ike ịgbanwe. Enweghị oke dị otú ahụ na Python, yabụ anyị gbara ọsọ wee dee mmejuputa iwu nke anyị, na-agbaso edemede mbụ (na-enweghị nkwụsị nke dị na ụdị keras):

Mobilenet v1 architecture

library(keras)

top_3_categorical_accuracy <- custom_metric(
    name = "top_3_categorical_accuracy",
    metric_fn = function(y_true, y_pred) {
         metric_top_k_categorical_accuracy(y_true, y_pred, k = 3)
    }
)

layer_sep_conv_bn <- function(object, 
                              filters,
                              alpha = 1,
                              depth_multiplier = 1,
                              strides = c(2, 2)) {

  # NB! depth_multiplier !=  resolution multiplier
  # https://github.com/keras-team/keras/issues/10349

  layer_depthwise_conv_2d(
    object = object,
    kernel_size = c(3, 3), 
    strides = strides,
    padding = "same",
    depth_multiplier = depth_multiplier
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() %>%
  layer_conv_2d(
    filters = filters * alpha,
    kernel_size = c(1, 1), 
    strides = c(1, 1)
  ) %>%
  layer_batch_normalization() %>% 
  layer_activation_relu() 
}

get_mobilenet_v1 <- function(input_shape = c(224, 224, 1),
                             num_classes = 340,
                             alpha = 1,
                             depth_multiplier = 1,
                             optimizer = optimizer_adam(lr = 0.002),
                             loss = "categorical_crossentropy",
                             metrics = c("categorical_crossentropy",
                                         top_3_categorical_accuracy)) {

  inputs <- layer_input(shape = input_shape)

  outputs <- inputs %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), strides = c(2, 2), padding = "same") %>%
    layer_batch_normalization() %>% 
    layer_activation_relu() %>%
    layer_sep_conv_bn(filters = 64, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 128, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 256, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 512, strides = c(1, 1)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(2, 2)) %>%
    layer_sep_conv_bn(filters = 1024, strides = c(1, 1)) %>%
    layer_global_average_pooling_2d() %>%
    layer_dense(units = num_classes) %>%
    layer_activation_softmax()

    model <- keras_model(
      inputs = inputs,
      outputs = outputs
    )

    model %>% compile(
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )

    return(model)
}

Ọdịmma nke ụzọ a doro anya. Achọrọ m ịnwale ọtụtụ ụdị, mana n'ụzọ megidere nke ahụ, achọghị m iji aka degharịa ihe owuwu ọ bụla. A napụkwara anyị ohere iji ịdị arọ nke ụdị a zụrụ azụ na imagenet. Dị ka ọ dị na mbụ, ịmụ akwụkwọ nyere aka. Ọrụ get_config() na-enye gị ohere ịnweta nkọwa nke ihe nlereanya n'ụdị kwesịrị ekwesị maka idezi (base_model_conf$layers - ndepụta R mgbe niile), na ọrụ ahụ from_config() na-eme ntụgharị ntụgharị gaa na ihe nlereanya:

base_model_conf <- get_config(base_model)
base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
base_model <- from_config(base_model_conf)

Ugbu a, ọ naghị esiri ike ide ọrụ zuru ụwa ọnụ iji nweta nke ọ bụla enyere keras ụdị nwere ma ọ bụ na-enweghị ibu a zụrụ azụ na imagenet:

Ọrụ maka ị na-ebunye architectures ndị emebere

get_model <- function(name = "mobilenet_v2",
                      input_shape = NULL,
                      weights = "imagenet",
                      pooling = "avg",
                      num_classes = NULL,
                      optimizer = keras::optimizer_adam(lr = 0.002),
                      loss = "categorical_crossentropy",
                      metrics = NULL,
                      color = TRUE,
                      compile = FALSE) {
  # Проверка аргументов
  checkmate::assert_string(name)
  checkmate::assert_integerish(input_shape, lower = 1, upper = 256, len = 3)
  checkmate::assert_count(num_classes)
  checkmate::assert_flag(color)
  checkmate::assert_flag(compile)

  # Получаем объект из пакета keras
  model_fun <- get0(paste0("application_", name), envir = asNamespace("keras"))
  # Проверка наличия объекта в пакете
  if (is.null(model_fun)) {
    stop("Model ", shQuote(name), " not found.", call. = FALSE)
  }

  base_model <- model_fun(
    input_shape = input_shape,
    include_top = FALSE,
    weights = weights,
    pooling = pooling
  )

  # Если изображение не цветное, меняем размерность входа
  if (!color) {
    base_model_conf <- keras::get_config(base_model)
    base_model_conf$layers[[1]]$config$batch_input_shape[[4]] <- 1L
    base_model <- keras::from_config(base_model_conf)
  }

  predictions <- keras::get_layer(base_model, "global_average_pooling2d_1")$output
  predictions <- keras::layer_dense(predictions, units = num_classes, activation = "softmax")
  model <- keras::keras_model(
    inputs = base_model$input,
    outputs = predictions
  )

  if (compile) {
    keras::compile(
      object = model,
      optimizer = optimizer,
      loss = loss,
      metrics = metrics
    )
  }

  return(model)
}

Mgbe ị na-eji onyonyo otu ọwa, ọ nweghị ihe ọ̀tụ̀tụ̀ a zụrụ azụ ka a na-eji. Enwere ike idozi nke a: iji ọrụ ahụ get_weights() nweta ihe atụ dị arọ n'ụdị ndepụta R, gbanwee akụkụ nke akụkụ mbụ nke ndepụta a (site n'iji otu ọwa agba ma ọ bụ nkezi atọ), wee buo ibu ahụ azụ n'ime ihe nlereanya ahụ na ọrụ ahụ. set_weights(). Ọ dịghị mgbe anyị gbakwunyere ọrụ a, n'ihi na n'oge a, o doro anya na ọ na-arụpụta ihe na-arụ ọrụ na foto agba.

Anyị mere ọtụtụ n'ime nnwale ndị a site na iji mobilenet ụdị 1 na 2, yana resnet34. Ihe owuwu ọgbara ọhụrụ dị ka SE-ResNeXt rụrụ nke ọma na asọmpi a. N'ụzọ dị mwute, anyị enweghị njikere mere mmejuputa iwu na anyị nwere, na anyị edeghị nke anyị (ma anyị ga-maa dee).

5. Parameterization nke scripts

Maka ịdị mma, koodu niile maka ịmalite ọzụzụ bụ nke e mere ka ọ bụrụ otu edemede, na-eji parameterized docopt dị ka ndị a:

doc <- '
Usage:
  train_nn.R --help
  train_nn.R --list-models
  train_nn.R [options]

Options:
  -h --help                   Show this message.
  -l --list-models            List available models.
  -m --model=<model>          Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>      Batch size [default: 32].
  -s --scale-factor=<ratio>   Scale factor [default: 0.5].
  -c --color                  Use color lines [default: FALSE].
  -d --db-dir=<path>          Path to database directory [default: Sys.getenv("db_dir")].
  -r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
  -n --n-gpu=<number>         Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)

Ihe ngwugwu docopt na-anọchi anya mmejuputa iwu http://docopt.org/ maka R. Site n'enyemaka ya, a na-eji iwu dị mfe malite edemede dịka Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db ma ọ bụ ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db, ọ bụrụ faịlụ train_nn.R bụ executable (iwu a ga-amalite ịzụ ihe nlereanya resnet50 na onyonyo agba atọ na-atụ 128x128 pikselụ, nchekwa data ga-adịrịrị na nchekwa /home/andrey/doodle_db). Ị nwere ike ịgbakwunye ọsọ mmụta, ụdị njikarịcha, na ihe ọ bụla ọzọ nwere ike ịhazi ya na ndepụta ahụ. Na usoro nke na-akwadebe akwụkwọ, ọ tụgharịrị na architecture mobilenet_v2 site na ụdị dị ugbu a keras na R eji enweghị ike n'ihi mgbanwe ndị a na-ejighị n'aka na ngwugwu R, anyị na-echere ka ha dozie ya.

Usoro a mere ka o kwe omume iji ụdị dị iche iche mee ka nnwale dị ngwa ngwa ma e jiri ya tụnyere mmalite ọdịnala nke scripts na RStudio (anyị na-ahụ ngwugwu ahụ dị ka ihe ọzọ ga-ekwe omume. tfruns). Mana isi uru bụ ikike ijikwa ngwa ngwa mmalite nke edemede na Docker ma ọ bụ naanị na sava ahụ, na-etinyeghị RStudio maka nke a.

6. Dockerization nke scripts

Anyị na-eji Docker iji hụ na ibugharị gburugburu ebe obibi maka ụdị ọzụzụ n'etiti ndị otu na maka itinye ngwa ngwa na igwe ojii. Ị nwere ike ịmalite ịmara ngwá ọrụ a, nke na-adịghị ahụkebe maka onye mmemme R, na nke a usoro mbipụta ma ọ bụ video N'ezie.

Docker na-enye gị ohere ịmepụta ihe oyiyi nke gị site na ọkọ ma jiri ihe oyiyi ndị ọzọ dị ka ihe ndabere maka ịmepụta nke gị. Mgbe anyị na-enyocha nhọrọ ndị dịnụ, anyị bịara na nkwubi okwu na ịwụnye NVIDIA, CUDA + cuDNN ndị ọkwọ ụgbọ ala na ọba akwụkwọ Python bụ akụkụ dị oke egwu nke onyonyo a, anyị kpebiri iwere onyonyo gọọmentị ka ndabere. tensorflow/tensorflow:1.12.0-gpu, na-agbakwunye ngwugwu R dị mkpa n'ebe ahụ.

Faịlụ docker ikpeazụ dị ka nke a:

dockerfile

FROM tensorflow/tensorflow:1.12.0-gpu

MAINTAINER Artem Klevtsov <[email protected]>

SHELL ["/bin/bash", "-c"]

ARG LOCALE="en_US.UTF-8"
ARG APT_PKG="libopencv-dev r-base r-base-dev littler"
ARG R_BIN_PKG="futile.logger checkmate data.table rcpp rapidjsonr dbi keras jsonlite curl digest remotes"
ARG R_SRC_PKG="xtensor RcppThread docopt MonetDBLite"
ARG PY_PIP_PKG="keras"
ARG DIRS="/db /app /app/data /app/models /app/logs"

RUN source /etc/os-release && 
    echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${UBUNTU_CODENAME}-cran35/" > /etc/apt/sources.list.d/cran35.list && 
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 && 
    add-apt-repository -y ppa:marutter/c2d4u3.5 && 
    add-apt-repository -y ppa:timsc/opencv-3.4 && 
    apt-get update && 
    apt-get install -y locales && 
    locale-gen ${LOCALE} && 
    apt-get install -y --no-install-recommends ${APT_PKG} && 
    ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r && 
    ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && 
    ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r && 
    echo 'options(Ncpus = parallel::detectCores())' >> /etc/R/Rprofile.site && 
    echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site && 
    apt-get install -y $(printf "r-cran-%s " ${R_BIN_PKG}) && 
    install.r ${R_SRC_PKG} && 
    pip install ${PY_PIP_PKG} && 
    mkdir -p ${DIRS} && 
    chmod 777 ${DIRS} && 
    rm -rf /tmp/downloaded_packages/ /tmp/*.rds && 
    rm -rf /var/lib/apt/lists/*

COPY utils /app/utils
COPY src /app/src
COPY tests /app/tests
COPY bin/*.R /app/

ENV DBDIR="/db"
ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/app:${PATH}"

WORKDIR /app

VOLUME /db
VOLUME /app

CMD bash

Maka ịdị mma, a na-etinye ngwugwu ndị e ji mee ihe n'ime mgbanwe; A na-eṅomi ọtụtụ n'ime ihe odide ederede n'ime akpa n'oge mgbakọ. Anyị gbanwekwara shei iwu ka ọ bụrụ /bin/bash maka mfe iji ọdịnaya /etc/os-release. Nke a zere mkpa ezipụta ụdị OS na koodu.

Na mgbakwunye, e dere obere edemede bash nke na-enye gị ohere ịmalite akpa nwere iwu dị iche iche. Dịka ọmụmaatụ, ndị a nwere ike ịbụ edemede maka ọzụzụ netwọkụ akwara ozi nke etinyeburu n'ime akpa ahụ, ma ọ bụ shei iwu maka nbipu na nyochaa ọrụ nke akpa:

Ederede iji malite akpa ahụ

#!/bin/sh

DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"

if [ -z "$1" ]; then
    CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
    ARGS="${ARGS} -ti"
else
    CMD="Rscript /app/train_nn.R $@"
fi

docker run ${ARGS} doodles-tf ${CMD}

Ọ bụrụ na-agba ọsọ edemede bash a na-enweghị paramita, a ga-akpọ edemede ahụ n'ime akpa ahụ train_nn.R na ụkpụrụ ndabara; ọ bụrụ na arụmụka ọnọdụ mbụ bụ "bash", mgbe ahụ akpa ahụ ga-amalite mkparịta ụka na shei iwu. N'ọnọdụ ndị ọzọ niile, a na-edochi ụkpụrụ nke arụmụka ọnọdụ: CMD="Rscript /app/train_nn.R $@".

Ọ dị mma ịmara na akwụkwọ ndekọ aha nwere data isi iyi na nchekwa data, yana ndekọ ndekọ maka ịchekwa ụdị a zụrụ azụ, etinyere n'ime akpa ahụ site na sistemụ nnabata, nke na-enye gị ohere ịnweta nsonaazụ nke edemede na-enweghị mkpachapụ na-enweghị isi.

7. Iji multiple GPUs na Google Cloud

Otu n'ime njirimara nke asọmpi ahụ bụ data na-eme mkpọtụ (lee foto aha, nke e nwetara site na @Leigh.plt site na ODS slack). Nnukwu batches na-enyere aka ịlụso nke a ọgụ, mgbe nnwale na PC nwere 1 GPU, anyị kpebiri ịmụta ụdị ọzụzụ na ọtụtụ GPU na igwe ojii. GoogleCloud eji (ezi ndu na isi) n'ihi nnukwu nhọrọ nke nhazi dịnụ, ọnụ ahịa ezi uche dị na ya na ego $ 300. N'ihi anyaukwu, m nyere iwu ihe atụ 4xV100 na SSD na ton nke RAM, nke ahụ bụkwa nnukwu mmejọ. Igwe dị otú ahụ na-eri ego ngwa ngwa; ị nwere ike ịga mebie nnwale na-enweghị ezigbo pipeline. Maka ebumnuche agụmakwụkwọ, ọ ka mma ị were K80. Ma nnukwu ego nke RAM batara - igwe ojii SSD adịghị amasị ya na arụmọrụ ya, ya mere ebufe data nchekwa data na ya. dev/shm.

Nke kacha amasị bụ iberi koodu maka iji ọtụtụ GPU. Nke mbụ, a na-emepụta ihe nlereanya na CPU site na iji njikwa gburugburu, dị ka Python:

with(tensorflow::tf$device("/cpu:0"), {
  model_cpu <- get_model(
    name = model_name,
    input_shape = input_shape,
    weights = weights,
    metrics =(top_3_categorical_accuracy,
    compile = FALSE
  )
})

Mgbe ahụ, a na-e depụtaghachi ihe atụ a na-achịkọtaghị (nke a dị mkpa) na ọnụ ọgụgụ GPU dịnụ, ma mgbe nke ahụ gasịrị, a na-achịkọta ya:

model <- keras::multi_gpu_model(model_cpu, gpus = n_gpu)
keras::compile(
  object = model,
  optimizer = keras::optimizer_adam(lr = 0.0004),
  loss = "categorical_crossentropy",
  metrics = c(top_3_categorical_accuracy)
)

Usoro kpochapụwo nke ịsacha oyi akwa niile ma e wezụga nke ikpeazụ, na-azụ oyi akwa ikpeazụ, ihichapụ na ịzụghachi ihe nlereanya niile maka ọtụtụ GPU.

A na-enyocha ọzụzụ n'ejighị ya. tensorboard, na-amachi onwe anyị na ịdekọ ndekọ na ịchekwa ụdị nwere aha ọmụma mgbe oge ọ bụla gasịrị:

Ndaghachi azụ

# Шаблон имени файла лога
log_file_tmpl <- file.path("logs", sprintf(
  "%s_%d_%dch_%s.csv",
  model_name,
  dim_size,
  channels,
  format(Sys.time(), "%Y%m%d%H%M%OS")
))
# Шаблон имени файла модели
model_file_tmpl <- file.path("models", sprintf(
  "%s_%d_%dch_{epoch:02d}_{val_loss:.2f}.h5",
  model_name,
  dim_size,
  channels
))

callbacks_list <- list(
  keras::callback_csv_logger(
    filename = log_file_tmpl
  ),
  keras::callback_early_stopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 8,
    verbose = 1,
    mode = "min"
  ),
  keras::callback_reduce_lr_on_plateau(
    monitor = "val_loss",
    factor = 0.5, # уменьшаем lr в 2 раза
    patience = 4,
    verbose = 1,
    min_delta = 1e-4,
    mode = "min"
  ),
  keras::callback_model_checkpoint(
    filepath = model_file_tmpl,
    monitor = "val_loss",
    save_best_only = FALSE,
    save_weights_only = FALSE,
    mode = "min"
  )
)

8. Kama nkwubi okwu

Ọtụtụ nsogbu ndị anyị zutere enwebeghị emeri:

  • в keras Enweghị ọrụ emebere maka ịchọ na-akpaghị aka maka ọnụego mmụta kacha mma (analogue lr_finder n'ọbá akwụkwọ ngwa ngwa.ai); Site na mbọ ụfọdụ, ọ ga-ekwe omume ibubata mmejuputa ndị ọzọ na R, dịka ọmụmaatụ, nke a;
  • N'ihi isi ihe gara aga, ọ gaghị ekwe omume ịhọrọ ọsọ ọzụzụ ziri ezi mgbe ị na-eji ọtụtụ GPU;
  • enweghi ihe owuwu netwọk neural nke oge a, karịsịa ndị a zụrụ azụ na imagenet;
  • Ọ dịghị onye usoro okirikiri na ọnụego mmụta ịkpa oke (cosine annealing bụ na arịrịọ anyị emejuputa atumatu, Daalụ skydan).

Kedu ihe bara uru a mụtara na asọmpi a:

  • Na ngwaike dị obere, ị nwere ike iji data dị mma (ọtụtụ oge nha RAM) rụọ ọrụ na-enweghị mgbu. Akpa rọba data na-echekwa ebe nchekwa n'ihi ngbanwe nke tebụl n'ime ebe, nke na-ezere iṅomi ha, na mgbe ejiri ya mee ihe n'ụzọ ziri ezi, ike ya fọrọ nke nta ka ọ bụrụ mgbe niile na-egosipụta ọsọ kachasị elu n'etiti ngwaọrụ niile mara anyị maka ide asụsụ. Ịchekwa data na nchekwa data na-enye gị ohere, n'ọtụtụ ọnọdụ, ịghara iche echiche ọ bụla gbasara mkpa ọ dị ịpịnye ihe niile dataset n'ime RAM.
  • Enwere ike iji ngwa ngwa dochie ọrụ ndị dị nwayọọ na R na C ++ site na iji ngwugwu Rcpp. Ọ bụrụ na mgbakwunye na iji eriri Rcpp ma ọ bụ RcppParallel, Anyị na-enweta mmejuputa multi-threaded cross-platform, n'ihi ya, ọ dịghị mkpa ka ọ dakọtara koodu na ọkwa R.
  • ngwugwu Rcpp enwere ike iji ya na-enweghị ezigbo ihe ọmụma nke C ++, a na-akọwapụta opekempe achọrọ ebe a. Faịlụ nkụnye eji isi mee maka ọtụtụ ụlọ akwụkwọ C dị mma dị ka xtensor dị na CRAN, ya bụ, a na-emepụta akụrụngwa maka mmejuputa ọrụ nke na-ejikọta koodu C++ dị elu nke emebere n'ime R. Ọdịmma agbakwunyere bụ ịkọwapụta syntax yana ihe nyocha koodu C++ static na RStudio.
  • docopt na-enye gị ohere ịme edemede nke nwere onwe ya na paramita. Nke a dị mma maka ojiji na sava dịpụrụ adịpụ, gụnyere. n'okpuru docker. Na RStudio, ọ naghị adị mfe ịme ọtụtụ awa nnwale na netwọkụ akwara ọzụzụ, yana ịwụnye IDE na sava n'onwe ya anaghị akwado ya mgbe niile.
  • Docker na-ahụ maka mbugharị koodu na nrụpụta nke nsonaazụ n'etiti ndị mmepe nwere ụdị OS na ọba akwụkwọ dị iche iche, yana ịdị mfe nke ogbugbu na sava. Ị nwere ike iji naanị otu iwu malite pipeline ọzụzụ niile.
  • Google Cloud bụ ụzọ enyi na mmefu ego iji nwalee na ngwaike dị oke ọnụ, mana ịkwesịrị ịhọrọ nhazi nke ọma.
  • Ịtụ ọsọ nke iberi koodu n'otu n'otu bara ezigbo uru, karịsịa mgbe ị na-ejikọta R na C++, yana ngwugwu. bench - dịkwa nnọọ mfe.

N'ozuzu, ahụmahụ a na-akwụghachi ụgwọ nke ukwuu ma anyị na-aga n'ihu na-arụ ọrụ iji dozie ụfọdụ nsogbu ndị e welitere.

isi: www.habr.com

Tinye a comment