Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we are going to talk a bit about Apache Spark and its development prospects.

In the modern Big Data world, Apache Spark is the de facto standard for developing batch data processing jobs. In addition, it is also used to build streaming applications that work on the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form had become questionable for most companies because of the lack of decent Hadoop distributions: the development of HDP and CDH has stopped, CDH is poorly developed and expensive, and the remaining Hadoop vendors have either ceased to exist or have a murky future. That is why launching Apache Spark on Kubernetes is attracting growing interest from the community and from large companies: having become the standard for container orchestration and resource management in private and public clouds, it solves the problem of inconvenient resource scheduling for Spark jobs on YARN and provides a steadily developing platform with many commercial and open distributions for companies of all sizes and stripes. Besides, on the wave of its popularity, most companies have already acquired a couple of installations of their own and built up expertise in using it, which simplifies the move.

Starting with version 2.3.0, Apache Spark gained official support for running jobs on a Kubernetes cluster, and today we will talk about the current maturity of this approach, the different options for using it, and the pitfalls you will encounter along the way.

First of all, let's look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which a job needs to be run on a Kubernetes cluster. While preparing this post, OpenShift was used as the distribution, so the commands relevant to its command-line utility (oc) will be given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues (for example, for oc adm policy) can be used.

The first use case - spark-submit

While developing jobs and applications, the developer needs to run jobs to debug data transformations. In theory, stubs could be used for this purpose, but development with the participation of real (even if test) instances of the end systems has proven to be faster and better for this class of tasks. When we debug against real instances of the end systems, two scenarios are possible:

  • the developer runs the Spark job locally in standalone mode;

  • the developer runs the Spark job on a Kubernetes cluster in the test loop.

The first option has a right to exist, but it comes with a number of drawbacks:

  • each developer must be granted access from his workstation to all the instances of the end systems he needs;
  • a sufficient amount of resources is required on the work machine to run the job being developed.

The second option is free of these drawbacks, since using a Kubernetes cluster lets you allocate the pool of resources needed for running jobs and grant it the required access to the end-system instances, flexibly providing access to it via the Kubernetes role model for all members of the development team. Let's highlight it as the first use case - launching Spark jobs from a local developer machine on a Kubernetes cluster in the test loop.

Let's talk in more detail about the process of setting up Spark to run locally. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages required for working with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

A full build takes a lot of time, and to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the "assembly/" directory, so you can build just this subproject:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark jobs on Kubernetes, you need to create a Docker image to be used as the base image. Two approaches are possible here:

  • the resulting Docker image includes the executable Spark job code;
  • the resulting image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let's build a Docker image containing a test example of a Spark job. To build Docker images, Spark ships with a utility called "docker-image-tool". Let's look at its help:

./bin/docker-image-tool.sh --help

It can build Docker images and upload them to remote registries, but by default it has several drawbacks:

  • it always builds 3 Docker images at once - for Spark, PySpark and R;
  • it does not let you specify an image name.

Therefore, we will use a modified version of this utility, given below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With its help, we build the base Spark image containing a test job for computing Pi with Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level image separation is used, as, for example, in the integrated image registry of Red Hat OpenShift), and {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Let's obtain the current user's token for authorization in the Docker Registry:

oc whoami -t

Log in to the internal Docker Registry of the OKD cluster (we use the token obtained with the previous command as the password):

docker login {docker-registry-url}
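
For scripted use (for example, on a CI runner) the two previous steps can be combined into a single command; this is only a sketch and assumes your registry accepts the OpenShift token as a password:

docker login -u "$(oc whoami)" -p "$(oc whoami -t)" {docker-registry-url}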

Let's upload the built Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let's check that the built image is available in OKD. To do so, open the URL with the list of images of the corresponding project in the browser (here {project} is the project name inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift Web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
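
The same check can also be done from the command line; the sketch below assumes the image landed in the internal registry as an ImageStream named {image-name}:

oc get is {image-name} -n {project}
oc describe is {image-name} -n {project}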

To run jobs, a service account with the privileges to run pods as root must be created (we will discuss this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let's run spark-submit to publish the Spark job to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Here:

--name: the job name that will take part in forming the names of the Kubernetes pods;

--class: the class of the executable file, called when the job starts;

--conf: Spark configuration parameters;

spark.executor.instances: the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName: the name of the Kubernetes service account used when launching pods (to define the security context and the capabilities when interacting with the Kubernetes API);

spark.kubernetes.namespace: the Kubernetes namespace in which the driver and executor pods will run;

spark.submit.deployMode: the Spark launch mode ("cluster" is used for the standard spark-submit, "client" for Spark Operator and for later versions of Spark);

spark.kubernetes.container.image: the Docker image used to launch the pods;

spark.master: the Kubernetes API URL (the external one is specified so that the launch happens from the local machine);

local:// is the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
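
The same can be done from the command line; as a sketch, it relies on the spark-role labels that Spark on Kubernetes puts on the pods it creates:

oc get pods -n {project} -l spark-role=driver
oc logs -f -n {project} $(oc get pods -n {project} -l spark-role=driver -o name | tail -n 1)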

To simplify the development process, another option can be used, in which a common base Spark image is built and used by all jobs to be run, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified in the spark-submit call as a link. In this case you can run different versions of Spark jobs without rebuilding the Docker images, using, for example, WebHDFS to publish the files. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired path to the file in HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE

You will receive a response like this (here {location} is the URL that must be used to upload the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can run spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to perform the job):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile to copy the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration files in SPARK_CLASSPATH in entrypoint.sh.
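
As an illustration only (the directory names below are assumptions and should be adjusted to your image layout), these changes could look roughly like this:

# Dockerfile: copy the dependent libraries and the Hadoop client configuration into the image
#   COPY extra-jars/*.jar  /opt/spark/jars/
#   COPY hadoop-conf/      /opt/spark/hadoop-conf/

# entrypoint.sh: make the HDFS configuration visible to Spark
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf"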

The second use case - Apache Livy

Next, when a job has been developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Of course, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With it, you can submit Spark jobs to the Kubernetes cluster using regular cURL requests, which is easy to implement on top of any CI solution, and its placement inside the Kubernetes cluster solves the question of authentication when interacting with the Kubernetes API.

Let's highlight it as the second use case - running Spark jobs as part of the CI/CD process on a Kubernetes cluster in the test loop.

A few words about Apache Livy: it works as an HTTP server that provides a Web interface and a RESTful API which let you launch spark-submit remotely by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, for example this one - github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image was built, including Spark version 2.4.5, from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The resulting image can be built and uploaded to an existing Docker repository, for example the internal OKD repository. To deploy it, use the following manifest ({registry-url} is the URL of the Docker image registry, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be accessible; the "Route" manifest is used if Red Hat OpenShift is used as the Kubernetes distribution, otherwise a corresponding Ingress or Service manifest of the NodePort type is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After it is applied and the pod is successfully launched, the Livy graphical interface is available at http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request from, for example, Postman. An example of a collection with requests is given below (configuration arguments with variables needed for the launched job to work can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}

Let's execute the first request from the collection, go to the OKD interface and check that the job has been launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui) in which, using the Livy API or the graphical interface, you can track the progress of the job and study the session logs.
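
The same request can be sent without Postman, using plain cURL from a CI runner; the sketch below uses Livy's standard batches endpoints, and {batch-id} stands for the "id" field returned by the submission call:

curl -s -H 'Content-Type: application/json' -X POST "http://{livy-url}/batches" -d '{
  "file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "numExecutors": 1,
  "name": "spark-test-1",
  "conf": {
    "spark.jars.ivy": "/tmp/.ivy",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.kubernetes.namespace": "{project}",
    "spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"
  }
}'

# Poll the batch state and read its log:
curl -s "http://{livy-url}/batches/{batch-id}"
curl -s "http://{livy-url}/batches/{batch-id}/log"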

Now let's show how Livy works. To do so, let's examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=log. From them we can see that calling the Livy REST API in the container named "livy" triggers a spark-submit similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that lets you run jobs whose Spark executable is hosted remotely, using the Livy server.
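
The same log can also be read from the command line (the container name "livy" comes from the Deployment manifest above):

oc logs -f {livy-pod-name} -c livy -n {project}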

The third use case - Spark Operator

Now that the job has been tested, the question arises of running it on a regular basis. The native way to run jobs regularly in a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator which is also used in Enterprise-grade solutions (for example, Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited options for configuring Spark jobs on Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, but its release date remains unknown. Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into the Spark pods) and the ability to run jobs on a regular schedule.

Let's highlight it as the third use case - regularly running Spark jobs on a Kubernetes cluster in the production loop.

Spark Operator is open source and is developed within Google Cloud Platform - github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in 3 ways:

  1. As part of the Lightbend FastData Platform/Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator

  3. Using the manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). One thing is worth noting here: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions must be based on the example tags in Git with the matching API version, for example "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow namespace for the Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can explore the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run a job using Spark Operator, you need to do 3 things:

  • create a Docker image that includes all the necessary libraries as well as the configuration and executable files. In the target picture, this is an image built at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • generate a manifest of the "SparkApplication" type with a description of the job to be launched. Example manifests are available in the official repository (for example, github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). There are important points to note about the manifest:
    1. the "apiVersion" dictionary must specify the API version corresponding to the operator version;
    2. the "metadata.namespace" dictionary must specify the namespace in which the application will be launched;
    3. the "spec.image" dictionary must contain the address of the built Docker image in an accessible registry;
    4. the "spec.mainClass" dictionary must contain the Spark job class that needs to run when the process starts;
    5. the "spec.mainApplicationFile" dictionary must contain the path to the executable jar file;
    6. the "spec.sparkVersion" dictionary must specify the version of Spark being used;
    7. the "spec.driver.serviceAccount" dictionary must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" dictionary must specify the amount of resources allocated to the application;
    9. the "spec.volumeMounts" dictionary must specify the local directory in which the local Spark job files will be created.

An example of generating the manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account for which, before publishing the manifest, you must create the required role bindings that give the Spark application the access rights it needs to interact with the Kubernetes API (if necessary). In our case, the application needs the right to create Pods. Let's create the required role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}

It is also worth noting that the spec of this manifest may include a "hadoopConfigMap" parameter, which lets you specify a ConfigMap with the Hadoop configuration without having to place the corresponding file in the Docker image first. It is also suitable for running jobs regularly: using the "schedule" parameter, a schedule for running a particular job can be specified.
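
As a sketch only (it relies on the ScheduledSparkApplication resource type shipped with the operator; check the field names against your operator's API version, and the name spark-pi-schedule is just an example), a scheduled run could be declared roughly like this:

cat <<'EOF' | oc apply -n {project} -f -
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-schedule
spec:
  schedule: "@every 1h"
  concurrencyPolicy: Allow
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"
EOF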

After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This will create an object of the "sparkapplications" type:

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

After this, a pod with the application will be created, and its status will be reflected in the created "sparkapplications" object. You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

When the job completes, the pod will move to the "Completed" status, which will also be updated in "sparkapplications". The application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running job):

oc logs {sparkapplications-pod-name} -n {project}

Spark jobs can also be managed with the dedicated sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's check the list of running Spark jobs:

sparkctl list -n {project}

Let's create a description for a Spark job:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's launch the described job using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's check the list of running Spark jobs:

sparkctl list -n {project}

Let's check the list of events of the launched Spark job:

sparkctl event spark-pi -n {project} -f

Let's check the status of the running Spark job:

sparkctl status spark-pi -n {project}

In conclusion, I would like to note the drawbacks discovered while operating the current stable version of Spark (2.4.5) on Kubernetes:

  1. The first and, perhaps, main drawback is the lack of Data Locality. Despite all the shortcomings of YARN, there were also advantages to using it, for example the principle of shipping code to the data (rather than data to the code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the computations resided, and the time needed to deliver data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a job across the network. If the data is large enough, the job execution time can increase significantly, and a fairly large amount of disk space is required on the Spark job instances for its temporary storage. This drawback can be mitigated by using specialized software that provides data locality in Kubernetes (for example, Alluxio), but this effectively means having to store a full copy of the data on the nodes of the Kubernetes cluster.
  2. The second important drawback is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and in the Spark security documentation (https://spark.apache.org/docs/2.4.5/security.html) only YARN, Mesos and Standalone Cluster appear as key stores. At the same time, the user under which Spark jobs are launched cannot be specified directly: we only specify the service account under which they will run, and the user is selected based on the configured security policies. As a result, either the root user is used, which is not secure in a productive environment, or a user with a random UID, which is inconvenient when distributing access rights to data (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). At the moment, the solution is either to place all the necessary files directly into the Docker image, or to modify the Spark launch scripts to use the mechanism for storing and retrieving secrets adopted in your organization.
  3. Running Spark jobs on Kubernetes is still officially in experimental mode, and significant changes in the artifacts used (configuration files, base Docker images and launch scripts) are possible in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed significantly.

Let's wait for updates: a fresh version of Spark (3.0.0) was recently released, which brought significant changes to how Spark works on Kubernetes but kept the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to recommend abandoning YARN and running Spark jobs on Kubernetes without fearing for the security of your system and without having to patch functional components on your own.

That's all.

Source: www.habr.com
