Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we will talk a little about Apache Spark and its development prospects.


In the modern world of Big Data, Apache Spark is the de facto standard for developing batch data processing jobs. Beyond that, it is also used to build streaming applications that work in the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is questionable for most companies due to the lack of decent Hadoop distributions: development of HDP and CDH has stopped, CDH is poorly developed and expensive, and the remaining Hadoop vendors have either ceased to exist or have a dim future. Therefore, launching Apache Spark on Kubernetes attracts growing interest from the community and large companies: having become the standard in container orchestration and resource management in private and public clouds, Kubernetes solves the problem of awkward resource scheduling for Spark jobs on YARN and offers a steadily developing platform with many commercial and open distributions for companies of all sizes and stripes. On top of that, on the wave of its popularity, most companies have already acquired a couple of installations of their own and built up expertise in using it, which simplifies the move.

Starting with version 2.3.0, Apache Spark gained official support for running jobs on a Kubernetes cluster, and today we will talk about the current maturity of this approach, the different options for using it, and the pitfalls you will encounter along the way.

First of all, let us look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which you need to run a job on a Kubernetes cluster. Throughout this post, OpenShift is used as the distribution, and the commands relevant to its command-line utility (oc) are given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues (for example, for oc adm policy) can be used.
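
For most of the commands below, the two utilities are interchangeable, since both talk to the same Kubernetes API; only the RBAC-related "oc adm policy" calls are OpenShift-specific. A quick illustration (my-project is a placeholder namespace):

oc get pods -n my-project        # OpenShift client
kubectl get pods -n my-project   # standard Kubernetes client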

First use case - spark-submit

In the course of developing jobs and applications, the developer needs to run jobs to debug data transformations. In theory, stubs can be used for these purposes, but development involving real (albeit test) instances of the end systems has proven faster and better in this class of tasks. When we debug against real instances of the end systems, two scenarios are possible:

  • the developer runs the Spark job locally in standalone mode;


  • the developer runs the Spark job on a Kubernetes cluster in a test loop.


The first option has a right to exist, but entails a number of drawbacks:

  • each developer must be given access from the workplace to all instances of the end systems they need;
  • a sufficient amount of resources is required on the working machine to run the job being developed.

The second option is free of these drawbacks, since using a Kubernetes cluster lets you allocate the necessary pool of resources for running jobs and provide it with the required access to the end-system instances, flexibly granting access to that pool via the Kubernetes role model to all members of the development team. Let us highlight it as the first use case: launching Spark jobs from a local developer machine on a Kubernetes cluster in a test loop.

Let us say more about the process of setting Spark up to run locally. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages needed to work with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

A full build takes a lot of time, and to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the "assembly/" directory, so you can build just this subproject:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark jobs on Kubernetes, you need to create a Docker image to use as the base image. There are 2 possible approaches here:

  • the created Docker image includes the executable Spark job code;
  • the created image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let us build a Docker image containing a test example of a Spark job. To create Docker images, Spark ships with a utility called "docker-image-tool". Let us study its help:

./bin/docker-image-tool.sh --help

With its help, you can create Docker images and upload them to remote registries, but by default it has a number of drawbacks:

  • it always creates 3 Docker images at once - for Spark, PySpark and R;
  • it does not let you specify an image name.

Therefore, we will use a modified version of this utility, given below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With its help, we build a base Spark image containing a test job for computing Pi with Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, matching the project in OpenShift, {image-name} is the name of the image (if three-level separation of images is used, as, for example, in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build
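
For concreteness, here is the same call with hypothetical values substituted for the placeholders (the registry address, repository, image name and tag below are assumptions; substitute your own):

./bin/docker-image-tool-upd.sh \
  -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile \
  -r registry.example.com:5000/spark-dev \
  -i spark-test \
  -t v2.4.5 \
  build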

We log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Let us get the current user's token for authorization in the Docker Registry:

oc whoami -t

We log in to the internal Docker registry of the OKD cluster (using the token obtained with the previous command as the password):

docker login {docker-registry-url}

Let us upload the built Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let us check that the built image is available in OKD. To do this, open the URL with the list of images of the corresponding project in the browser (here {project} is the name of the project inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift Web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
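
The same check can also be done from the command line: in OpenShift, images pushed to the internal registry are tracked as image streams ("is" is the short name for "imagestream"):

oc get is {image-name} -n {project}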

To run jobs, a service account must be created with the privileges to run pods as root (we will discuss this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let us run spark-submit to publish our Spark job to the OKD cluster, specifying the created service account and the Docker image:

/opt/spark/bin/spark-submit \
  --name spark-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.namespace={project} \
  --conf spark.submit.deployMode=cluster \
  --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} \
  --conf spark.master=k8s://https://{OKD-API-URL} \
  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Here:

--name is the name of the job that will participate in forming the names of the Kubernetes pods;

--class is the class of the executable file, called when the job starts;

--conf are the Spark configuration parameters;

spark.executor.instances is the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName is the name of the Kubernetes service account used when launching the pods (to define the security context and capabilities when interacting with the Kubernetes API);

spark.kubernetes.namespace is the Kubernetes namespace in which the driver and executor pods will be launched;

spark.submit.deployMode is the launch mode for Spark ("cluster" is used for standard spark-submit, "client" for Spark Operator and later versions of Spark);

spark.kubernetes.container.image is the Docker image used to launch the pods;

spark.master is the Kubernetes API URL (the external one is specified, so that access happens from the local machine);

local:// is the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
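
The same can be followed from the command line while the job is running (a sketch; the driver pod name is generated from the --name value, so we simply grep for "driver"):

# Watch the driver and executor pods appear and complete:
oc get pods -n {project} -w

# Tail the driver log (picks the first pod whose name contains "driver"):
oc logs -f $(oc get pods -n {project} -o name | grep driver | head -n 1) -n {project}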

To simplify the development process, another option can be used, in which a common base image of Spark is created and used by all jobs, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified in the spark-submit call as a link. In this case, you can run different versions of Spark jobs without rebuilding the Docker images, using, for example, WebHDFS to publish the executables. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired path to the file on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE

You will receive a response of the following form (here {location} is the URL that must be used to upload the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"
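
For CI purposes, the two WebHDFS calls can be glued together in a small helper script that extracts the redirect URL automatically. This is only a sketch: the host, port and paths below are assumptions (WebHDFS listens on 50070 by default in Hadoop 2.x and 9870 in 3.x):

#!/usr/bin/env bash
# Publish a local jar to HDFS via WebHDFS in one go (hypothetical values).
HOST=namenode.example.com
PORT=50070
LOCAL_JAR=./target/my-spark-job.jar
HDFS_PATH=/jobs/my-spark-job.jar

# Step 1: the NameNode answers with a 307 redirect pointing at a DataNode.
LOCATION=$(curl -s -i -X PUT \
  "http://${HOST}:${PORT}/webhdfs/v1${HDFS_PATH}?op=CREATE&overwrite=true" \
  | grep -i '^Location:' | tr -d '\r' | awk '{print $2}')

# Step 2: upload the file body to the URL from the redirect.
curl -s -i -X PUT -T "${LOCAL_JAR}" "${LOCATION}"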

After this, we can do spark-submit using the Spark executable file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to complete the job):

/opt/spark/bin/spark-submit \
  --name spark-test \
  --class {class-name} \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.namespace={project} \
  --conf spark.submit.deployMode=cluster \
  --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} \
  --conf spark.master=k8s://https://{OKD-API-URL} \
  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile to copy the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration directory in SPARK_CLASSPATH in entrypoint.sh.
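
A sketch of what those additions might look like (all directory names below are assumptions to be adjusted to your image layout):

# Dockerfile additions (shown as comments, since the rest of this sketch is shell):
#   COPY hadoop-libs/  /opt/spark/jars/
#   COPY hadoop-conf/  /opt/hadoop/conf/

# entrypoint.sh addition, before the driver/executor is launched, so that
# core-site.xml and hdfs-site.xml become visible to Spark:
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/hadoop/conf"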

Second use case - Apache Livy

Further on, when a job has been developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Of course, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With its help, you can submit Spark jobs to the Kubernetes cluster with regular cURL requests, which is easy to implement on top of any CI solution, and its placement inside the Kubernetes cluster solves the question of authentication when interacting with the Kubernetes API.


Let us highlight it as the second use case: running Spark jobs as part of a CI/CD process on a Kubernetes cluster in a test loop.

A few words about Apache Livy: it works as an HTTP server providing a Web interface and a RESTful API that lets you launch spark-submit remotely by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one: github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image including Spark version 2.4.5 was built from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The created image can be built and uploaded to your Docker repository, for example the internal OKD registry. To deploy it, use the following manifest ({registry-url} is the Docker image registry URL, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be accessible; the "Route" manifest is used if Red Hat OpenShift is the Kubernetes distribution, otherwise a corresponding Ingress or Service manifest of type NodePort is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying it and successfully launching the pod, the Livy graphical interface is available at: http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request from, for example, Postman. An example of a collection with requests is presented below (configuration arguments with the variables needed by the launched job can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}
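
If you prefer plain cURL over Postman, the first request from the collection boils down to a single POST to Livy's /batches endpoint; the response contains a batch id whose state can then be polled (both endpoints are part of the standard Livy REST API):

# Submit the batch (same body as request 1 in the collection):
curl -s -X POST "http://{livy-url}/batches" \
  -H 'Content-Type: application/json' \
  -d '{
        "file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar",
        "className": "org.apache.spark.examples.SparkPi",
        "numExecutors": 1,
        "name": "spark-test-1",
        "conf": {
          "spark.jars.ivy": "/tmp/.ivy",
          "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
          "spark.kubernetes.namespace": "{project}",
          "spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"
        }
      }'

# Poll the state of batch 0 (substitute the id returned above):
curl -s "http://{livy-url}/batches/0/state"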

Let us execute the first request from the collection, go to the OKD interface and check that the job has launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or the graphical interface, you can track the progress of the job and study the session logs.

Now let us show how Livy works. To do this, let us examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs. From them we can see that calling the Livy REST API in the container named "livy" executes a spark-submit similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that lets you run jobs whose Spark executable is hosted remotely, using a Livy server.

Third use case - Spark Operator

Now that the job has been tested, the question arises of running it regularly. The native way to run jobs regularly in a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which, among other things, is used in enterprise-level solutions (for example, Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited configuration options for running Spark jobs in Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, although its release date remains unknown. The Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into Spark pods) and the ability to run jobs on a regular schedule.

Let us highlight it as the third use case: regularly running Spark jobs on a Kubernetes cluster in a production loop.

Spark Operator is open source and is developed within the Google Cloud Platform: github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in 3 ways:

  1. As part of the Lightbend FastData Platform / Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator

  3. Using the manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). The following is worth noting: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the corresponding API version, for example, "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included with the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow space for the Cloudflow installation), along with a corresponding Kubernetes resource type named "sparkapplications". You can explore the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run jobs using the Spark Operator, you need to do 3 things:

  • create a Docker image that includes all the necessary libraries, as well as the configuration and executable files. In the target picture, this image is created at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • create a manifest of type "SparkApplication" describing the job to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). There are important points to note about the manifest:
    1. the "apiVersion" dictionary must indicate the API version corresponding to the operator version;
    2. the "metadata.namespace" dictionary must indicate the namespace in which the application will be launched;
    3. the "spec.image" dictionary must contain the address of the created Docker image in an accessible registry;
    4. the "spec.mainClass" dictionary must contain the Spark job class that needs to run when the process starts;
    5. the path to the executable jar file must be specified in the "spec.mainApplicationFile" dictionary;
    6. the "spec.sparkVersion" dictionary must indicate the version of Spark being used;
    7. the "spec.driver.serviceAccount" dictionary must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" dictionary must indicate the amount of resources allocated to the application;
    9. the "spec.volumeMounts" dictionary must specify the local directory in which the local Spark job files will be created.

An example of composing such a manifest (here {spark-service-account} is the service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account for which, before publishing the manifest, you must create the role bindings that grant the Spark application the access rights it needs to interact with the Kubernetes API (if necessary). In our case, the application needs the rights to create Pods. Let us create the necessary role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}
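
On vanilla Kubernetes, the equivalent binding can be created with kubectl against the built-in "edit" ClusterRole (a sketch; like the oc command above, this grants noticeably more than pod creation alone, so a narrower Role may be preferable in production):

kubectl create rolebinding spark-edit \
  --clusterrole=edit \
  --serviceaccount={project}:{spark-service-account} \
  -n {project}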

It is also worth noting that the manifest specification may include a "hadoopConfigMap" parameter, which lets you mount a ConfigMap with the Hadoop configuration without first placing the corresponding file in the Docker image. The operator is also suitable for running jobs regularly: using the "schedule" parameter, a schedule for running a given job can be specified, as shown in the sketch below.
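
For the regular-run case specifically, the operator ships a separate ScheduledSparkApplication resource that wraps the same application spec in a template. Below is a sketch of an hourly run based on the manifest above (the resource name and the schedule expression are assumptions):

cat <<'EOF' | oc apply -n {project} -f -
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
spec:
  schedule: "@every 1h"
  concurrencyPolicy: Allow
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"
EOF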

After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This will create an object of type "sparkapplications":

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

In this case, a pod with the application will be created, and its status will be displayed in the created "sparkapplications" object. You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

After the job completes, the pod will move to the "Completed" status, which will also be updated in "sparkapplications". The application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running job):

oc logs {sparkapplications-pod-name} -n {project}

Spark jobs can also be managed with the specialized sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let us examine the list of running Spark jobs:

sparkctl list -n {project}

Let us create a description for a Spark job:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let us run the described job using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let us examine the list of running Spark jobs:

sparkctl list -n {project}

Let us examine the list of events of the launched Spark job:

sparkctl event spark-pi -n {project} -f

Let us examine the status of the running Spark job:

sparkctl status spark-pi -n {project}

In conclusion, I would like to go over the drawbacks discovered in operating the current stable version of Spark (2.4.5) in Kubernetes:

  1. The first and, perhaps, main drawback is the lack of Data Locality. For all the shortcomings of YARN, there were advantages to using it too, for example, the principle of delivering code to data (rather than data to code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the computations resided, which noticeably reduced the time spent delivering data over the network. When using Kubernetes, we face the need to move the data involved in a job across the network. If it is large enough, job execution time can grow significantly, and a fairly large amount of disk space must be allocated to the Spark job instances for its temporary storage. This drawback can be mitigated with specialized software that provides data locality in Kubernetes (for example, Alluxio), but it effectively means storing a full copy of the data on the nodes of the Kubernetes cluster.
  2. The second important drawback is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options appeared in version 3.0.0, which will require additional work), and the security documentation for Spark (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as key stores. At the same time, the user under which Spark jobs are launched cannot be specified directly: we only specify the service account under which the job will work, and the user is selected according to the configured security policies. In this regard, either the root user is used, which is unsafe in a productive environment, or a user with a random UID, which is inconvenient when distributing access rights to data (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). At the moment, the solution is either to place all the necessary files directly into the Docker image, or to modify the Spark launch script to use the mechanism for storing and retrieving secrets adopted in your organization.
  3. Running Spark jobs on Kubernetes is officially still in experimental mode, and significant changes in the artifacts used (configuration files, Docker base images and launch scripts) are possible in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed significantly.

Let us wait for updates: a fresh version of Spark (3.0.0) was recently released, bringing tangible changes to Spark's work on Kubernetes, yet retaining the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark jobs on Kubernetes without fearing for the security of your system and without the need to modify functional components on your own.


source: www.habr.com
