Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we will talk a little about Apache Spark and its development prospects.

In the modern Big Data world, Apache Spark is the de facto standard for developing batch data processing jobs. It is also used to build streaming applications that work in the micro-batch paradigm, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this classic form is questionable for most companies due to the lack of decent Hadoop distributions: development of HDP and CDH has stalled, CDH is poorly maintained and expensive, and the remaining Hadoop vendors have either ceased to exist or face an unclear future. Therefore, launching Apache Spark on Kubernetes is of growing interest to the community and large companies: having become the standard in container orchestration and resource management in private and public clouds, Kubernetes solves the problem of inconvenient resource scheduling for Spark jobs on YARN and provides a steadily developing platform with many commercial and open source distributions for companies of all sizes and stripes. In addition, thanks to its popularity, most organizations have already acquired a couple of their own installations and built up expertise in using it, which simplifies the move.

Starting with version 2.3.0, Apache Spark gained official support for running jobs in a Kubernetes cluster, and today we will talk about the current maturity of this approach, the various options for using it, and the pitfalls you will encounter along the way.

First of all, let's look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which a job needs to be run on a Kubernetes cluster. In preparing this post, OpenShift is used as the distribution, and the commands given are for its command line utility (oc). For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command line utility (kubectl) or their analogues (for example, oc adm policy) can be used.

First use case: spark-submit

While developing jobs and applications, a developer needs to run jobs to debug data transformations. In theory, stubs can be used for this purpose, but development involving real (albeit test) instances of the end systems has proven faster and better for this class of tasks. When we debug against real instances of the end systems, two scenarios are possible:

  • the developer runs the Spark job locally in standalone mode;

  • the developer runs the Spark job on a Kubernetes cluster in a test loop.

The first option has a right to exist, but entails a number of drawbacks:

  • every developer must be given access from the workplace to all instances of the end systems they need;
  • a sufficient amount of resources is required on the workstation to run the job being developed.

The second option lacks these drawbacks, since using a Kubernetes cluster allows you to allocate the necessary resource pool for running jobs and grant it the required access to the instances of the end systems, flexibly providing access to it via the Kubernetes role model for all members of the development team. Let's single it out as the first use case: launching Spark jobs from a local developer machine on a Kubernetes cluster in a test loop.

Let's talk in more detail about the process of setting up Spark for local runs. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages needed for working with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

A full build takes a lot of time, and to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the "assembly/" directory, so you can build just that subproject:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark jobs on Kubernetes, you need to create a Docker image to use as the base image. Two approaches are possible here:

  • the resulting Docker image includes the executable Spark job code;
  • the resulting image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let's build a Docker image containing a test example of a Spark job. To build Docker images, Spark ships with a utility called "docker-image-tool". Let's read its help:

./bin/docker-image-tool.sh --help

With its help, you can create Docker images and upload them to remote registries, but by default it has a number of drawbacks:

  • it always creates 3 Docker images at once: for Spark, PySpark and R;
  • it does not allow you to specify an image name.

Therefore, we will use a modified version of this utility, shown below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With its help, we build a base Spark image containing a test job that computes Pi using Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the image name (if three-level separation of images is used, for example, as in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this image version):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Let's get the current user's token for authorization in the Docker Registry:

oc whoami -t

Log in to the internal Docker Registry of the OKD cluster (we use the token obtained with the previous command as the password):

docker login {docker-registry-url}

Let's upload the built Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let's check that the built image is available in OKD. To do this, open the URL with the list of images of the corresponding project in the browser (here {project} is the name of the project inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift Web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.

To run jobs, a service account must be created with the privileges to run pods as root (we will discuss this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let's run the spark-submit command to publish a Spark job to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Π—Π΄Π΅ΡΡŒ:

-zita - zita rebasa iro richatora chikamu mukuumbwa kwezita reKubernetes pods;

-kirasi - kirasi yefaira rinogoneka, rakadanwa kana basa ratanga;

-conf - Spark kumisikidza paramita;

spark.executor.instances - nhamba yeva Spark executors kuti vatange;

spark.kubernetes.authenticate.driver.serviceAccountName - zita reKubernetes service account rinoshandiswa pakuvhura mapods (kutsanangura mamiriro ekuchengetedza uye kugona kana uchidyidzana neKubernetes API);

spark.kubernetes.namespace - Kubernetes namespace umo mutyairi uye executor pods ichavhurwa;

spark.submit.deployMode β€” nzira yekutangisa Spark (yeyakajairwa spark-submit "cluster" inoshandiswa, yeSpark Operator uye gare gare shanduro dzeSpark "client");

spark.kubernetes.container.image - Docker mufananidzo wakashandiswa kuburitsa pods;

spark.master - Kubernetes API URL (yekunze yakatsanangurwa saka kuwana kunoitika kubva kumuchina wemuno);

local:// ndiyo nzira inoenda kuSpark inoitiswa mukati meDocker mufananidzo.

We go to the corresponding OKD project and examine the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.

To simplify the development process, another option can be used, in which a common base Spark image is built and used by all jobs, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified in the spark-submit call as a link. In this case, you can run different versions of Spark jobs without rebuilding Docker images, using, for example, WebHDFS to publish the executables. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired file path on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"

You will receive a response like this (here {location} is the URL to use to upload the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can run spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to perform the job):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile that copies the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration file in SPARK_CLASSPATH in the entrypoint.
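A minimal sketch of such changes, assuming the dependency jars are collected in a local jars/ directory and the HDFS client configs in hadoop-conf/ (both directory names are illustrative, not the exact layout of the image above):

```dockerfile
# Hypothetical fragment added to the job Dockerfile:
# ship dependency libraries next to the Spark jars, plus the HDFS client config
COPY jars/ /opt/spark/jars/
COPY hadoop-conf/ /opt/spark/hadoop-conf/
```

```shell
# Hypothetical fragment added to entrypoint.sh before spark-submit is invoked:
# core-site.xml/hdfs-site.xml must be on the classpath for hdfs:// paths to resolve
SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf"
export SPARK_CLASSPATH
```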

Second use case: Apache Livy

Further, when a job is developed and the result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its runs. Of course, you could run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With it, you can run Spark jobs on a Kubernetes cluster using ordinary cURL requests, which is easy to implement with any CI solution, and its placement inside the Kubernetes cluster solves the issue of authentication when interacting with the Kubernetes API.

Let's single it out as the second use case: running Spark jobs as part of the CI/CD process on a Kubernetes cluster in a test loop.

A little about Apache Livy: it works as an HTTP server providing a Web interface and a RESTful API that let you launch spark-submit remotely, passing the required parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one: github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image was built, including Spark version 2.4.5, from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The built image can be pushed to your existing Docker repository, such as the internal OKD repository. To deploy it, use the following manifest ({registry-url} is the Docker image registry URL, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL where the Livy server will be accessible; the "Route" manifest is used if Red Hat OpenShift is used as the Kubernetes distribution; otherwise, a corresponding Ingress or a Service manifest of type NodePort is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying it and successfully launching the pod, the Livy graphical interface is available at: http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request from, for example, Postman. An example of a collection with requests is shown below (configuration arguments with variables needed for the launched job can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}

Let's execute the first request from the collection, go to the OKD interface and check that the job has launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or the graphical interface, you can track the progress of the job and study the session logs.

Now let's show how Livy works. To do this, let's examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs. From them we can see that when the Livy REST API is called in the container named "livy", a spark-submit is executed, similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that lets you run jobs with the Spark executable hosted remotely, using the Livy server.
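For completeness, the same interaction can be scripted without Postman: a sketch of submitting and polling a batch via Livy's REST batches API with plain cURL (the placeholders are the same as above; the batch id 0 is illustrative and is returned in the response to the POST):

```shell
# Submit the test job (equivalent to request 1 from the collection)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar",
       "className": "org.apache.spark.examples.SparkPi",
       "name": "spark-test-1"}' \
  http://{livy-url}/batches

# Poll the batch state and fetch its logs
curl http://{livy-url}/batches/0/state
curl http://{livy-url}/batches/0/log
```

This is what makes Livy convenient for CI: the runner only needs cURL and network access to the Livy route, not a local Spark installation.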

Third use case: Spark Operator

Now that the job has been tested, the question of running it regularly arises. The native way to regularly run jobs in a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which is also used in Enterprise-level solutions (for example, Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited configuration options for running Spark jobs in Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, but its release date remains unknown. Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into Spark pods) and the ability to run jobs on a regular schedule.

Let's single it out as the third use case: regularly running Spark jobs on a Kubernetes cluster in a production loop.

Spark Operator is open source and developed within Google Cloud Platform: github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in three ways:

  1. As part of the Lightbend FastData Platform/Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator
    	

  3. Using manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). It is worth noting the following: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions must be based on the example tags in Git with the appropriate API version, for example, "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml
    	

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow space for the Cloudflow installation) and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can examine the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run jobs using Spark Operator, you need to do three things:

  • create a Docker image that includes all the necessary libraries, as well as the configuration and executable files. In the target picture, this image is created at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • create a manifest with the "SparkApplication" type and a description of the job to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). There are important points to note about the manifest:
    1. the "apiVersion" key must specify the API version corresponding to the operator version;
    2. the "metadata.namespace" key must specify the namespace in which the application will be launched;
    3. the "spec.image" key must contain the address of the created Docker image in an accessible registry;
    4. the "spec.mainClass" key must contain the Spark task class to be executed when the process starts;
    5. the "spec.mainApplicationFile" key must contain the path to the executable jar file;
    6. the "sparkVersion" key must specify the version of Spark being used;
    7. the "spec.driver.serviceAccount" key must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" key must specify the amount of resources allocated to the application;
    9. the "spec.volumeMounts" key must specify the local directory in which the local Spark task files will be created.

An example of composing a manifest (here {spark-service-account} is the service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account for which, before publishing the manifest, you must create role bindings granting the access rights the Spark application needs to interact with the Kubernetes API (if necessary). In our case, the application needs the rights to create Pods. Let's create the necessary role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}
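For non-OpenShift distributions, the same grant can be expressed declaratively: the `oc adm policy` command above is shorthand for creating an RBAC RoleBinding. A sketch of the equivalent manifest (note that the built-in `edit` ClusterRole grants broader rights than the pod creation strictly required here):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-edit
  namespace: {project}
subjects:
  - kind: ServiceAccount
    name: {spark-service-account}
    namespace: {project}
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

A tighter setup would define a custom Role limited to the pods, services, and configmaps resources instead of reusing `edit`.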

It is also worth noting that the manifest spec can include a "hadoopConfigMap" parameter, which lets you specify a ConfigMap with the Hadoop configuration without first having to place the corresponding file in the Docker image. The operator is also suitable for running tasks on a regular basis: using the "schedule" parameter, a schedule for running the task can be specified.
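Both parameters can be sketched together. A minimal example, assuming the operator's ScheduledSparkApplication kind (part of the same sparkoperator.k8s.io/v1beta1 API) and a pre-created ConfigMap named hadoop-conf holding the Hadoop configuration:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  schedule: "@every 1h"          # cron syntax is also accepted
  concurrencyPolicy: Allow
  template:                      # same structure as a SparkApplication spec
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    hadoopConfigMap: hadoop-conf # ConfigMap with the Hadoop configuration
    restartPolicy:
      type: Never
```

Each scheduled run creates a regular SparkApplication object from the template, so everything described below applies to those runs as well.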

After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This creates an object of type "sparkapplications":

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

In this case, a pod with the application will be created, and its status will be reflected in the created "sparkapplications" object. You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

When the task finishes, the pod will move to the "Completed" status, which will also be updated in "sparkapplications". The application logs can be viewed in a browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running task):

oc logs {sparkapplications-pod-name} -n {project}
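The wait-for-completion step above can also be scripted. A minimal sketch, assuming the operator publishes the application state in the `.status.applicationState.state` field of the SparkApplication object (the function arguments are the application name and namespace, hypothetical values shown in the usage note):

```shell
# Return the current state of a SparkApplication (e.g. RUNNING, COMPLETED, FAILED),
# read from the operator-populated .status.applicationState.state field.
spark_app_state() {
  oc get sparkapplications "$1" -n "$2" \
    -o jsonpath='{.status.applicationState.state}'
}

# Poll until the application reaches a terminal state, then print it.
wait_for_spark_app() {
  while true; do
    state=$(spark_app_state "$1" "$2")
    case "$state" in
      COMPLETED|FAILED) echo "$state"; return ;;
    esac
    sleep 5
  done
}
```

Usage: `wait_for_spark_app spark-pi my-namespace`, where my-namespace stands in for the {project} placeholder used throughout this article.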

Spark tasks can also be managed with the dedicated sparkctl utility. To install it, clone the repository with its source code, install Go, and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's examine the list of running Spark tasks:

sparkctl list -n {project}

Let's create a Spark task description:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's run the described task using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's examine the list of running Spark tasks:

sparkctl list -n {project}

Let's examine the list of events of the launched Spark task:

sparkctl event spark-pi -n {project} -f

Let's examine the status of the running Spark task:

sparkctl status spark-pi -n {project}

In conclusion, I would like to consider the drawbacks discovered while using the current stable version of Spark (2.4.5) in Kubernetes:

  1. The first and perhaps main drawback is the lack of Data Locality. Despite all the shortcomings of YARN, there were advantages to using it, for example the principle of delivering code to the data (rather than data to the code). Thanks to it, Spark tasks were executed on the nodes where the data involved in the computations resided, and the time spent delivering data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a task across the network. If the datasets are large enough, task execution time can increase significantly, and a fairly large amount of disk space must be allocated to the Spark task instances for their temporary storage. This drawback can be mitigated by using specialized software that provides data locality in Kubernetes (for example, Alluxio), but that effectively means having to store a full copy of the data on the nodes of the Kubernetes cluster.
  2. The second important drawback is security. By default, security-related Spark features are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and the security documentation for Spark (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos, and Standalone Cluster as key stores. At the same time, the user under which Spark tasks are launched cannot be specified directly: we only specify the service account under which the task will run, and the user is selected according to the configured security policies. As a result, either the root user is used, which is unsafe in a production environment, or a user with a random UID, which is inconvenient when distributing data access rights (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). For now, the solution is either to place all required files directly in the Docker image, or to modify the Spark launch script to use the secret storage and retrieval mechanisms adopted in your organization.
  3. Running Spark tasks on Kubernetes is still officially in experimental mode, and there may be significant changes to the artifacts used (configuration files, Docker base images, and launch scripts) in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed significantly.
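The PodSecurityPolicy workaround mentioned in point 2 can be sketched as follows: a hypothetical policy that pins Spark pods to a fixed non-root UID. It still has to be bound to the Spark service account via RBAC for it to take effect:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: spark-psp
spec:
  privileged: false
  runAsUser:
    rule: MustRunAs        # force a known UID instead of root or a random one
    ranges:
      - min: 1000
        max: 1000
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - "configMap"
    - "emptyDir"
    - "hostPath"
```

With a fixed UID, file-system permissions for the data accessed by Spark tasks can be granted predictably.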

Let's wait for updates: a new version of Spark (3.0.0) has just been released, bringing substantial changes to the operation of Spark on Kubernetes but retaining the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark tasks on Kubernetes without fearing for the security of your system and without the need to modify functional components on your own.

The end.

Source: www.habr.com
