Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we will talk a little about Apache Spark and its development prospects.

In the modern world of Big Data, Apache Spark is the de facto standard for developing batch data processing jobs. It is also used to create streaming applications that work on the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is questionable for most companies because of the lack of decent Hadoop distributions: the development of HDP has stopped, CDH is developing weakly and costs substantial money, and the remaining Hadoop vendors have either ceased to exist or have a vague future. Therefore, running Apache Spark on Kubernetes is of growing interest to the community and to large companies; having become the standard for container orchestration and resource management in private and public clouds, Kubernetes solves the problem of inconvenient resource scheduling of Spark jobs on YARN and provides a steadily developing platform with many commercial and open distributions for companies of all sizes and stripes. Besides, on the wave of its popularity, most have already managed to acquire a couple of installations of their own and build up expertise in using it, which simplifies the move.

Starting from version 2.3.0, Apache Spark has official support for running jobs on a Kubernetes cluster, and today we will talk about the current maturity of this approach, the various options for its use, and the pitfalls that will be encountered during implementation.

First of all, let's look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which you need to run a job on a Kubernetes cluster. In preparing this post, OpenShift is used as the distribution, and the commands relevant to its command-line utility (oc) will be given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues (for example, for oc adm policy) can be used.

First use case - spark-submit

During the development of jobs and applications, the developer needs to run jobs in order to debug data transformations. In theory, stubs can be used for this purpose, but development involving real (albeit test) instances of the end systems has proven to be faster and better for this class of tasks. When we debug against real instances of the end systems, two scenarios are possible:

  • the developer runs a Spark job locally in standalone mode;

  • the developer runs a Spark job on a Kubernetes cluster in a test loop.

The first option has the right to exist, but it entails a number of disadvantages:

  • each developer must be provided with access from the workplace to all the instances of the end systems they need;
  • a sufficient amount of resources is required on the working machine to run the job being developed.

The second option is free of these disadvantages, since using a Kubernetes cluster allows you to allocate the necessary pool of resources for running jobs and to give it the required access to the instances of the end systems, flexibly granting access to that pool to all members of the development team via the Kubernetes role model. Let's highlight it as the first use case - launching Spark jobs from a local developer machine on a Kubernetes cluster in a test loop.
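
Access to that pool for team members can be granted with the standard Kubernetes/OpenShift role model. A minimal sketch, assuming a developer account {developer-user} and the test project {project} (both placeholders are illustrative):

oc adm policy add-role-to-user edit {developer-user} -n {project}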

Let's talk more about the process of setting up Spark to run locally. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages needed to work with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

The full build takes a lot of time, but to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the "assembly/" directory, so you can build only this subproject:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark jobs on Kubernetes, you need to create a Docker image to use as a base image. There are 2 possible approaches here:

  • the built Docker image includes the executable Spark job code;
  • the built image includes only Spark and the necessary dependencies, and the executable code is hosted remotely (for example, in HDFS).

First, let's build a Docker image containing a test example of a Spark job. To create Docker images, Spark ships with a utility called "docker-image-tool". Let's study its help:

./bin/docker-image-tool.sh --help

With its help, you can create Docker images and upload them to remote registries, but by default it has a number of drawbacks:

  • it unconditionally creates 3 Docker images at once - for Spark, PySpark and R;
  • it does not allow you to specify an image name.

Therefore, we will use a modified version of this utility, given below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With it, we build a base Spark image containing a test job for computing Pi using Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level separation of images is used, for example, as in the integrated image registry of Red Hat OpenShift), {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Get the current user's token for authorization in the Docker Registry:

oc whoami -t

Log in to the internal Docker Registry of the OKD cluster (we use the token obtained with the previous command as the password):

docker login {docker-registry-url}

Upload the built Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let's check that the built image is available in OKD. To do this, open in the browser the URL with the list of images of the corresponding project (here {project} is the name of the project inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift Web console) - https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
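
If the internal OpenShift registry is used, the same check can also be done from the CLI; a sketch, assuming the image was pushed into the {project} namespace:

oc get imagestreams -n {project}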

To run jobs, a service account must be created with privileges to run pods as root (we will discuss this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Run spark-submit to publish the Spark job to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Here:

--name - the name of the job that will participate in forming the names of the Kubernetes pods;

--class - the class of the executable file, called when the job starts;

--conf - Spark configuration parameters;

spark.executor.instances - the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName - the name of the Kubernetes service account used when launching the pods (to define the security context and capabilities when interacting with the Kubernetes API);

spark.kubernetes.namespace - the Kubernetes namespace in which the driver and executor pods will be launched;

spark.submit.deployMode - the Spark launch method (for standard spark-submit "cluster" is used, for the Spark Operator and later versions of Spark "client");

spark.kubernetes.container.image - the Docker image used to launch the pods;

spark.master - the Kubernetes API URL (the external address is specified so that the launch happens from the local machine);

local:// - the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods - https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
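
The same can be observed from the command line; a sketch, assuming the driver pod keeps the default naming ending in "-driver" ({spark-driver-pod-name} is an illustrative placeholder):

oc get pods -n {project}
oc logs -f {spark-driver-pod-name} -n {project}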

To simplify the development process, another option can be used, in which a common base Spark image is created and used by all jobs to run, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified when calling spark-submit as a link. In this case, you can run different versions of Spark jobs without rebuilding the Docker images, using, for example, WebHDFS to publish the executables. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired path to the file on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"

You will receive a response like this (here {location} is the URL that needs to be used to upload the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can do spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to complete the job):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script - add a directive to the Dockerfile for copying the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration file in SPARK_CLASSPATH in entrypoint.sh.
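
A minimal sketch of such changes, assuming the Hadoop client configuration is kept in a local hadoop-conf/ directory and the extra jars in hdfs-libs/ next to the Dockerfile (both directory names are illustrative):

# Dockerfile: copy the dependent libraries and the Hadoop client configuration into the image
COPY hdfs-libs/*.jar /opt/spark/jars/
COPY hadoop-conf /opt/spark/hadoop-conf

# entrypoint.sh: make the HDFS configuration visible to Spark
export SPARK_CLASSPATH="/opt/spark/hadoop-conf:$SPARK_CLASSPATH"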

Second use case - Apache Livy

Further, when a job is developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Of course, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With its help, you can run Spark jobs on a Kubernetes cluster using regular cURL requests, which is easy to implement on top of any CI solution, and its placement inside the Kubernetes cluster solves the issue of authentication when interacting with the Kubernetes API.

Let's highlight it as the second use case - running Spark jobs as part of a CI/CD process on a Kubernetes cluster in a test loop.

A little about Apache Livy - it works as an HTTP server providing a Web interface and a RESTful API that allows you to remotely launch spark-submit by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, for example this one - github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image including Spark version 2.4.5 was built from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The resulting image can be built and uploaded to your existing Docker registry, such as the internal OKD registry. To deploy it, use the following manifest ({registry-url} is the Docker image registry URL, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be accessible; the "Route" manifest is used if Red Hat OpenShift is used as the Kubernetes distribution, otherwise a corresponding Ingress or Service manifest of type NodePort is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying it and successfully launching the pod, the Livy graphical interface is available at the link: http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request, for example from Postman. An example collection of requests is presented below (configuration arguments with the variables needed for the launched job can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}
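
In a CI/CD pipeline the same request does not require Postman; a plain cURL call is enough. A minimal sketch of the first request from the collection, assuming the same placeholders as above:

curl -X POST -H "Content-Type: application/json" http://{livy-url}/batches -d '{
  "file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "numExecutors": 1,
  "name": "spark-test-1",
  "conf": {
    "spark.jars.ivy": "/tmp/.ivy",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.kubernetes.namespace": "{project}",
    "spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"
  }
}'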

Let's execute the first request from the collection, go to the OKD interface and check that the job has been launched successfully - https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or the graphical interface, you can track the progress of the job and study the session logs.
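
Besides the graphical interface, the batch status and logs can also be polled directly via the Livy REST API; a sketch, assuming {batch-id} is the id returned by the submit request:

curl http://{livy-url}/batches/{batch-id}/state
curl http://{livy-url}/batches/{batch-id}/log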

Now let's show how Livy works. To do this, let's examine the logs of the Livy container inside the pod with the Livy server - https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=log. From them we can see that when the Livy REST API is called, a spark-submit is executed in a container named "livy", similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that allows you to run jobs with the Spark executable hosted remotely, using a Livy server.

Third use case - Spark Operator

Now that the job has been tested, the question of running it regularly arises. The native way to run jobs regularly on a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators for managing applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which is also used in enterprise-level solutions (for example, Lightbend FastData Platform). We recommend using it - the current stable version of Spark (2.4.5) has rather limited configuration options for running Spark jobs on Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, but its release date remains unknown. The Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into Spark pods) and the ability to run a job on a regular schedule.

Let's highlight it as the third use case - regularly running Spark jobs on a Kubernetes cluster in a production loop.

The Spark Operator is open source and developed within the Google Cloud Platform - github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in 3 ways:

  1. As part of the Lightbend FastData Platform / Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator

  3. Using manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). It is worth noting the following: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the corresponding API version, for example "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow namespace for the Cloudflow installation), and a Kubernetes resource type named "sparkapplications" will appear. You can explore the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run jobs using the Spark Operator, you need to do 3 things:

  • create a Docker image that includes all the necessary libraries, as well as the configuration and executable files. In the target picture, this is an image built at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • generate a manifest of type "SparkApplication" with a description of the job to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). Important points to note about the manifest:
    1. the "apiVersion" field must indicate the API version corresponding to the operator version;
    2. the "metadata.namespace" field must indicate the namespace in which the application will be launched;
    3. the "spec.image" field must contain the address of the built Docker image in an accessible registry;
    4. the "spec.mainClass" field must contain the Spark job class to be run when the process starts;
    5. the path to the executable jar file must be specified in the "spec.mainApplicationFile" field;
    6. the "spec.sparkVersion" field must indicate the version of Spark used;
    7. the "spec.driver.serviceAccount" field must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" field must indicate the amount of resources allocated to the application;
    9. the "spec.volumeMounts" field must specify the local directory in which the local Spark job files will be created.

An example of such a manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

The manifest specifies a service account for which, before publishing the manifest, you need to create the role bindings that provide the access rights required for the Spark application to interact with the Kubernetes API (if needed). In our case, the application needs the rights to create Pods. Let's create the necessary role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}

It is also worth noting that this manifest specification can include a "hadoopConfigMap" parameter, which allows you to specify a ConfigMap with the Hadoop configuration without having to first place the corresponding file in the Docker image. It is also suitable for running jobs regularly - using the "schedule" parameter, a schedule for running the given job can be specified.
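
A minimal sketch of such a scheduled launch, assuming it is expressed through the operator's ScheduledSparkApplication resource and reuses the application spec shown above (the cron expression is illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  schedule: "0 1 * * *"
  concurrencyPolicy: Forbid
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"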

After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This will create an object of type "sparkapplications":

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

In this case, a pod with the application will be created, and its status will be displayed in the created "sparkapplications". You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

Upon completion of the job, the pod will move to the "Completed" status, which will also be updated in "sparkapplications". Application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running job):

oc logs {sparkapplications-pod-name} -n {project}

Spark jobs can also be managed using the specialized sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's examine the list of running Spark jobs:

sparkctl list -n {project}

Let's create a description of a Spark job:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's run the described job using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's examine the list of running Spark jobs:

sparkctl list -n {project}

Let's examine the list of events of the launched Spark job:

sparkctl event spark-pi -n {project} -f

Let's examine the status of the running Spark job:

sparkctl status spark-pi -n {project}

In conclusion, I would like to consider the disadvantages discovered while using the current stable version of Spark (2.4.5) on Kubernetes:

  1. The first and perhaps main disadvantage is the lack of data locality. Despite all the shortcomings of YARN, there were also advantages to using it, for example, the principle of delivering code to the data (rather than data to the code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the calculations was located, and the time it took to transfer data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a job across the network. If the data is large enough, the job execution time can increase significantly, and it also requires a rather large amount of disk space allocated to the Spark job instances for its temporary storage. This drawback can be reduced by using specialized software that ensures data locality in Kubernetes (for example, Alluxio), but this effectively means the need to store a complete copy of the data on the nodes of the Kubernetes cluster.
  2. The second important disadvantage is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options appeared in version 3.0.0, which will require additional work), and the Spark security documentation (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as key stores. At the same time, the user under which Spark jobs are launched cannot be specified directly - we only define the service account under which it will work, and the user is selected based on the configured security policies. In this regard, either the root user is used, which is not safe in a productive environment, or a user with a random UID, which is inconvenient when distributing data access rights (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). At the moment, the solution is either to place all the necessary files directly in the Docker image, or to modify the Spark launch script to use the mechanism for storing and retrieving secrets adopted in your organization.
  3. Running Spark jobs on Kubernetes is still officially in experimental mode, and there may be significant changes in the artifacts used (configuration files, Docker base images, and launch scripts) in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed significantly.

Let's wait for updates - a new version of Spark (3.0.0) was recently released, which brought significant changes to the operation of Spark on Kubernetes, but retained the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark jobs on Kubernetes without fear for the security of your system and without the need to independently modify its functional components.

Fin

Source: www.habr.com
