Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we will talk a little about Apache Spark and its development prospects.

In the modern world of Big Data, Apache Spark is the de facto standard for developing batch data processing jobs. In addition, it is also used to build streaming applications that work on the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is questionable for most companies due to the lack of decent Hadoop distributions: the development of HDP and CDH has stopped, CDH is not well developed and is expensive, and the remaining Hadoop vendors have either ceased to exist or have a bleak future. Therefore, the community and large companies are increasingly interested in launching Apache Spark with Kubernetes: having become the standard in container orchestration and resource management in private and public clouds, Kubernetes solves the problem of the inconvenient scheduling of Spark jobs on YARN and provides a steadily developing platform with many commercial and open-source distributions for companies of all sizes and stripes. On top of that, on the wave of its popularity, most teams have already managed to acquire a couple of installations of their own and build up expertise in using it, which simplifies the move.

Starting with version 2.3.0, Apache Spark acquired official support for running jobs in a Kubernetes cluster, and today we will talk about the current maturity of this approach, the different options for its use and the pitfalls awaiting you during implementation.

First of all, let's look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which you need to run a job on a Kubernetes cluster. In preparing this post, OpenShift is used as the distribution, and the commands relevant to its command-line utility (oc) will be given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues (for example, for oc adm policy) can be used.
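
For example, listing the pods of a project looks practically identical in both utilities (here and below {project} is the name of the project/namespace):

oc get pods -n {project}
kubectl get pods -n {project}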

First use case - spark-submit

While jobs and applications are being developed, the developer needs to run jobs to debug data transformations. In theory, stubs could be used for this, but development involving real (albeit test) instances of the end systems has proven itself faster and better in this class of tasks. When debugging against real instances of the end systems, two scenarios are possible:

  • the developer runs the Spark job locally in standalone mode;

  • the developer runs the Spark job on a Kubernetes cluster in a test loop.

The first option has a right to exist, but entails a number of drawbacks:

  • each developer must be granted access from the workplace to all the instances of the end systems he needs;
  • a sufficient amount of resources is required on the working machine to run the job being developed.

The second option is free of these drawbacks, since using a Kubernetes cluster allows you to allocate the necessary pool of resources for running jobs and provide it with the required access to the end-system instances, flexibly granting access to it to all members of the development team using the Kubernetes role model. Let's highlight this as the first use case: launching Spark jobs from a local developer machine on a Kubernetes cluster in a test loop.

Let's describe in more detail the process of setting up Spark for local work. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages needed for working with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

A full build takes a lot of time, while to create Docker images and run them on a Kubernetes cluster you only need the jar files from the "assembly/" directory, so you can build just that subproject:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package
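
To verify the result of this partial build, you can list the jar files that the Docker images will need; they end up in the assembly target directory (scala-2.11 is the default for Spark 2.4.5, as the examples below also assume):

ls assembly/target/scala-2.11/jars | head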

To run Spark on Kubernetes, you need to create a Docker image to use as the base image. Two approaches are possible here:

  • the built Docker image includes the executable Spark job code;
  • the built image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let's build a Docker image containing a test example of a Spark job. To build Docker images, Spark ships a utility called "docker-image-tool". Let's read its help:

./bin/docker-image-tool.sh --help

With its help, you can build Docker images and upload them to remote registries, but by default it has a number of drawbacks:

  • it necessarily builds 3 Docker images at once: for Spark, PySpark and R;
  • it does not allow specifying an image name.

Therefore, we will use a modified version of this utility, shown below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With its help, we build a base Spark image containing a test job for calculating Pi with Spark (here {docker-registry-url} is the URL of the Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level separation of image names is used, as, for example, in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster with its utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Let's get the current user's token for authorization in the Docker Registry:

oc whoami -t

Log in to the internal Docker Registry of the OKD cluster (we use the token obtained with the previous command as the password):

docker login {docker-registry-url}
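
For reference, the two previous steps are often collapsed into a single line; a minimal sketch, assuming the registry accepts the OpenShift token as the password:

docker login -u $(oc whoami) -p $(oc whoami -t) {docker-registry-url}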

Upload the built Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let's make sure the built image is available in OKD. To do this, open the page with the list of images of the corresponding project in the browser (here {project} is the name of the project inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift Web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.

To run jobs, a service account must be created with privileges to run pods as root (we will return to this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let's run spark-submit to publish a Spark job to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Where:

--name is the name of the job that will participate in forming the names of the Kubernetes pods;

--class is the class of the executable file, which is called when the job starts;

--conf are Spark configuration parameters;

spark.executor.instances is the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName is the name of the Kubernetes service account used when launching pods (to define the security context and the capabilities when interacting with the Kubernetes API);

spark.kubernetes.namespace is the Kubernetes namespace in which the driver and executor pods will be launched;

spark.submit.deployMode is the Spark launch mode ("cluster" is used for the standard spark-submit, "client" for Spark Operator and later versions of Spark);

spark.kubernetes.container.image is the Docker image used to launch the pods;

spark.master is the Kubernetes API URL (the external one is specified, so the launch happens from the local machine);

local:// is the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
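
The same check can be done from the command line: Spark labels the pods it creates, so the driver pod and its logs can be found roughly like this (here {driver-pod-name} is the pod name returned by the first command):

oc get pods -n {project} -l spark-role=driver
oc logs -f {driver-pod-name} -n {project}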

To simplify the development process, another option can be used, in which a common base image of Spark is built, used by all jobs to run, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified in the spark-submit call as a link. In this case, you can run different versions of Spark jobs without rebuilding the Docker images, using, for example, WebHDFS to publish the files. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired path to the file on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"

You will receive a response like this (here {location} is the URL that must be used to upload the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can do a spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to complete the job):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile for copying the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration directory in SPARK_CLASSPATH in the entrypoint, as sketched below.
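
A minimal sketch of those changes, assuming the Hadoop client configuration is staged under hadoop-conf/ in the Docker build context (the paths and file names here are illustrative):

# Dockerfile: copy the dependency jars and the Hadoop configuration into the image
COPY jars/* /opt/spark/jars/
COPY hadoop-conf /opt/spark/hadoop-conf

# entrypoint.sh: make the HDFS configuration visible to Spark
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf"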

Second use case - Apache Livy

Further, when a job is developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Of course, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation has chosen Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With its help, you can run Spark jobs on a Kubernetes cluster using regular cURL requests, which is easily implemented on any CI solution, and its placement inside the Kubernetes cluster solves the issue of authentication when interacting with the Kubernetes API.
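
As a preview of what this looks like from a CI runner, here is a cURL call to Livy's batches endpoint that mirrors the spark-submit example used earlier ({livy-url} is introduced below; the request body carries the same parameters that spark-submit receives):

curl -X POST -H "Content-Type: application/json" --data '{"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", "className": "org.apache.spark.examples.SparkPi"}' http://{livy-url}/batches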

Let's highlight this as the second use case: running Spark jobs as part of a CI/CD process on a Kubernetes cluster in a test loop.

A little about Apache Livy: it works as an HTTP server providing a Web interface and a RESTful API that allows you to remotely launch spark-submit by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one: github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image including Spark version 2.4.5 was built from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The built image can be pushed to your existing Docker repository, for example the internal OKD registry. To deploy it, use the following manifest (here {registry-url} is the URL of the Docker image registry, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be accessible; the "Route" manifest is used if Red Hat OpenShift is the Kubernetes distribution, otherwise a corresponding Ingress or a Service manifest of the NodePort type is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying it and successfully launching the pod, the Livy graphical interface is available at http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request from, for example, Postman. An example collection with requests is given below (configuration arguments with the variables required by the launched job can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}

Let's execute the first request from the collection, go to the OKD interface and check that the job has launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or the graphical interface, you can track the progress of the job and study the session logs.

Now let's show how Livy works. To do this, let's examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs (here {livy-pod-name} is the name of the created pod with the Livy server). From them you can see that calling the Livy REST API in the container named "livy" executes a spark-submit similar to the one we used above. The collection also provides a second request for running jobs whose Spark executable is hosted remotely, using the Livy server.
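
Since Livy assigns an id to each submitted batch, a CI process can poll the job state and fetch the driver log through the same REST API (batch id 0 here is just an example):

curl http://{livy-url}/batches/0/state
curl http://{livy-url}/batches/0/log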

Third use case - Spark Operator

Now that the job has been tested, the question arises of running it on a schedule. The native way to run jobs regularly in a Kubernetes cluster is the CronJob resource, and you can use it, but at the moment using operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which, among other things, is used in Enterprise-level solutions (for example, Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited options for configuring Spark jobs on Kubernetes, while the next major version (3.0.0) declares full Kubernetes support, but its release date remains unknown. Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into the Spark pods) and the ability to run jobs on a regular schedule.

Let's highlight this as the third use case: running Spark jobs on a schedule on a Kubernetes cluster in a production loop.

Spark Operator is open source and is developed within the Google Cloud Platform: github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in three ways:

  1. As part of the Lightbend FastData Platform/Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator

  3. Using manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). The following is worth noting: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifests should be based on the example tags in Git with the appropriate API version, for example, "v1beta1-0.9.0-2.4.0". The operator's API version can be found in the description of the CRD that ships with the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow space for the Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can examine the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run jobs with Spark Operator, you need to do three things:

  • create a Docker image that includes all the necessary libraries, as well as the configuration and executable files. In the target picture, this is an image built at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • create a manifest of the "SparkApplication" type with a description of the job to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). Important points to note about the manifest:
    1. the "apiVersion" key must indicate the API version corresponding to the operator version;
    2. the "metadata.namespace" key must indicate the namespace in which the application will be launched;
    3. the "spec.image" key must contain the address of the built Docker image in an accessible registry;
    4. the "spec.mainClass" key must contain the Spark job class to run when the process starts;
    5. the "spec.mainApplicationFile" key must contain the path to the executable jar file;
    6. the "sparkVersion" key must indicate the version of Spark being used;
    7. the "spec.driver.serviceAccount" key must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" key must indicate the amount of resources allocated to the application;
    9. the "spec.volumeMounts" key must specify the local directory in which the local Spark job files will be created.

An example manifest (here {spark-service-account} is the service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account for which, before publishing the manifest, you need to create the role bindings that grant the access rights the Spark application needs to interact with the Kubernetes API (if necessary). In our case, the application needs the rights to create Pods. Let's create the required role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}

It is also worth noting that this manifest specification may include a "hadoopConfigMap" parameter, which allows you to specify a ConfigMap with the Hadoop configuration without first having to place the corresponding file in the Docker image. It is also suitable for running jobs on a regular basis: using the "schedule" parameter, the schedule for running a given job can be described; a sketch of that follows.
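
A minimal sketch of such a scheduled launch, assuming the installed operator version ships the ScheduledSparkApplication resource (the schedule expression and names here are illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  schedule: "@every 1h"
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"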

After that, we save our SparkApplication manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This creates an object of the "sparkapplications" type:

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

A pod with the application will be created, and its status will be displayed in the created "sparkapplications" object. You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

Upon completion of the job, the POD moves to the "Completed" status, which is also updated in "sparkapplications". The application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the job's pod):

oc logs {sparkapplications-pod-name} -n {project}

Spark jobs can also be managed with the specialized sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's check the list of Spark jobs:

sparkctl list -n {project}

Let's create a description of a Spark job:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's launch the described job using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's check the list of Spark jobs:

sparkctl list -n {project}

Let's check the list of events of the launched Spark job:

sparkctl event spark-pi -n {project} -f

Let's check the status of the running Spark job:

sparkctl status spark-pi -n {project}
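
sparkctl also covers the rest of the application lifecycle; for example, streaming the driver logs and deleting the finished application (subcommand names as documented in the operator repository):

sparkctl log spark-pi -n {project}
sparkctl delete spark-pi -n {project}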

In conclusion, I would like to note the disadvantages discovered while operating the current stable version of Spark (2.4.5) on Kubernetes:

  1. The first and, perhaps, main disadvantage is the lack of Data Locality. For all the shortcomings of YARN, there were also advantages to using it, such as the principle of delivering code to data (rather than data to code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the computations lived, and the time spent delivering data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a job across the network. If it is large enough, the job execution time can grow significantly, and a fairly large amount of disk space is required on the Spark executor instances for temporary storage. This disadvantage can be mitigated by using specialized software that provides data locality in Kubernetes (for example, Alluxio), but that effectively means having to store a complete copy of the data on the nodes of the Kubernetes cluster.
  2. The second important disadvantage is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and the Spark security documentation (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as target environments. At the same time, the user under which a Spark job is launched cannot be specified directly: we only specify the service account under which it will work, and the user is selected based on the configured security policies. As a result, either the root user is used, which is unsafe in a productive environment, or a user with a random UID, which is inconvenient when distributing access rights to data (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts; see the sketch after this list). At the moment, the workaround is either to place all the necessary files directly in the Docker image, or to modify the Spark launch script so that it uses the mechanism for storing and retrieving secrets adopted in your organization.
  3. Running Spark jobs on Kubernetes is still officially in experimental mode, and significant changes in the artifacts used (configuration files, base Docker images and launch scripts) are possible in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed considerably.
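
To illustrate the mitigation mentioned in point 2: on OpenShift, the "spark" service account can be moved from the permissive anyuid SCC used earlier to a policy that enforces a non-root UID; a sketch, assuming the cluster's built-in nonroot SCC is acceptable for your jobs:

oc adm policy add-scc-to-user nonroot -z spark -n {project}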

Let's wait for updates: a new version of Spark (3.0.0) was recently released, bringing noticeable changes to the work of Spark on Kubernetes, but it kept the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark jobs on Kubernetes without fearing for the security of your system and without the need to independently modify functional components.

The end.

Source: www.habr.com
