Running Apache Spark on Kubernetes

Greetings, dear readers. Today we will talk a little about Apache Spark and its development prospects.


In the modern Big Data world, Apache Spark is the de facto standard for developing batch data processing tasks. It is also used to build streaming applications that work in the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally, it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is questionable for most companies because of the lack of decent Hadoop distributions: the development of HDP has stopped, CDH is evolving slowly and costs serious money, and the remaining Hadoop vendors have either ceased to exist or face a murky future. Therefore, running Apache Spark on Kubernetes attracts growing interest from the community and from large companies: having become the standard for container orchestration and resource management in private and public clouds, Kubernetes solves the problem of inconvenient resource scheduling for Spark tasks on YARN and provides a steadily developing platform with many commercial and open distributions for companies of every size and stripe. Besides, riding the wave of its popularity, most companies have already managed to acquire a couple of installations of their own and build up expertise in using them, which simplifies the move.

Starting with version 2.3.0, Apache Spark received official support for running tasks in a Kubernetes cluster, and today we will talk about the maturity of this approach, the various options for its use, and the pitfalls encountered during implementation.

First of all, let us look at the process of developing tasks and applications based on Apache Spark and highlight the typical cases in which you need to run a task on a Kubernetes cluster. In preparing this post, OpenShift was used as the distribution, and the commands relevant to its command-line tool (oc) are given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line tool (kubectl) or their analogues (for example, for oc adm policy) can be used.

First use case: spark-submit

While tasks and applications are being developed, the developer needs to run tasks to debug the data transformation. In theory, stubs can be used for these purposes, but development involving real (albeit test) instances of the end systems has proven to be faster and better in this class of tasks. When we debug against real instances of the end systems, two scenarios are possible:


  • the developer runs a Spark task locally in standalone mode;


  • the developer runs a Spark task on a Kubernetes cluster in a test loop.


The first option has a right to exist but entails a number of drawbacks:

  • each developer must be given access from the workplace to all the instances of the end systems they need;
  • a sufficient amount of resources is required on the working machine to run the task being developed.

The second option is free of these drawbacks, since using a Kubernetes cluster lets you allocate the necessary pool of resources for running tasks and grant it the required access to the end-system instances, flexibly providing access to it via the Kubernetes role model for all members of the development team. Let us highlight it as the first use case: launching Spark tasks from a local developer machine on a Kubernetes cluster in a test loop.

Let us talk more about the process of setting up Spark to run locally. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages needed for working with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

The full build takes a long time, and to create Docker images and run them on a Kubernetes cluster you actually only need the jar files from the "assembly/" directory, so only this subproject can be built:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark tasks on Kubernetes, you need to create a Docker image that will serve as the base image. Two approaches are possible here:

  • the Docker image includes the executable Spark task code;
  • the image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let us build a Docker image containing a test example of a Spark task. Spark ships with a utility for building Docker images called "docker-image-tool". Let us study its help:

./bin/docker-image-tool.sh --help

With its help you can build Docker images and upload them to remote registries, but by default it has several drawbacks:

  • it unavoidably creates 3 Docker images at once: for Spark, PySpark and R;
  • it does not allow you to specify an image name.

Therefore, we will use a modified version of this utility, shown below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

With its help, we build a base Spark image containing a test task for computing Pi with Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level separation of images is used, for example, as in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Get the current user's token for authorization in the Docker Registry:

oc whoami -t

Log in to the internal Docker Registry of the OKD cluster (using the token obtained with the previous command as the password):

docker login {docker-registry-url}

Upload the assembled Docker image to the OKD Docker Registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let us check that the assembled image is available in OKD. To do this, open in the browser the URL with the list of images of the corresponding project (here {project} is the project name inside the OpenShift cluster and {OKD-WEBUI-URL} is the URL of the OpenShift Web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
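The same check can be done from the command line; a sketch, assuming the oc session from the login step above (oc get is lists the project's image streams and their tags):

```shell
# Confirm the pushed image stream is present in the project
oc get is {image-name} -n {project}
```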

To run tasks, you need to create a service account with the privileges to run pods as root (we will discuss this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let us run the spark-submit command to publish a Spark task to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Here:

--name: the name of the task, which will participate in forming the names of the Kubernetes pods;

--class: the class of the executable file, called when the task starts;

--conf: Spark configuration parameters;

spark.executor.instances: the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName: the name of the Kubernetes service account used when launching pods (to define the security context and capabilities when interacting with the Kubernetes API);

spark.kubernetes.namespace: the Kubernetes namespace in which the driver and executor pods will be launched;

spark.submit.deployMode: the Spark launch method ("cluster" is used for standard spark-submit, "client" for Spark Operator and later versions of Spark);

spark.kubernetes.container.image: the Docker image used to launch the pods;

spark.master: the Kubernetes API URL (the external address is specified so that the launch can happen from the local machine);

local://: the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
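The created pods can also be inspected from the command line; a sketch (Spark on Kubernetes labels the pods it creates with spark-role=driver and spark-role=executor, so the selector below should match the driver pod):

```shell
# Show the driver pod created for the task
oc get pods -n {project} -l spark-role=driver
```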

To simplify the development process, yet another option can be used: a common base Spark image is created and used by all tasks to run, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified when calling spark-submit as a link. In this case you can run different versions of Spark tasks without rebuilding the Docker images, using, for example, WebHDFS to publish the executables. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is its port, and {path-to-file-on-hdfs} is the desired path to the file on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"

You will receive a response of this kind (here {location} is the URL to be used for uploading the file):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can do a spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that has to be launched to complete the task):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the task work, you may need to change the Dockerfile and the entrypoint.sh script: add to the Dockerfile a directive copying the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration file in SPARK_CLASSPATH in entrypoint.sh.
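A minimal sketch of such a change, assuming the Hadoop client jars and configuration have been placed next to the Dockerfile (the hadoop-deps/ and hadoop-conf/ source paths here are hypothetical and depend on your layout):

```dockerfile
# Copy HDFS client libraries into the directory Spark scans for jars
COPY hadoop-deps/*.jar /opt/spark/jars/
# Ship the Hadoop configuration (core-site.xml, hdfs-site.xml) into the image
COPY hadoop-conf/ /opt/spark/hadoop-conf/
```

and in entrypoint.sh, before the launch command, something along the lines of export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf".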

Second use case: Apache Livy

Further, when a task has been developed and the result tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Naturally, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation has chosen Apache Livy as a REST API for running Spark tasks, hosted inside a Kubernetes cluster. With it, you can launch Spark tasks on a Kubernetes cluster using regular cURL requests, which is easy to implement on any CI solution, and its placement inside the Kubernetes cluster solves the question of authentication when interacting with the Kubernetes API.


Let us highlight it as the second use case: running Spark tasks as part of a CI/CD process on a Kubernetes cluster in a test loop.

A little about Apache Livy: it is an HTTP server providing a Web interface and a RESTful API that let you launch spark-submit remotely by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one: github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a Docker image including Spark version 2.4.5 was built from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The created image can be built and uploaded to your existing Docker repository, such as the internal OKD repository. To deploy it, use the following manifest ({registry-url} is the URL of the Docker image registry, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be accessible; the "Route" manifest is used when Red Hat OpenShift is the Kubernetes distribution, otherwise a corresponding Ingress or Service manifest of type NodePort is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying it and successfully launching the pod, the Livy graphical interface is available at http://{livy-url}/ui. With Livy, we can publish our Spark task using a REST request from, for example, Postman. An example of a collection with requests is presented below (configuration arguments with the variables needed for the launched task to work can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{\n\t\"file\": \"local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar\", \n\t\"className\": \"org.apache.spark.examples.SparkPi\",\n\t\"numExecutors\":1,\n\t\"name\": \"spark-test-1\",\n\t\"conf\": {\n\t\t\"spark.jars.ivy\": \"/tmp/.ivy\",\n\t\t\"spark.kubernetes.authenticate.driver.serviceAccountName\": \"spark\",\n\t\t\"spark.kubernetes.namespace\": \"{project}\",\n\t\t\"spark.kubernetes.container.image\": \"{docker-registry-url}/{repo}/{image-name}:{tag}\"\n\t}\n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{\n\t\"file\": \"hdfs://{host}:{port}/{path-to-file-on-hdfs}\", \n\t\"className\": \"{class-name}\",\n\t\"numExecutors\":1,\n\t\"name\": \"spark-test-2\",\n\t\"proxyUser\": \"0\",\n\t\"conf\": {\n\t\t\"spark.jars.ivy\": \"/tmp/.ivy\",\n\t\t\"spark.kubernetes.authenticate.driver.serviceAccountName\": \"spark\",\n\t\t\"spark.kubernetes.namespace\": \"{project}\",\n\t\t\"spark.kubernetes.container.image\": \"{docker-registry-url}/{repo}/{image-name}:{tag}\"\n\t},\n\t\"args\": [\n\t\t\"HADOOP_CONF_DIR=/opt/spark/hadoop-conf\",\n\t\t\"MASTER=k8s://https://kubernetes.default.svc:8443\"\n\t]\n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}

Having executed the first request from the collection, go to the OKD interface and check that the task launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or the graphical interface, you can track the task's progress and study the session logs.
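For example, the state and logs of a submitted batch can be polled through the same REST API; a sketch (here {batch-id} is the id returned in the response to the submission request; GET /batches/{batch-id}/state and /log are part of the Livy REST API):

```shell
# Poll the state of a previously submitted batch
curl "http://{livy-url}/batches/{batch-id}/state"

# Fetch the driver log lines collected by Livy
curl "http://{livy-url}/batches/{batch-id}/log"
```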

Now let us show how Livy works. To do this, examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs. From them we can see that when the Livy REST API is called, a spark-submit is executed in the container named "livy", similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that runs tasks whose Spark executable is hosted remotely, using the Livy server.

Third use case: Spark Operator

Now that the task has been tested, the question arises of running it regularly. The native way to run tasks regularly in a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators for managing applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which, among other things, is used in Enterprise-level solutions (for example, Lightbend FastData Platform). We recommend using it: the stable version of Spark (2.4.5) has rather limited configuration options for running Spark tasks in Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, but its release date remains unknown. Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into Spark pods) and the ability to run a task on a regular schedule.

Let us highlight it as the third use case: regularly running Spark tasks on a Kubernetes cluster in a production loop.

Spark Operator is an open-source project developed within the Google Cloud Platform: github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in three ways:

  1. As part of the Lightbend FastData Platform/Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator
    	

  3. Using the manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). The following is worth noting: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the appropriate API version, for example "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml
    	

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow namespace for the Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can explore the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run tasks using Spark Operator, you need to do three things:

  • create a Docker image including all the necessary libraries as well as the configuration and executable files. In the target picture, this is an image created at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • create a manifest of type "SparkApplication" with a description of the task to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). There are important points to note about the manifest:
    1. the "apiVersion" dictionary must indicate the API version corresponding to the operator version;
    2. the "metadata.namespace" dictionary must indicate the namespace in which the application will be launched;
    3. the "spec.image" dictionary must contain the address of the created Docker image in an accessible registry;
    4. the "spec.mainClass" dictionary must contain the Spark task class to be run when the process starts;
    5. the "spec.mainApplicationFile" dictionary must contain the path to the executable jar file;
    6. the "spec.sparkVersion" dictionary must indicate the version of Spark in use;
    7. the "spec.driver.serviceAccount" dictionary must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" dictionary must indicate the amount of resources allocated to the application;
    9. the "spec.volumeMounts" dictionary must specify the local directory in which the local Spark task files will be created.

An example manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark tasks):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account; before publishing the manifest, you need to create the role bindings that grant the Spark application the access rights it needs to interact with the Kubernetes API (if necessary). In our case, the application needs the right to create Pods. Let's create the required role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}
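For clusters managed without the oc tooling, a roughly equivalent plain-Kubernetes RBAC manifest can be applied with kubectl. This is a sketch under the assumption that the built-in "edit" ClusterRole grants sufficient rights; the binding name is hypothetical and the brace placeholders are the same as above:

# Roughly what "oc adm policy add-role-to-user edit ..." creates:
# a RoleBinding granting the built-in "edit" ClusterRole to the
# service account within the project namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-edit
  namespace: {project}
subjects:
  - kind: ServiceAccount
    name: {spark-service-account}
    namespace: {project}
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io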

It is also worth noting that the spec of this manifest can include a "hadoopConfigMap" parameter, which lets you point to a ConfigMap holding the Hadoop configuration without first having to place the corresponding file in the Docker image. The operator is also suitable for running tasks on a regular basis - using the "schedule" parameter, a schedule for running a given task can be specified.
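As a sketch of both options: in the operator's v1beta1 API, scheduled runs are expressed with the separate ScheduledSparkApplication kind, whose "schedule" field takes a cron-style expression and whose "template" holds an ordinary application spec. The names and the {hadoop-conf} ConfigMap here are placeholders:

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  # cron-style schedule for the regular run
  schedule: "@every 30m"
  concurrencyPolicy: Allow
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    # Hadoop configuration taken from a ConfigMap instead of the image
    hadoopConfigMap: {hadoop-conf}
    restartPolicy:
      type: Never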

After that, we save our manifest to the file spark-pi.yaml and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

An object of type "sparkapplications" will be created:

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

In this case, a pod with the application will be created, and its status will be reflected in the created "sparkapplications" object. You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

Upon completion of the task, the pod will move to the "Completed" status, and the "sparkapplications" object will be updated accordingly. Application logs can be viewed in a browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running task):

oc logs {sparkapplications-pod-name} -n {project}

Spark tasks can also be managed using the dedicated sparkctl utility. To install it, clone the repository with its source code, install Go, and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's check the list of running Spark tasks:

sparkctl list -n {project}

Let's create a description for a Spark task:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's run the described task using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's check the list of running Spark tasks:

sparkctl list -n {project}

Let's check the list of events of the launched Spark task:

sparkctl event spark-pi -n {project} -f

Let's check the status of the running Spark task:

sparkctl status spark-pi -n {project}
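Assuming the upstream build described above, sparkctl also provides log and delete subcommands, which are convenient for fetching driver logs and cleaning up a finished application:

# Fetch the driver logs of the application (add -f to follow them)
sparkctl log spark-pi -n {project}

# Delete the application and its driver pod when done
sparkctl delete spark-pi -n {project}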

In conclusion, I would like to point out the drawbacks discovered while using the current stable version of Spark (2.4.5) on Kubernetes:

  1. The first and probably main drawback is the lack of Data Locality. For all of YARN's shortcomings, there were advantages to using it, for example the principle of delivering code to the data (rather than data to the code). Thanks to it, Spark tasks were executed on the nodes where the data involved in the computations was stored, and the time spent delivering data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a task across the network. If it is large enough, task execution time can increase significantly, and a fairly large amount of disk space must also be allocated to the Spark task instances for temporary storage. This drawback can be mitigated by using specialized software that provides data locality in Kubernetes (for example, Alluxio), but this effectively means having to store a complete copy of the data on the nodes of the Kubernetes cluster.
  2. The second important drawback is security. By default, security-related features for running Spark tasks are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and the security documentation for Spark (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as the key targets. At the same time, the user under whom Spark tasks are launched cannot be specified directly - we only specify the service account under which the task will run, and the user is selected according to the configured security policies. As a result, either the root user is used, which is unsafe in a production environment, or a user with a random UID, which is inconvenient when distributing access rights to data (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). At the moment, the solution is either to place all necessary files directly in the Docker image, or to modify the Spark launch script to use the mechanism for storing and retrieving secrets adopted in your organization.
  3. Running Spark tasks on Kubernetes is officially still in experimental mode, and significant changes in the artifacts used (configuration files, Docker base images, and launch scripts) are possible in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed considerably.
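As a sketch of the secret-handling workaround mentioned in point 2: the operator's v1beta1 API allows Kubernetes Secrets to be mounted into the driver (and executor) pods via a "secrets" list, so credentials need not be baked into the image. The Secret name and mount path here are hypothetical:

driver:
  secrets:
    - name: spark-credentials   # hypothetical Secret holding e.g. keytabs or tokens
      path: /mnt/secrets        # mount point inside the driver pod
      secretType: Generic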

Let's wait for updates - a new version of Spark (3.0.0) was recently released, which brought noticeable changes to how Spark works on Kubernetes, but kept the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark tasks on Kubernetes without fearing for the security of your system and without the need to independently modify functional components.

The end.

Source: will.com
