Running Apache Spark on Kubernetes

Dear readers, good afternoon. Today we will talk a little about Apache Spark and its development prospects.

In the modern world of Big Data, Apache Spark is the de facto standard for developing batch data processing jobs. In addition, it is also used to build streaming applications that work on the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally it has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is questionable for most companies due to the lack of decent Hadoop distributions: development of HDP and CDH has stopped, CDH is not well developed and is expensive, and the remaining Hadoop vendors have either ceased to exist or have a bleak future. Therefore, running Apache Spark on Kubernetes attracts growing interest from the community and large companies: having become the standard in container orchestration and resource management in private and public clouds, Kubernetes solves the problem of inconvenient resource scheduling of Spark jobs on YARN and provides a steadily developing platform with many commercial and open-source distributions for companies of all sizes and stripes. Besides, on the wave of its popularity, most companies have already acquired a couple of installations of their own and have built up expertise in using it, which simplifies the move.

Starting with version 2.3.0, Apache Spark gained official support for running jobs in a Kubernetes cluster, and today we will talk about the current maturity of this approach, the various options for using it, and the pitfalls to be encountered during implementation.

First of all, let's look at the process of developing jobs and applications based on Apache Spark and highlight the typical cases in which you would want to run a job on a Kubernetes cluster. In this post, OpenShift is used as the distribution, and the commands relevant to its command-line utility (oc) will be given. For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues can be used (for example, for oc adm policy).

Use case one: spark-submit

While developing jobs and applications, a developer needs to run jobs to debug data transformations. In theory, stubs can be used for this, but development involving real (albeit test) instances of the target systems has proven faster and better for this class of tasks. In the case when we debug against real instances of the target systems, two scenarios are possible:

  • the developer runs the Spark job locally in standalone mode;

  • the developer runs the Spark job on a Kubernetes cluster in the test loop.

The first option has a right to exist, but it entails a number of drawbacks:

  • each developer must be given access from their workplace to all the target system instances they need;
  • sufficient resources are required on the developer's machine to run the job being developed.

The second option lacks these drawbacks, since using a Kubernetes cluster lets you allocate the necessary resource pool for running jobs and grant it the required access to the target system instances, flexibly providing access to it via the Kubernetes role model for all members of the development team. Let's single this out as the first use case: launching Spark jobs from a local developer machine on a Kubernetes cluster in the test loop.

Let's talk more about the process of setting up Spark to run locally. To start using Spark, you need to install it:

mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz

We build the packages necessary for working with Kubernetes:

cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package

A full build takes a lot of time, but to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the "assembly/" directory, so you can build just this submodule:

./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package

To run Spark jobs on Kubernetes, you need to create a Docker image to use as a base. Two approaches are possible here:

  • the built Docker image includes the executable Spark job code;
  • the built image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (for example, in HDFS).

First, let's build a Docker image containing a test example of a Spark job. Spark ships with a utility for building Docker images called "docker-image-tool". Let's study its help:

./bin/docker-image-tool.sh --help

It can be used to create Docker images and upload them to remote registries, but by default it has a number of drawbacks:

  • it always creates 3 Docker images at once: for Spark, PySpark and R;
  • it does not allow you to specify an image name.

Therefore, we will use a slightly modified version of this utility, given below:

vi bin/docker-image-tool-upd.sh

#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi

  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -i name               Image name to apply to the built image, or to identify the image to be pushed.  
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
 case "${option}"
 in
 f) BASEDOCKERFILE=${OPTARG};;
 r) REPO=${OPTARG};;
 t) TAG=${OPTARG};;
 n) NOCACHEARG="--no-cache";;
 i) IMAGE_REF=${OPTARG};;
 b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
 esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac

Using it, we build a base Spark image containing a test job for calculating Pi with Spark (here {docker-registry-url} is the URL of your Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level image separation is used, for example as in the integrated image registry of Red Hat OpenShift), and {tag} is the tag of this version of the image):

./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build

Log in to the OKD cluster using the console utility (here {OKD-API-URL} is the OKD cluster API URL):

oc login {OKD-API-URL}

Let's obtain the current user's token for authorization in the Docker registry:

oc whoami -t

Log in to the internal Docker registry of the OKD cluster (using the token obtained with the previous command as the password):

docker login {docker-registry-url}
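
The two previous steps can be combined into a single command; a small convenience sketch, assuming your registry accepts the OpenShift token as a password:

docker login -u $(oc whoami) -p $(oc whoami -t) {docker-registry-url}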

Let's upload the built Docker image to the OKD Docker registry:

./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push

Let's check that the built image is available in OKD. To do this, open in a browser the URL with the list of images of the corresponding project (here {project} is the project name inside the OpenShift cluster, {OKD-WEBUI-URL} is the URL of the OpenShift web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
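
Alternatively, the same check can be done from the terminal; this is a sketch assuming the integrated registry has created an ImageStream for the pushed image:

oc get imagestream {image-name} -n {project}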

To run jobs, a service account must be created with privileges to run pods as root (we will come back to this point later):

oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}

Let's run spark-submit to publish our Spark job to the OKD cluster, specifying the created service account and the Docker image:

 /opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar

Here:

--name is the job name that will take part in forming the names of the Kubernetes pods;

--class is the class of the executable file, called when the job starts;

--conf are Spark configuration parameters;

spark.executor.instances is the number of Spark executors to launch;

spark.kubernetes.authenticate.driver.serviceAccountName is the name of the Kubernetes service account used when launching the pods (to define the security context and capabilities for interacting with the Kubernetes API);

spark.kubernetes.namespace is the Kubernetes namespace in which the driver and executor pods will be launched;

spark.submit.deployMode is the Spark launch mode ("cluster" is used for the standard spark-submit, "client" for Spark Operator and later versions of Spark);

spark.kubernetes.container.image is the Docker image used to launch the pods;

spark.master is the Kubernetes API URL (the external endpoint is specified, since the launch happens from the local machine);

local:// is the path to the Spark executable inside the Docker image.

We go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
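
The same can be done from the terminal; a sketch (the driver pod name is generated by Spark from the value passed to --name, so {driver-pod-name} is a placeholder to substitute, e.g. something like spark-test-...-driver):

oc get pods -n {project}
oc logs -f {driver-pod-name} -n {project}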

To simplify the development process, another option can be used, in which a common base image of Spark is created and used by all jobs, while snapshots of the executable files are published to external storage (for example, Hadoop) and specified as a link when calling spark-submit. In this case, different versions of Spark jobs can be run without rebuilding the Docker images, using, for example, WebHDFS to publish the executables. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is the port of the WebHDFS service, {path-to-file-on-hdfs} is the desired path to the file on HDFS):

curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"

You will receive a response like this (here {location} is the URL to which the file should be uploaded):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0

Upload the Spark executable file to HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):

curl -i -X PUT -T {path-to-local-file} "{location}"

After that, we can do spark-submit using the Spark file uploaded to HDFS (here {class-name} is the name of the class that needs to be launched to complete the job):

/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL}  hdfs://{host}:{port}/{path-to-file-on-hdfs}

It should be noted that in order to access HDFS and make the job work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile that copies the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration files in SPARK_CLASSPATH in entrypoint.sh.
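
As a sketch of such a change (the file names and the /opt/spark/hadoop-conf path are assumptions to adapt to your image layout):

# Dockerfile: copy the HDFS client configuration into the image
COPY hadoop-conf/core-site.xml hadoop-conf/hdfs-site.xml /opt/spark/hadoop-conf/

# entrypoint.sh: add the HDFS configuration directory to the Spark classpath
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf"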

Use case two: Apache Livy

Next, when a job has been developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking the status of its execution. Of course, you can run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation has chosen Apache Livy as a REST API for running Spark jobs, hosted inside a Kubernetes cluster. With it, you can launch Spark jobs on the Kubernetes cluster using regular cURL requests, which is easy to implement on top of any CI solution, and its placement inside the Kubernetes cluster solves the question of authentication when interacting with the Kubernetes API.

Let's single this out as the second use case: running Spark jobs as part of a CI/CD process on a Kubernetes cluster in the test loop.

A little about Apache Livy: it works as an HTTP server that provides a web interface and a RESTful API allowing you to remotely launch spark-submit by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one: github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image including Spark version 2.4.5 was built from the following Dockerfile:

FROM java:8-alpine

ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark

WORKDIR /opt

RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark

RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties

ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh

ENV PATH="/opt/livy/bin:${PATH}"

EXPOSE 8998

ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]

The created image can be built and uploaded to your existing Docker repository, such as the internal OKD registry. To deploy it, use the following manifest ({registry-url} is the Docker image registry URL, {image-name} is the Docker image name, {tag} is the Docker image tag, {livy-url} is the desired URL at which the Livy server will be reachable; the "Route" manifest is used if Red Hat OpenShift is the Kubernetes distribution; otherwise a corresponding Ingress or Service manifest of type NodePort is used):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image        "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None

After applying the manifest and successfully starting the pod, the Livy graphical interface is available at http://{livy-url}/ui. With Livy, we can publish our Spark job using a REST request from, for example, Postman. An example collection of requests is given below (configuration arguments with the variables needed by the launched job can be passed in the "args" array):

{
    "info": {
        "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
        "name": "Spark Livy",
        "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
    },
    "item": [
        {
            "name": "1 Submit job with jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", nt"className": "org.apache.spark.examples.SparkPi",nt"numExecutors":1,nt"name": "spark-test-1",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt}n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        },
        {
            "name": "2 Submit job without jar",
            "request": {
                "method": "POST",
                "header": [
                    {
                        "key": "Content-Type",
                        "value": "application/json"
                    }
                ],
                "body": {
                    "mode": "raw",
                    "raw": "{nt"file": "hdfs://{host}:{port}/{path-to-file-on-hdfs}", nt"className": "{class-name}",nt"numExecutors":1,nt"name": "spark-test-2",nt"proxyUser": "0",nt"conf": {ntt"spark.jars.ivy": "/tmp/.ivy",ntt"spark.kubernetes.authenticate.driver.serviceAccountName": "spark",ntt"spark.kubernetes.namespace": "{project}",ntt"spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"nt},nt"args": [ntt"HADOOP_CONF_DIR=/opt/spark/hadoop-conf",ntt"MASTER=k8s://https://kubernetes.default.svc:8443"nt]n}"
                },
                "url": {
                    "raw": "http://{livy-url}/batches",
                    "protocol": "http",
                    "host": [
                        "{livy-url}"
                    ],
                    "path": [
                        "batches"
                    ]
                }
            },
            "response": []
        }
    ],
    "event": [
        {
            "listen": "prerequest",
            "script": {
                "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        },
        {
            "listen": "test",
            "script": {
                "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
                "type": "text/javascript",
                "exec": [
                    ""
                ]
            }
        }
    ],
    "protocolProfileBehavior": {}
}

Let's execute the first request from the collection, go to the OKD interface, and check that the job has launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which you can track the progress of the job and study the session logs using the Livy API or the graphical interface.
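
The same request can also be sent without Postman; a minimal cURL sketch of the first request from the collection (the placeholders match the ones used above):

curl -H "Content-Type: application/json" -X POST -d '{
    "file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "numExecutors": 1,
    "name": "spark-test-1",
    "conf": {
        "spark.jars.ivy": "/tmp/.ivy",
        "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
        "spark.kubernetes.namespace": "{project}",
        "spark.kubernetes.container.image": "{docker-registry-url}/{repo}/{image-name}:{tag}"
    }
}' http://{livy-url}/batches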

Now let's show how Livy works. To do this, let's examine the logs of the Livy container inside the pod with the Livy server: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs (here {livy-pod-name} is the name of the created pod with the Livy server). From them you can see that calling the Livy REST API in the container named "livy" executes a spark-submit similar to the one we used above. The collection also contains a second request that launches jobs with the Spark executable hosted remotely, using the Livy server.
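
For a CI/CD pipeline this means the whole cycle can be scripted with two HTTP calls: submit a batch, then poll its state until it finishes. A minimal sketch (here {batch-id} is the value of the "id" field returned by the POST /batches request above):

curl http://{livy-url}/batches/{batch-id}/state
# the expected output is of the form {"id":1,"state":"success"}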

Use case three: Spark Operator

Now that the job has been tested, the question arises of running it regularly. The native way to run jobs regularly in a Kubernetes cluster is the CronJob entity, and you can use it, but at the moment the use of operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator which is, among other things, used in enterprise-level solutions (for example, the Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited options for configuring Spark jobs on Kubernetes, while the next major version (3.0.0) declares full support for Kubernetes, but its release date remains unknown. The Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into the Spark pods) and the ability to run jobs on a regular schedule.

Let's single this out as the third use case: regularly running Spark jobs on a Kubernetes cluster in the production loop.

Spark Operator is open source and is developed within the Google Cloud Platform: github.com/GoogleCloudPlatform/spark-on-k8s-operator. It can be installed in three ways:

  1. As part of the Lightbend FastData Platform / Cloudflow installation;
  2. Using Helm:
    helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
    helm install incubator/sparkoperator --namespace spark-operator
    	

  3. Using manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). The following is worth noting: Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the corresponding API version, for example "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
    oc get crd sparkapplications.sparkoperator.k8s.io -o yaml
    	

If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow namespace for the Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can examine the available Spark applications with the following command:

oc get sparkapplications -n {project}

To run a job using Spark Operator, you need to do three things:

  • create a Docker image that includes all the necessary libraries as well as the configuration and executable files. In the target picture, this is an image created at the CI/CD stage and tested on a test cluster;
  • publish the Docker image to a registry accessible from the Kubernetes cluster;
  • create a manifest of type "SparkApplication" with a description of the job to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). There are important points to note about the manifest:
    1. the "apiVersion" key must indicate the API version corresponding to the operator version;
    2. the "metadata.namespace" key must indicate the namespace in which the application will be launched;
    3. the "spec.image" key must contain the address of the created Docker image in an accessible registry;
    4. the "spec.mainClass" key must contain the Spark job class to run when the process starts;
    5. the "spec.mainApplicationFile" key must contain the path to the executable jar file;
    6. the "spec.sparkVersion" key must indicate the version of Spark being used;
    7. the "spec.driver.serviceAccount" key must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
    8. the "spec.executor" key must indicate the amount of resources allocated to the application;
    9. the "spec.volumeMounts" key must specify the local directory in which the local Spark job files will be created.

An example of creating a manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark jobs):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This manifest specifies a service account for which, before publishing the manifest, you need to create the role bindings granting the Spark application the access rights it needs to interact with the Kubernetes API (if necessary). In our case, the application needs the right to create Pods. Let's create the necessary role binding:

oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}

It is also worth noting that the manifest specification can include a "hadoopConfigMap" parameter, which lets you specify a ConfigMap with the Hadoop configuration without first placing the corresponding file in the Docker image. The operator is also suitable for running jobs regularly: using the "schedule" parameter, a schedule for running a given job can be specified (see the sketch below).
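
As a sketch of such a regular launch (assuming the installed operator version supports the ScheduledSparkApplication resource; the field names follow the v1beta1 examples from the operator repository, and the schedule value here is arbitrary):

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  # run the job on a fixed interval; cron syntax is also accepted
  schedule: "@every 1h"
  concurrencyPolicy: Allow
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"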

After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:

oc apply -f spark-pi.yaml

This creates an object of type "sparkapplications":

oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h

In this case, a pod with the application will be created, and its status will be displayed in the created "sparkapplications". You can view it with the following command:

oc get sparkapplications spark-pi -o yaml -n {project}

Upon completion of the job, the pod will move to the "Completed" status, which will also be updated in "sparkapplications". Application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running job):

oc logs {sparkapplications-pod-name} -n {project}

Spark jobs can also be managed using the dedicated sparkctl utility. To install it, clone the repository with its source code, install Go, and build the utility:

git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin

Let's check the list of running Spark jobs:

sparkctl list -n {project}

Let's create a description for a Spark job:

vi spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Let's launch the described job using sparkctl:

sparkctl create spark-app.yaml -n {project}

Let's check the list of running Spark jobs:

sparkctl list -n {project}

Let's check the list of events of the launched Spark job:

sparkctl event spark-pi -n {project} -f

Let's check the status of the running Spark job:

sparkctl status spark-pi -n {project}

In conclusion, I would like to consider the drawbacks discovered while operating the current stable version of Spark (2.4.5) on Kubernetes:

  1. The first and, perhaps, main drawback is the lack of Data Locality. Despite all the shortcomings of YARN, there were advantages to using it, for example, the principle of delivering code to the data (rather than data to the code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the computations was located, which noticeably reduced the time spent delivering data over the network. When using Kubernetes, we are faced with the need to move the data involved in a job across the network. If it is large enough, the job execution time can increase significantly, and a fairly large amount of disk space allocated to the Spark job instances for temporary storage is also required. This drawback can be mitigated by using specialized software that provides data locality in Kubernetes (for example, Alluxio), but this actually means having to store a complete copy of the data on the nodes of the Kubernetes cluster.
  2. The second important drawback is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and the security documentation for Spark (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as key stores. At the same time, the user under which Spark jobs are launched cannot be specified directly: we only specify the service account under which the job will run, and the user is selected based on the configured security policies. In this regard, either the root user is used, which is unsafe in a productive environment, or a user with a random UID, which is inconvenient when distributing data access rights (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts; a sketch of such a policy is given after this list). At the moment, the solution is either to place all the necessary files directly in the Docker image, or to modify the Spark launch script to use the mechanism for storing and retrieving secrets adopted in your organization.
  3. Launching Spark jobs on Kubernetes is still officially in experimental mode, and significant changes in the artifacts used (configuration files, base Docker images, and launch scripts) are possible in the future. Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed considerably.
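
A sketch of the PodSecurityPolicy mentioned in the second point (the fixed UID 1000 is an assumption; the policy must additionally be granted to the Spark service account through RBAC):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: spark-fixed-uid
spec:
  privileged: false
  runAsUser:
    # pin Spark pods to a known UID so data access rights can be granted to it
    rule: MustRunAs
    ranges:
      - min: 1000
        max: 1000
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - "*"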

Let's wait for updates: a fresh version of Spark (3.0.0) was recently released, bringing notable changes to Spark on Kubernetes while retaining the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark jobs on Kubernetes without fearing for the security of your system and without the need to modify functional components on your own.

The end.

Source: www.habr.com
