the Docker image being built includes the executable Spark job code itself;
the image being built includes only Spark and its required dependencies, while the executable code is hosted remotely (for example, in HDFS).
First, let's build a Docker image containing a test example of a Spark job. To build Docker images, Spark ships with a utility called "docker-image-tool". Let's look at its help:
./bin/docker-image-tool.sh --help
With it you can build Docker images and upload them to remote registries, but by default it has several drawbacks:
it always builds three Docker images at once - for Spark, PySpark and R;
#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ "$add_repo" = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    # Running from a source checkout rather than a release distribution.
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    # Running from a release distribution.
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi
  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file     Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file     Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file     Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo     Repository address.
  -i name     Image name to apply to the built image, or to identify the image to be pushed.
  -t tag      Tag to apply to the built image, or to identify the image to be pushed.
  -m          Use minikube's Docker daemon.
  -n          Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:
  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
  case "${option}"
  in
    f) BASEDOCKERFILE=${OPTARG};;
    r) REPO=${OPTARG};;
    t) TAG=${OPTARG};;
    n) NOCACHEARG="--no-cache";;
    i) IMAGE_REF=${OPTARG};;
    b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
  esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac
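To make the naming scheme concrete, the image_ref helper above simply joins the repository, image name and tag. Here is a standalone, runnable copy of it (the repository and tag values are made up for the example):

```shell
# Standalone copy of the script's image_ref helper: prepends $REPO and
# appends :$TAG to an image name when those variables are set.
image_ref() {
  local image="$1"
  local add_repo="${2:-1}"
  if [ "$add_repo" = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

REPO="docker.io/myrepo"   # hypothetical registry/repository
TAG="v2.4.5"              # hypothetical tag
image_ref spark           # prints: docker.io/myrepo/spark:v2.4.5
```

Passing 0 as the second argument skips the repository prefix, which the script uses when it only needs the image name with its tag.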
With its help, we build the base Spark image containing a test job that calculates Pi using Spark (here {docker-registry-url} is the URL of the Docker image registry, {repo} is the name of the repository inside the registry, which matches the project in OpenShift, {image-name} is the name of the image (if three-level image separation is used, for example, as in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this version of the image):
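Given the -r, -i and -t options defined in the script above, the build and push steps presumably look something like the following sketch (using the placeholders from the text, not a verbatim command):

```shell
./bin/docker-image-tool.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build
./bin/docker-image-tool.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push
```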
Let's check that the built image is available in OKD. To do this, open in a browser the URL with the list of images of the corresponding project (here {project} is the project name inside the OpenShift cluster and {OKD-WEBUI-URL} is the URL of the OpenShift Web console) - https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
To run jobs, a service account must be created with privileges to run pods as root (we will discuss this point later):
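On OpenShift/OKD this can be sketched with the following commands (the account name "spark" and the use of the anyuid SCC are assumptions; adjust them to your cluster's policies):

```shell
# Create the service account in the target project:
oc create serviceaccount spark -n {project}
# Allow pods running under this account to use an arbitrary (including root) UID:
oc adm policy add-scc-to-user anyuid -z spark -n {project}
```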
Let's run the spark-submit command to publish a Spark job to the OKD cluster, specifying the created service account and the Docker image:
spark.executor.instances - the number of Spark executors to launch;
spark.kubernetes.authenticate.driver.serviceAccountName - the name of the Kubernetes service account used when launching pods (defining the security context and capabilities when interacting with the Kubernetes API);
spark.kubernetes.namespace - the Kubernetes namespace in which the driver and executor pods will be launched;
spark.submit.deployMode - the Spark launch mode ("cluster" is used for standard spark-submit, "client" for Spark Operator and later versions of Spark);
spark.kubernetes.container.image - the Docker image used to launch the pods;
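Putting the parameters above together, the submission might look like the following sketch (the {OKD-API-URL} placeholder, the service account name "spark", the examples jar path and the SparkPi class are assumptions based on the stock Spark distribution):

```shell
/opt/spark/bin/spark-submit \
  --master k8s://https://{OKD-API-URL} \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.namespace={project} \
  --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
```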
Note that in order to access HDFS and make the job work, you may need to modify the Dockerfile and the entrypoint.sh script - add an instruction to the Dockerfile to copy the dependency libraries to the /opt/spark/jars directory, and include the HDFS configuration file in SPARK_CLASSPATH in the entrypoint.
Second use case - Apache Livy
Further, once a job has been developed and its result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking its execution status. Of course, you could run it with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark jobs, hosted inside the Kubernetes cluster. With it, you can run Spark jobs on a Kubernetes cluster using ordinary cURL requests, which is easy to implement on any CI solution, and its placement inside the Kubernetes cluster removes the authentication problem when interacting with the Kubernetes API.
Let's highlight this as the second use case - running Spark jobs as part of a CI/CD process on a Kubernetes cluster in a test loop.
A few words about Apache Livy - it works as an HTTP server providing a web interface and a RESTful API that lets you launch spark-submit remotely by passing the necessary parameters. Traditionally it has shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one - github.com/ttauveron/k8s-big-data-experiments/tree/master/livy-spark-2.3. For our case, a similar Docker image including Spark version 2.4.5 was built from a custom Dockerfile.
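A batch submission through Livy's REST API can be sketched with cURL like this (POST /batches is part of the documented Livy API; {livy-url}, the jar path and the parameter values are assumptions):

```shell
# Submit a Spark batch job through the Livy server:
curl -s -X POST -H "Content-Type: application/json" \
  -d '{
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar",
        "className": "org.apache.spark.examples.SparkPi",
        "numExecutors": 2
      }' \
  http://{livy-url}/batches

# Poll the state of submitted batches:
curl -s http://{livy-url}/batches
```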
Let's execute the first request from the collection, go to the OKD interface and check that the job has started successfully - https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session will appear in the Livy interface (http://{livy-url}/ui), within which, using the Livy API or its graphical interface, you can track the job's progress and study the session logs.
Now let's see how Livy works. To do this, let's examine the logs of the Livy container inside the pod with the Livy server - https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs. From them we can see that when the Livy REST API is called, a spark-submit is executed in the container named "livy", similar to the one we used above (here {livy-pod-name} is the name of the created pod with the Livy server). The collection also provides a second request that lets you run jobs whose Spark executable code is hosted remotely, using the Livy server.
Third use case - Spark Operator
Now that the job has been tested, the question of running it regularly arises. The native way to run jobs periodically in a Kubernetes cluster is the CronJob resource, and you can use it, but at the moment the use of operators to manage applications in Kubernetes is very popular, and for Spark there is a fairly mature operator, which is also used in Enterprise-level solutions (for example, the Lightbend FastData Platform). We recommend using it - the stable version of Spark (2.4.5) has rather limited configuration options for running Spark jobs on Kubernetes, while the next major version (3.0.0) declares full Kubernetes support, but its release date remains unknown. The Spark Operator compensates for this shortcoming by adding important configuration options (for example, mounting a ConfigMap with the Hadoop access configuration into the Spark pods) and the ability to run jobs on a regular schedule.
Let's highlight this as the third use case - regularly running Spark jobs on a Kubernetes cluster in a production loop.
The operator can be installed using the manifests from the official source (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest). It is important to note the following - Cloudflow includes an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the appropriate API version, for example, "v1beta1-0.9.0-2.4.0". The operator version can be found in the description of the CRD included in the operator, in the "versions" dictionary:
oc get crd sparkapplications.sparkoperator.k8s.io -o yaml
If the operator is installed correctly, an active pod with the Spark operator will appear in the corresponding project (for example, cloudflow-fdp-sparkoperator in the Cloudflow space for the Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. You can examine the available Spark applications with the following command:
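Assuming the operator has registered the "sparkapplications" resource in the cluster, the listing presumably looks like this:

```shell
oc get sparkapplications -n {project}
```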
To run a job with the Spark Operator, you need to:
build a Docker image that includes all required libraries, as well as the configuration and executable files. In the target picture, this is an image created at the CI/CD stage and tested on a test cluster;
publish the Docker image to a registry accessible from the Kubernetes cluster;
the "apiVersion" field must specify the API version corresponding to the operator version;
the "metadata.namespace" field must specify the namespace in which the application will be launched;
the "spec.image" field must contain the address of the built Docker image in an accessible registry;
the "spec.mainClass" field must contain the Spark job class to be run when the process starts;
the "spec.mainApplicationFile" field must contain the path to the executable jar file;
the "sparkVersion" field must specify the version of Spark being used;
the "spec.driver.serviceAccount" field must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
the "spec.executor" field must specify the amount of resources allocated to the application;
the "spec.volumeMounts" field must define the local directory in which the local Spark job files will be created.
An example of creating a manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark jobs):
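A sketch of such a manifest, modeled on the spark-pi example that ships with the operator (the image address, resource sizes and jar path are assumptions):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: "{project}"
spec:
  type: Scala
  mode: cluster
  image: "{docker-registry-url}/{repo}/{image-name}:{tag}"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: "{spark-service-account}"
  executor:
    cores: 1
    instances: 2
    memory: "512m"
```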
It is also important to note that this manifest can include a "hadoopConfigMap" section, which lets you specify a ConfigMap with the Hadoop configuration without first having to place the corresponding file in the Docker image. It is also suitable for running jobs on a regular basis - using the "schedule" section, a schedule for running the given job can be defined.
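For scheduled execution, the operator provides a separate ScheduledSparkApplication resource; a hedged fragment showing both the schedule and the hadoopConfigMap option (the names "spark-pi-scheduled" and "hadoop-conf" are assumptions):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: "{project}"
spec:
  schedule: "0 1 * * *"   # cron syntax: run daily at 01:00
  concurrencyPolicy: Allow
  template:
    # ...the remaining fields match a SparkApplication spec
    hadoopConfigMap: hadoop-conf   # ConfigMap holding the Hadoop configuration
```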
After that, we save our manifest to the spark-pi.yaml file and apply it to our Kubernetes cluster:
oc apply -f spark-pi.yaml
This will create an object of type "sparkapplications":
oc get sparkapplications -n {project}
> NAME       AGE
> spark-pi   22h
At this point, a pod with the application will be created, and its status will be reflected in the created "sparkapplications" object. You can view it with the following command:
oc get sparkapplications spark-pi -o yaml -n {project}
When the job completes, the POD will move to the "Completed" status, which will also be updated in "sparkapplications". Application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the job's pod):
oc logs {sparkapplications-pod-name} -n {project}
Spark jobs can also be managed with the specialized sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:
git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin
Let's look at the list of events of the launched Spark job:
sparkctl event spark-pi -n {project} -f
Let's check the status of the running Spark job:
sparkctl status spark-pi -n {project}
In conclusion, I would like to review the drawbacks discovered while operating the current stable version of Spark (2.4.5) on Kubernetes:
The first and perhaps main drawback is the lack of Data Locality. Despite all of YARN's shortcomings, it also had advantages, for example, the principle of delivering code to the data (rather than data to the code). Thanks to it, Spark jobs were executed on the nodes where the data involved in the computations resided, and the time spent delivering data over the network was noticeably reduced. When using Kubernetes, we face the need to move the data involved in a job across the network. If it is large enough, job execution time can increase significantly, and a fairly large amount of disk space must be allocated to the Spark executor instances for temporary storage. This drawback can be mitigated with specialized software that provides data locality in Kubernetes (for example, Alluxio), but that effectively means storing a full copy of the data on the nodes of the Kubernetes cluster.
The second important drawback is security. By default, security-related features for running Spark jobs are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options were introduced in version 3.0.0, which will require additional work), and the security documentation for Spark (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and Standalone Cluster as key stores. At the same time, the user under which Spark jobs are launched cannot be specified directly - we only specify the service account under which the job will run, and the user is chosen based on the configured security policies. As a result, either the root user is used, which is not safe in a production environment, or a user with a random UID, which is inconvenient when distributing data access rights (this can be solved by creating PodSecurityPolicies and binding them to the corresponding service accounts). For now, the workaround is either to place all necessary files directly into the Docker image, or to modify the Spark launch script to use the secret storage and retrieval mechanism adopted in your organization.
Running Spark jobs on Kubernetes is still officially experimental, and significant changes may come in the artifacts used (configuration files, Docker base images and launch scripts). Indeed, while preparing this material, versions 2.3.0 and 2.4.5 were tested, and their behavior differed significantly.
Let's wait for updates - a new version of Spark (3.0.0) was recently released, which brought notable changes to running Spark on Kubernetes but retained the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark jobs on Kubernetes without fearing for the security of your system and without having to modify functional components on your own.