Hello, dear readers! Today we will talk a little about Apache Spark and its development prospects.
In the modern big-data world, Apache Spark is the de facto standard for developing batch processing tasks. It is also used to build streaming applications that work on the micro-batch concept, processing and shipping data in small portions (Spark Structured Streaming). Traditionally, Spark has been part of the overall Hadoop stack, using YARN (or, in some cases, Apache Mesos) as the resource manager. By 2020, its use in this traditional form is in question for most companies due to the lack of decent Hadoop distributions: the development of HDP and CDH has stopped, CDH is under-developed and costly, and the remaining Hadoop suppliers have either ceased to exist or face a vague future. Accordingly, launching Apache Spark on Kubernetes is attracting growing interest from the community and large companies: having become the standard for container orchestration and resource management in private and public clouds, Kubernetes solves the awkward resource scheduling of Spark tasks on YARN and offers a steadily evolving platform with many commercial and open distributions for companies of all sizes and stripes. Moreover, on the wave of its popularity, most companies have already acquired a couple of installations of their own and have built up expertise in using it, which simplifies migration.
Since version 2.3.0, Apache Spark has had official support for running tasks on a Kubernetes cluster. Today we will look at the current maturity of this approach, the options for using it, and the pitfalls encountered during implementation.
First of all, let's look at the process of developing tasks and applications based on Apache Spark and highlight the typical cases in which you would want to run a task on a Kubernetes cluster. This post uses OpenShift as the distribution, so the commands given are for its command-line utility (oc). For other Kubernetes distributions, the corresponding commands of the standard Kubernetes command-line utility (kubectl) or their analogues can be used (for oc adm policy, for example).
First use case - spark-submit
While developing tasks and applications, a developer needs to run tasks to debug data transformations. In theory, stubs could be used for this purpose, but development involving real (albeit test) instances of the end systems has proven faster and better in this class of tasks. When debugging against real instances of end systems, two scenarios are possible:
- the developer runs the Spark task locally in standalone mode;
- the developer runs the Spark task on a Kubernetes cluster in a test loop.
The first option has a right to exist but entails several drawbacks:
- each developer must be granted access from the workplace to all instances of the end systems they need;
- the working machine must have enough resources to run the task being developed.
The second option avoids these drawbacks, since using a Kubernetes cluster lets you allocate a resource pool for running tasks and grant it the necessary access to the end-system instances, flexibly controlling access to that pool via the Kubernetes role model for all members of the development team. Let's highlight it as the first use case: launching Spark tasks from a local developer machine on a Kubernetes cluster in a test loop.
Let's dwell on the process of setting up Spark for local runs. To start using Spark, it has to be installed:
mkdir /opt/spark
cd /opt/spark
wget http://mirror.linux-ia64.org/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar zxvf spark-2.4.5.tgz
rm -f spark-2.4.5.tgz
Next, we collect the packages needed for working with Kubernetes:
cd spark-2.4.5/
./build/mvn -Pkubernetes -DskipTests clean package
A full build takes a long time, and to create Docker images and run them on a Kubernetes cluster you really only need the jar files from the assembly/ directory, so it is enough to build just that subproject:
./build/mvn -f ./assembly/pom.xml -Pkubernetes -DskipTests clean package
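If the build succeeds, a quick sanity check is to list the jar files in the assembly output (the scala-2.11 directory name here is an assumption; it depends on the Scala version of the build profile):
ls assembly/target/scala-2.11/jars | head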
To run Spark jobs on Kubernetes, you need a Docker image to use as the base image. Two approaches are possible here:
- the generated Docker image includes the executable Spark task code;
- the generated image includes only Spark and the necessary dependencies, while the executable code is hosted remotely (e.g. in HDFS).
First, let's build a Docker image containing a test example of a Spark task. Spark ships with a utility for building Docker images called docker-image-tool. Let's study its help:
./bin/docker-image-tool.sh --help
It can be used to build Docker images and upload them to remote registries, but by default it has several drawbacks:
- it always builds three Docker images at once (for Spark, PySpark and R);
- it does not allow specifying an image name.
Therefore, we will use a modified version of this utility, shown below:
vi bin/docker-image-tool-upd.sh
#!/usr/bin/env bash

function error {
  echo "$@" 1>&2
  exit 1
}

if [ -z "${SPARK_HOME}" ]; then
  SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

function image_ref {
  local image="$1"
  local add_repo="${2:-1}"
  if [ $add_repo = 1 ] && [ -n "$REPO" ]; then
    image="$REPO/$image"
  fi
  if [ -n "$TAG" ]; then
    image="$image:$TAG"
  fi
  echo "$image"
}

function build {
  local BUILD_ARGS
  local IMG_PATH

  if [ ! -f "$SPARK_HOME/RELEASE" ]; then
    IMG_PATH=$BASEDOCKERFILE
    BUILD_ARGS=(
      ${BUILD_PARAMS}
      --build-arg
      img_path=$IMG_PATH
      --build-arg
      datagram_jars=datagram/runtimelibs
      --build-arg
      spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
    )
  else
    IMG_PATH="kubernetes/dockerfiles"
    BUILD_ARGS=(${BUILD_PARAMS})
  fi

  if [ -z "$IMG_PATH" ]; then
    error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
  fi
  if [ -z "$IMAGE_REF" ]; then
    error "Cannot find docker image reference. Please add -i arg."
  fi

  local BINDING_BUILD_ARGS=(
    ${BUILD_PARAMS}
    --build-arg
    base_img=$(image_ref $IMAGE_REF)
  )
  local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/docker/Dockerfile"}

  docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
    -t $(image_ref $IMAGE_REF) \
    -f "$BASEDOCKERFILE" .
}

function push {
  docker push "$(image_ref $IMAGE_REF)"
}

function usage {
  cat <<EOF
Usage: $0 [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file     Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file     Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file     Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo     Repository address.
  -i name     Image name to apply to the built image, or to identify the image to be pushed.
  -t tag      Tag to apply to the built image, or to identify the image to be pushed.
  -m          Use minikube's Docker daemon.
  -n          Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:
  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    $0 -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    $0 -r docker.io/myrepo -t v2.3.0 build
    $0 -r docker.io/myrepo -t v2.3.0 push
EOF
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

REPO=
TAG=
BASEDOCKERFILE=
NOCACHEARG=
BUILD_PARAMS=
IMAGE_REF=
while getopts f:mr:t:nb:i: option
do
  case "${option}"
  in
    f) BASEDOCKERFILE=${OPTARG};;
    r) REPO=${OPTARG};;
    t) TAG=${OPTARG};;
    n) NOCACHEARG="--no-cache";;
    i) IMAGE_REF=${OPTARG};;
    b) BUILD_PARAMS=${BUILD_PARAMS}" --build-arg "${OPTARG};;
  esac
done

case "${@: -1}" in
  build)
    build
    ;;
  push)
    if [ -z "$REPO" ]; then
      usage
      exit 1
    fi
    push
    ;;
  *)
    usage
    exit 1
    ;;
esac
With its help, we build a basic Spark image containing a test task for computing Pi with Spark (here {docker-registry-url} is the URL of the Docker image registry, {repo} is the name of the repository inside the registry, matching the project in OpenShift, {image-name} is the image name (if a two-level separation of images is used, as in the integrated Red Hat OpenShift image registry), and {tag} is the tag of this version of the image):
./bin/docker-image-tool-upd.sh -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile -r {docker-registry-url}/{repo} -i {image-name} -t {tag} build
Log in to the OKD cluster with the console utility (here {OKD-API-URL} is the OKD cluster API URL):
oc login {OKD-API-URL}
Get the current user's authorization token for the Docker registry:
oc whoami -t
Log in to the internal Docker registry of the OKD cluster (we use the token obtained with the previous command as the password):
docker login {docker-registry-url}
Upload the built Docker image to the OKD Docker registry:
./bin/docker-image-tool-upd.sh -r {docker-registry-url}/{repo} -i {image-name} -t {tag} push
Let's check that the built image is available in OKD. To do this, open the URL with the list of images of the corresponding project in a browser (here {project} is the name of the project inside the OpenShift cluster and {OKD-WEBUI-URL} is the URL of the OpenShift web console): https://{OKD-WEBUI-URL}/console/project/{project}/browse/images/{image-name}.
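The same check can be done from the command line; a sketch assuming the push landed in an OpenShift image stream of the project:
oc get imagestreamtags -n {project} | grep {image-name}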
To run tasks, a service account must be created with privileges to run pods as root (we will discuss this point later):
oc create sa spark -n {project}
oc adm policy add-scc-to-user anyuid -z spark -n {project}
Run spark-submit to publish a Spark task to the OKD cluster, specifying the created service account and the Docker image:
/opt/spark/bin/spark-submit --name spark-test --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL} local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar
Here:
--name is the task name that participates in forming the names of the Kubernetes pods;
--class is the class of the executable file, called when the task starts;
--conf are Spark configuration parameters;
spark.executor.instances is the number of Spark executors to launch;
spark.kubernetes.authenticate.driver.serviceAccountName is the name of the Kubernetes service account used when launching pods (to define the security context and capabilities when interacting with the Kubernetes API);
spark.kubernetes.namespace is the Kubernetes namespace in which the driver and executor pods are launched;
spark.submit.deployMode is the Spark launch mode ('cluster' is used for standard spark-submit, 'client' for Spark Operator and later versions of Spark);
spark.kubernetes.container.image is the Docker image used to launch the pods;
spark.master is the Kubernetes API URL (the external one is specified so that the launch happens from the local machine);
local:// is the path to the Spark executable inside the Docker image.
Go to the corresponding OKD project and study the created pods: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods.
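The same can be done from the command line (here {driver-pod-name} is the name of the driver pod created for the task):
oc get pods -n {project}
oc logs -f {driver-pod-name} -n {project}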
To simplify the development process, another option can be used: a common base Spark image is built once and used by all tasks, while snapshots of the executable files are published to external storage (e.g. Hadoop) and passed to spark-submit as a link. In this case, different versions of Spark tasks can be run without rebuilding the Docker images, using e.g. WebHDFS to publish them. We send a request to create a file (here {host} is the host of the WebHDFS service, {port} is its port, and {path-to-file-on-hdfs} is the desired path to the file on HDFS):
curl -i -X PUT "http://{host}:{port}/webhdfs/v1/{path-to-file-on-hdfs}?op=CREATE"
This returns a response of the following form (here {location} is the URL that must be used to upload the file):
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: {location}
Content-Length: 0
Load the Spark executable file into HDFS (here {path-to-local-file} is the path to the Spark executable file on the current host):
curl -i -X PUT -T {path-to-local-file} "{location}"
After that, spark-submit can be run using the Spark file uploaded to HDFS (here {class-name} is the name of the class that must be launched to complete the task):
/opt/spark/bin/spark-submit --name spark-test --class {class-name} --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace={project} --conf spark.submit.deployMode=cluster --conf spark.kubernetes.container.image={docker-registry-url}/{repo}/{image-name}:{tag} --conf spark.master=k8s://https://{OKD-API-URL} hdfs://{host}:{port}/{path-to-file-on-hdfs}
Note that to access HDFS and make the task work, you may need to change the Dockerfile and the entrypoint.sh script: add a directive to the Dockerfile that copies the dependent libraries to the /opt/spark/jars directory, and include the HDFS configuration directory in SPARK_CLASSPATH in the entrypoint.
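A minimal sketch of such changes (the hdfs-libs and hadoop-conf directory names are illustrative assumptions, not taken from the article). In the Dockerfile:
# ship the HDFS client libraries and the Hadoop configuration with the image
COPY hdfs-libs/*.jar /opt/spark/jars/
COPY hadoop-conf/ /opt/spark/hadoop-conf/
And in entrypoint.sh:
# make the Hadoop configuration visible to Spark
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/spark/hadoop-conf"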
Second use case - Apache Livy
Further on, when a task is developed and the result needs to be tested, the question arises of launching it as part of the CI/CD process and tracking its execution status. Of course, it can be run with a local spark-submit call, but this complicates the CI/CD infrastructure, since it requires installing and configuring Spark on the CI server agents/runners and setting up access to the Kubernetes API. For this case, the target implementation chose Apache Livy as a REST API for running Spark tasks, hosted inside the Kubernetes cluster. With it, Spark tasks can be launched on a Kubernetes cluster with ordinary cURL requests, which is easy to implement on any CI solution, and its placement inside the Kubernetes cluster solves the authentication question when interacting with the Kubernetes API.
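For example, a batch submission through Livy's REST API looks roughly like this (a sketch; {livy-url} and the jar path follow the placeholders used below, and {batch-id} is the id returned in the response to the first request):
curl -s -X POST -H 'Content-Type: application/json' -d '{"file": "local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar", "className": "org.apache.spark.examples.SparkPi"}' "http://{livy-url}/batches"
curl -s "http://{livy-url}/batches/{batch-id}"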
Let's highlight it as the second use case: running Spark tasks as part of a CI/CD process on a Kubernetes cluster in a test loop.
A few words about Apache Livy: it works as an HTTP server providing a web interface and a RESTful API through which spark-submit can be invoked remotely by passing the necessary parameters. Traditionally it shipped as part of the HDP distribution, but it can also be deployed to OKD or any other Kubernetes installation using the appropriate manifest and a set of Docker images, such as this one:
FROM java:8-alpine
ENV SPARK_HOME=/opt/spark
ENV LIVY_HOME=/opt/livy
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
ENV SPARK_USER=spark
WORKDIR /opt
RUN apk add --update openssl wget bash && \
    wget -P /opt https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar xvzf spark-2.4.5-bin-hadoop2.7.tgz && \
    rm spark-2.4.5-bin-hadoop2.7.tgz && \
    ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark
RUN wget http://mirror.its.dal.ca/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip && \
    unzip apache-livy-0.7.0-incubating-bin.zip && \
    rm apache-livy-0.7.0-incubating-bin.zip && \
    ln -s /opt/apache-livy-0.7.0-incubating-bin /opt/livy && \
    mkdir /var/log/livy && \
    ln -s /var/log/livy /opt/livy/logs && \
    cp /opt/livy/conf/log4j.properties.template /opt/livy/conf/log4j.properties
ADD livy.conf /opt/livy/conf
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ADD entrypoint.sh /entrypoint.sh
ENV PATH="/opt/livy/bin:${PATH}"
EXPOSE 8998
ENTRYPOINT ["/entrypoint.sh"]
CMD ["livy-server"]
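The image is built and published with standard Docker commands, for example (placeholders as in the rest of the article):
docker build -t {registry-url}/{image-name}:{tag} .
docker push {registry-url}/{image-name}:{tag}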
The built image can be uploaded to your existing Docker repository, for example the internal OKD registry. To deploy it, use the following manifest (here {registry-url} is the Docker image registry URL, {image-name} is the Docker image name, {tag} is the Docker image tag, and {livy-url} is the desired URL at which the Livy server will be reachable; the 'Route' manifest is used when Red Hat OpenShift is the Kubernetes distribution, otherwise a corresponding Ingress or Service manifest of type NodePort is used):
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: livy
  name: livy
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: livy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: livy
    spec:
      containers:
        - command:
            - livy-server
          env:
            - name: K8S_API_HOST
              value: localhost
            - name: SPARK_KUBERNETES_IMAGE
              value: 'gnut3ll4/spark:v1.0.14'
          image: '{registry-url}/{image-name}:{tag}'
          imagePullPolicy: Always
          name: livy
          ports:
            - containerPort: 8998
              name: livy-rest
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/log/livy
              name: livy-log
            - mountPath: /opt/.livy-sessions/
              name: livy-sessions
            - mountPath: /opt/livy/conf/livy.conf
              name: livy-config
              subPath: livy.conf
            - mountPath: /opt/spark/conf/spark-defaults.conf
              name: spark-config
              subPath: spark-defaults.conf
        - command:
            - /usr/local/bin/kubectl
            - proxy
            - '--port'
            - '8443'
          image: 'gnut3ll4/kubectl-sidecar:latest'
          imagePullPolicy: Always
          name: kubectl
          ports:
            - containerPort: 8443
              name: k8s-api
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: spark
      serviceAccountName: spark
      terminationGracePeriodSeconds: 30
      volumes:
        - emptyDir: {}
          name: livy-log
        - emptyDir: {}
          name: livy-sessions
        - configMap:
            defaultMode: 420
            items:
              - key: livy.conf
                path: livy.conf
            name: livy-config
          name: livy-config
        - configMap:
            defaultMode: 420
            items:
              - key: spark-defaults.conf
                path: spark-defaults.conf
            name: livy-config
          name: spark-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-config
data:
  livy.conf: |-
    livy.spark.deploy-mode=cluster
    livy.file.local-dir-whitelist=/opt/.livy-sessions/
    livy.spark.master=k8s://http://localhost:8443
    livy.server.session.state-retain.sec = 8h
  spark-defaults.conf: 'spark.kubernetes.container.image "gnut3ll4/spark:v1.0.14"'
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: livy
  name: livy
spec:
  ports:
    - name: livy-rest
      port: 8998
      protocol: TCP
      targetPort: 8998
  selector:
    component: livy
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: livy
  name: livy
spec:
  host: {livy-url}
  port:
    targetPort: livy-rest
  to:
    kind: Service
    name: livy
    weight: 100
  wildcardPolicy: None
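Assuming the manifests above are saved to a livy.yaml file (the file name is just a placeholder), they can be applied and the result checked like this:
oc apply -f livy.yaml -n {project}
oc get pods -l component=livy -n {project}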
After applying it and successfully launching the pod, the Livy graphical interface becomes available at http://{livy-url}/ui. With Livy, we can publish our Spark task using a REST request, for example from Postman. An example collection of requests is shown below (configuration arguments with the variables needed by the launched task can be passed in the 'args' array):
{
  "info": {
    "_postman_id": "be135198-d2ff-47b6-a33e-0d27b9dba4c8",
    "name": "Spark Livy",
    "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
  },
  "item": [
    {
      "name": "1 Submit job with jar",
      "request": {
        "method": "POST",
        "header": [
          {
            "key": "Content-Type",
            "value": "application/json"
          }
        ],
        "body": {
          "mode": "raw",
          "raw": "{\n\t\"file\": \"local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.5.jar\", \n\t\"className\": \"org.apache.spark.examples.SparkPi\",\n\t\"numExecutors\":1,\n\t\"name\": \"spark-test-1\",\n\t\"conf\": {\n\t\t\"spark.jars.ivy\": \"/tmp/.ivy\",\n\t\t\"spark.kubernetes.authenticate.driver.serviceAccountName\": \"spark\",\n\t\t\"spark.kubernetes.namespace\": \"{project}\",\n\t\t\"spark.kubernetes.container.image\": \"{docker-registry-url}/{repo}/{image-name}:{tag}\"\n\t}\n}"
        },
        "url": {
          "raw": "http://{livy-url}/batches",
          "protocol": "http",
          "host": [
            "{livy-url}"
          ],
          "path": [
            "batches"
          ]
        }
      },
      "response": []
    },
    {
      "name": "2 Submit job without jar",
      "request": {
        "method": "POST",
        "header": [
          {
            "key": "Content-Type",
            "value": "application/json"
          }
        ],
        "body": {
          "mode": "raw",
          "raw": "{\n\t\"file\": \"hdfs://{host}:{port}/{path-to-file-on-hdfs}\", \n\t\"className\": \"{class-name}\",\n\t\"numExecutors\":1,\n\t\"name\": \"spark-test-2\",\n\t\"proxyUser\": \"0\",\n\t\"conf\": {\n\t\t\"spark.jars.ivy\": \"/tmp/.ivy\",\n\t\t\"spark.kubernetes.authenticate.driver.serviceAccountName\": \"spark\",\n\t\t\"spark.kubernetes.namespace\": \"{project}\",\n\t\t\"spark.kubernetes.container.image\": \"{docker-registry-url}/{repo}/{image-name}:{tag}\"\n\t},\n\t\"args\": [\n\t\t\"HADOOP_CONF_DIR=/opt/spark/hadoop-conf\",\n\t\t\"MASTER=k8s://https://kubernetes.default.svc:8443\"\n\t]\n}"
        },
        "url": {
          "raw": "http://{livy-url}/batches",
          "protocol": "http",
          "host": [
            "{livy-url}"
          ],
          "path": [
            "batches"
          ]
        }
      },
      "response": []
    }
  ],
  "event": [
    {
      "listen": "prerequest",
      "script": {
        "id": "41bea1d0-278c-40c9-ad42-bf2e6268897d",
        "type": "text/javascript",
        "exec": [
          ""
        ]
      }
    },
    {
      "listen": "test",
      "script": {
        "id": "3cdd7736-a885-4a2d-9668-bd75798f4560",
        "type": "text/javascript",
        "exec": [
          ""
        ]
      }
    }
  ],
  "protocolProfileBehavior": {}
}
Let's execute the first request from the collection, go to the OKD interface and check that the task launched successfully: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods. At the same time, a session appears in the Livy interface (http://{livy-url}/ui), within which the task's progress can be tracked and the session logs studied using the Livy API or the graphical interface.
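The session logs are also available over the REST API without the graphical interface (here {batch-id} is the id returned when the batch was submitted):
curl -s "http://{livy-url}/batches/{batch-id}/log"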
Now let's show how Livy works. To do this, examine the logs of the Livy container inside the Livy server pod: https://{OKD-WEBUI-URL}/console/project/{project}/browse/pods/{livy-pod-name}?tab=logs (here {livy-pod-name} is the name of the pod with the Livy server). They show that calling the Livy REST API in the container named "livy" executes a spark-submit similar to the one we used above. The collection also contains a second request for running tasks whose Spark executable is hosted remotely, using a Livy server.
Third use case - Spark Operator
Now that the task has been tested, the question arises of running it regularly. The native way to run tasks periodically in a Kubernetes cluster is the CronJob entity, and it can be used, but at the moment operators are very popular for managing applications in Kubernetes, and for Spark there is a fairly mature operator that is also used in enterprise-level solutions (e.g. Lightbend FastData Platform). We recommend using it: the current stable version of Spark (2.4.5) has rather limited configuration options for running Spark tasks on Kubernetes, while the next major version (3.0.0) declares full Kubernetes support but its release date remains unknown. Spark Operator compensates for this shortcoming by adding important configuration options (e.g. mounting a ConfigMap with the Hadoop access configuration into the Spark pods) and the ability to run tasks on a regular schedule.
Let's highlight it as the third use case: regularly running Spark tasks on a Kubernetes cluster in a production loop.
Spark Operator is open source and is developed within Google Cloud Platform (github.com/GoogleCloudPlatform/spark-on-k8s-operator). It can be installed in three ways:
- as part of the Lightbend FastData Platform/Cloudflow installation;
- using Helm:
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install incubator/sparkoperator --namespace spark-operator
- using manifests from the official repository (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/manifest).
Note the following: Cloudflow ships an operator with API version v1beta1. If this type of installation is used, the Spark application manifest descriptions should be based on the example tags in Git with the appropriate API version, e.g. "v1beta1-0.9.0-2.4.0". The operator's version can be found in the description of the CRD bundled with the operator, in its "versions" dictionary:
oc get crd sparkapplications.sparkoperator.k8s.io -o yaml
If the operator is installed correctly, the corresponding project will contain an active pod with the Spark operator (e.g. cloudflow-fdp-sparkoperator in the Cloudflow space for a Cloudflow installation), and a corresponding Kubernetes resource type named "sparkapplications" will appear. The available Spark applications can be examined with the following command:
oc get sparkapplications -n {project}
To run a task with Spark Operator, three things are required:
- create a Docker image containing all the necessary libraries, configuration files and executables. In the target picture, this image is created at the CI/CD stage and tested on a test cluster;
- publish the Docker image to a registry accessible from the Kubernetes cluster;
- generate a manifest of type "SparkApplication" describing the task to be launched. Example manifests are available in the official repository (e.g. github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/v1beta1-0.9.0-2.4.0/examples/spark-pi.yaml). Note the following points about the manifest:
- the "apiVersion" dictionary must indicate the API version corresponding to the operator version;
- "metadata.namespace" must indicate the namespace in which the application will be launched;
- "spec.image" must contain the address of the created Docker image in an accessible registry;
- "spec.mainClass" must contain the Spark task class to be run when the process starts;
- "spec.mainApplicationFile" must contain the path to the executable jar file;
- "spec.sparkVersion" must indicate the version of Spark in use;
- "spec.driver.serviceAccount" must specify the service account within the corresponding Kubernetes namespace that will be used to run the application;
- "spec.executor" must indicate the amount of resources allocated to the application;
- "spec.volumeMounts" must specify the local directory in which the local Spark task files will be created.
An example manifest (here {spark-service-account} is a service account inside the Kubernetes cluster for running Spark tasks):
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: {spark-service-account}
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
This manifest specifies a service account for which, before publishing the manifest, you must create the role bindings granting the Spark application the access rights it needs to interact with the Kubernetes API (if necessary). In our case, the application needs the right to create pods. Let's create the required role binding:
oc adm policy add-role-to-user edit system:serviceaccount:{project}:{spark-service-account} -n {project}
It is also worth noting that this manifest spec may include a "hadoopConfigMap" parameter, which lets you specify a ConfigMap with the Hadoop configuration without first placing the corresponding file into the Docker image. The operator is also suitable for running tasks regularly: a "schedule" parameter can specify the schedule on which a task is run.
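As a sketch, a scheduled launch with a Hadoop ConfigMap could look roughly like this; the field names follow the spark-on-k8s-operator v1beta1 API ("ScheduledSparkApplication" with a "schedule" and a "template" holding an ordinary application spec), while the ConfigMap name and the schedule itself are placeholders:
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: {project}
spec:
  schedule: "@every 1h"
  concurrencyPolicy: Allow
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v2.4.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
    sparkVersion: "2.4.0"
    hadoopConfigMap: {hadoop-configmap-name}
    restartPolicy:
      type: Never
    driver:
      cores: 0.1
      coreLimit: "200m"
      memory: "512m"
      serviceAccount: {spark-service-account}
    executor:
      cores: 1
      instances: 1
      memory: "512m"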
After that, we save the manifest to the spark-pi.yaml file and apply it to the Kubernetes cluster:
oc apply -f spark-pi.yaml
This creates an object of type "sparkapplications":
oc get sparkapplications -n {project}
> NAME AGE
> spark-pi 22h
A pod with the application will be created, and its status will be shown in the created "sparkapplications". It can be viewed with the following command:
oc get sparkapplications spark-pi -o yaml -n {project}
On task completion, the pod moves to the "Completed" status, which is also updated in "sparkapplications". The application logs can be viewed in the browser or with the following command (here {sparkapplications-pod-name} is the name of the pod of the running task):
oc logs {sparkapplications-pod-name} -n {project}
Spark tasks can also be managed with the dedicated sparkctl utility. To install it, clone the repository with its source code, install Go and build the utility:
git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
tar -xzf go1.13.3.linux-amd64.tar.gz
sudo mv go /usr/local
mkdir $HOME/Projects
export GOROOT=/usr/local/go
export GOPATH=$HOME/Projects
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
go version
cd sparkctl
go build -o sparkctl
sudo mv sparkctl /usr/local/bin
Let's examine the list of running Spark tasks:
sparkctl list -n {project}
Let's create a description for a Spark task:
vi spark-app.yaml
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: {project}
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
Let's launch the described task using sparkctl:
sparkctl create spark-app.yaml -n {project}
Let's examine the list of running Spark tasks:
sparkctl list -n {project}
Let's examine the list of events of the launched Spark task:
sparkctl event spark-pi -n {project} -f
Let's examine the status of the running Spark task:
sparkctl status spark-pi -n {project}
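sparkctl can also stream the driver log of a task and delete an application that is no longer needed:
sparkctl log spark-pi -n {project}
sparkctl delete spark-pi -n {project}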
In conclusion, let's consider the drawbacks discovered in operating the current stable version of Spark (2.4.5) on Kubernetes:
- The first and, perhaps, main drawback is the lack of data locality. For all its shortcomings, YARN also had advantages, such as delivering code to the data (rather than data to the code). Thanks to it, Spark tasks were executed on the nodes where the data involved in the computation resided, which noticeably reduced the time spent delivering data over the network. When using Kubernetes, the data involved in a task has to be moved across the network. If it is large enough, task execution time can grow significantly, and a fairly large amount of disk space is required on the Spark task instances for temporary storage. This drawback can be mitigated by using specialized software that ensures data locality in Kubernetes (e.g. Alluxio), but this actually means storing a full copy of the data on the Kubernetes cluster nodes.
- The second important drawback is security. By default, security-related features for running Spark tasks are disabled, the use of Kerberos is not covered in the official documentation (although the corresponding options appeared in version 3.0.0 and will require additional work), and the Spark security documentation (https://spark.apache.org/docs/2.4.5/security.html) mentions only YARN, Mesos and standalone clusters as key stores. At the same time, the user under which a Spark task is launched cannot be specified directly: only the service account under which it will work is specified, and the user is chosen based on the configured security policies. In this regard, either the root user is used, which is unsafe in a production environment, or a user with a random UID, which is inconvenient when distributing data access rights (this can be solved by creating PodSecurityPolicies and linking them to the corresponding service accounts). At the moment, the solution is either to place all the necessary files directly into the Docker image, or to modify the Spark launch script to use the mechanism adopted in your organization for storing and retrieving secrets.
- The third drawback: running Spark jobs on Kubernetes is officially still experimental, and significant changes to the artifacts used (configuration files, Docker base images and launch scripts) are possible in the future. Indeed, while preparing this material we tested versions 2.3.0 and 2.4.5, and their behavior differed significantly.
Let's wait for updates: a fresh version of Spark (3.0.0) was recently released, which brought substantial changes to Spark's work on Kubernetes but kept the experimental status of support for this resource manager. Perhaps the next updates will really make it possible to fully recommend abandoning YARN and running Spark tasks on Kubernetes, without fearing for your system's security and without having to patch functional components on your own.
The end.
Source: habr.com