Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื”ืขืœื, ื”ืื‘ืจ! ื”ื™ื™ึทื ื˜ ืžื™ืจ ื•ื•ืขืœืŸ ื‘ื•ื™ืขืŸ ืึท ืกื™ืกื˜ืขื ื•ื•ืึธืก ื•ื•ืขื˜ ืคึผืจืึธืฆืขืก Apache Kafka ืึธื ื–ืึธื’ ืกื˜ืจื™ืžื– ื ื™ืฆืŸ Spark Streaming ืื•ืŸ ืฉืจื™ื™ึทื‘ืŸ ื“ื™ ืคึผืจืึทืกืขืกื™ื ื’ ืจืขื–ื•ืœื˜ืึทื˜ืŸ ืฆื• ื“ื™ AWS RDS ื•ื•ืึธืœืงืŸ ื“ืึทื˜ืึทื‘ื™ื™ืก.

ื–ืืœ ืก ื™ืžืึทื“ื–ืฉืึทืŸ ืึทื– ืึท ื–ื™ื›ืขืจ ืงืจืขื“ื™ื˜ ื™ื ืกื˜ื™ื˜ื•ืฉืึทืŸ ืฉื˜ืขืœื˜ ืื•ื ื“ื– ื“ื™ ืึทืจื‘ืขื˜ ืคื•ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ื™ื ืงืึทืžื™ื ื’ ื˜ืจืึทื ื–ืึทืงืฉืึทื ื– "ืื•ื™ืฃ ื“ื™ ืคืœื™ืขืŸ" ืึทืจื™ื‘ืขืจ ืึทืœืข ืคื•ืŸ โ€‹โ€‹ื–ื™ื™ึทืŸ ืฆื•ื•ื™ื™ื’ืŸ. ื“ืขื ืงืขื ืขืŸ ื–ื™ื™ืŸ ื’ืขื˜ืืŸ ืคึฟืึทืจ ื“ื™ ืฆื™ืœ ืคื•ืŸ ืคึผื•ื ืงื˜ ืงืึทืœืงื™ืึทืœื™ื™ื˜ื™ื ื’ ืึท ืขืคืขื ืขืŸ ืงืจืึทื ื˜ืงื™ื™ึทื˜ ืฉื˜ืขืœืข ืคึฟืึทืจ ื“ื™ ืฉืึทืฆืงืึทืžืขืจ, ืœื™ืžืึทืฅ ืึธื“ืขืจ ืคื™ื ืึทื ืฆื™ืขืœ ืจืขื–ื•ืœื˜ืึทื˜ืŸ ืคึฟืึทืจ ื˜ืจืึทื ื–ืึทืงืฉืึทื ื–, ืืื–"ื• ื•.

ื•ื•ื™ ืฆื• ื™ื ืกื˜ืจื•ืžืขื ื˜ ื“ืขื ืคืึทืœ ืึธืŸ ื“ื™ ื ื•ืฆืŸ ืคื•ืŸ ืžืึทื’ื™ืฉ ืื•ืŸ ืžืึทื’ื™ืฉ ืกืคึผืขืœื– - ืœื™ื™ืขื ืขืŸ ืื•ื ื˜ืขืจ ื“ื™ ืฉื ื™ื™ึทื“ืŸ! ื’ื™ื™!

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming
(ื‘ื™ืœื“ ืžืงื•ืจ)

ื”ืงื“ืžื”

ืคื•ืŸ ืงื•ืจืก, ืคึผืจืึทืกืขืกื™ื ื’ ืึท ื’ืจื•ื™ืก ืกื•ืžืข ืคื•ืŸ โ€‹โ€‹ื“ืึทื˜ืŸ ืื™ืŸ ืคืึทืงื˜ื™ืฉ ืฆื™ื™ื˜ ื’ื™ื˜ ื’ืขื ื•ื’ื™ืง ืึทืคึผืขืจื˜ื•ื ืึทื˜ื™ื– ืคึฟืึทืจ ื ื•ืฆืŸ ืื™ืŸ ืžืึธื“ืขืจืŸ ืกื™ืกื˜ืขืžืขืŸ. ืื™ื™ื ืขืจ ืคื•ืŸ ื“ื™ ืžืขืจืกื˜ ืคืึธืœืงืก ืงืึทืžื‘ืึทื ื™ื™ืฉืึทื ื– ืคึฟืึทืจ ื“ืขื ืื™ื– ื“ื™ ื˜ืึทื ื“ืึทื ืคื•ืŸ Apache Kafka ืื•ืŸ Spark Streaming, ื•ื•ื• Kafka ืงืจื™ื™ื™ืฅ ืึท ื˜ื™ื™ึทืš ืคื•ืŸ ื™ื ืงืึทืžื™ื ื’ ืึธื ื–ืึธื’ ืคึผืึทืงื™ืฅ, ืื•ืŸ Spark Streaming ืคึผืจืึทืกืขืกืึทื– ื“ื™ ืคึผืึทืงื™ืฅ ืื™ืŸ ืึท ื’ืขื’ืขื‘ืŸ ืฆื™ื™ื˜ ืžืขื”ืึทืœืขืš.

ืฆื• ืคืึทืจื’ืจืขืกืขืจืŸ ื“ื™ ืฉื•ืœื“ ื˜ืึธืœืขืจืึทื ืฅ ืคื•ืŸ ื“ื™ ืึทืคึผืœืึทืงื™ื™ืฉืึทืŸ, ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื˜ืฉืขืงืคึผื•ื™ื ืฅ. ืžื™ื˜ ื“ืขื ืžืขืงืึทื ื™ื–ืึทื, ื•ื•ืขืŸ ื“ื™ Spark Streaming ืžืึธื˜ืึธืจ ื“ืึทืจืฃ ืฆื•ืจื™ืงืงืจื™ื’ืŸ ืคืึทืจืคืึทืœืŸ ื“ืึทื˜ืŸ, ืขืก ื ืึธืจ ื“ืึทืจืฃ ืฆื• ื’ื™ื™ืŸ ืฆื•ืจื™ืง ืฆื• ื“ื™ ืœืขืฆื˜ืข ื˜ืฉืขืงืคึผื•ื™ื ื˜ ืื•ืŸ ื ืขืžืขื  ื–ื™ื› ื•ื•ื™ื“ืขืจ ื—ืฉื‘ื•ื ื•ืช ืคื•ืŸ ื“ืึธืจื˜.

ืึทืจืงืึทื˜ืขืงื˜ืฉืขืจ ืคื•ืŸ ื“ื™ ื“ืขื•ื•ืขืœืึธืคึผืขื“ ืกื™ืกื˜ืขื

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื‘ืื ื•ืฆื˜ ืงืึทืžืคึผืึธื•ื ืึทื ืฅ:

  • ืึทืคึผืึทื˜ืฉื™ ืงืึทืคืงืึท ืื™ื– ืึท ืคื•ื ืื ื“ืขืจื’ืขื˜ื™ื™ืœื˜ ืึทืจื•ื™ืกื’ืขื‘ืŸ-ืึทื‘ืึธื ื™ืจืŸ ืžืขืกื™ื“ื–ืฉื™ื ื’ ืกื™ืกื˜ืขื. ืคึผืึทืกื™ืง ืคึฟืึทืจ ื‘ื™ื™ื“ืข ืึธืคืคืœื™ื ืข ืื•ืŸ ืึธื ืœื™ื™ืŸ ืึธื ื–ืึธื’ ืงืึทื ืกืึทืžืฉืึทืŸ. ืฆื• ืคืึทืจืžื™ื™ึทื“ืŸ ื“ืึทื˜ืŸ ืึธื ื•ื•ืขืจ, Kafka ืึทืจื˜ื™ืงืœืขืŸ ื–ืขื ืขืŸ ืกื˜ืึธืจื“ ืื•ื™ืฃ ื“ื™ืกืง ืื•ืŸ ืจืขืคึผืœื™ืงื™ื™ื˜ื™ื“ ืื™ืŸ ื“ืขื ืงื ื•ื™ืœ. ื“ื™ ืงืึทืคืงืึท ืกื™ืกื˜ืขื ืื™ื– ื’ืขื‘ื•ื™ื˜ ืื•ื™ืฃ ืฉืคึผื™ืฅ ืคื•ืŸ ื“ื™ ZooKeeper ืกื™ื ื’ืงืจืึทื ืึทื–ื™ื™ืฉืึทืŸ ื“ื™ื ืกื˜;
  • ืึทืคึผืึทื˜ืฉื™ ืกืคึผืึทืจืง ืกื˜ืจื™ืžื™ื ื’ - ืกืคึผืึทืจืง ืงืึธืžืคึผืึธื ืขื ื˜ ืคึฟืึทืจ ืคึผืจืึทืกืขืกื™ื ื’ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ. ื“ื™ Spark Streaming ืžืึธื“ื•ืœืข ืื™ื– ื’ืขื‘ื•ื™ื˜ ืžื™ื˜ ืึท ืžื™ืงืจืึธ-ืคึผืขืงืœ ืึทืจืงืึทื˜ืขืงื˜ืฉืขืจ, ื•ื•ื• ื“ื™ ื“ืึทื˜ืŸ ื˜ื™ื™ึทืš ืื™ื– ื™ื ื˜ืขืจืคึผืจืึทื˜ืึทื“ ื•ื•ื™ ืึท ืงืขืกื™ื™ื“ืขืจื“ื™ืง ืกื™ืงื•ื•ืึทื ืก ืคื•ืŸ ืงืœื™ื™ืŸ ื“ืึทื˜ืŸ ืคึผืึทืงื™ืฅ. Spark Streaming ื ืขืžื˜ ื“ืึทื˜ืŸ ืคื•ืŸ ืคืึทืจืฉื™ื“ืขื ืข ืงื•ื•ืืœืŸ ืื•ืŸ ืงืึทืžื‘ื™ื™ื ื– ืขืก ืื™ืŸ ืงืœื™ื™ืŸ ืคึผืึทืงืึทื“ื–ืฉืึทื–. ื ื™ื• ืคึผืึทืงืึทื“ื–ืฉืึทื– ื–ืขื ืขืŸ ื‘ืืฉืืคืŸ ืžื™ื˜ ืจืขื’ื•ืœืขืจ ื™ื ื˜ืขืจื•ื•ืึทืœื–. ืื™ืŸ ื“ื™ ืึธื ื”ื™ื™ื‘ ืคื•ืŸ ื™ืขื“ืขืจ ืฆื™ื™ื˜ ืžืขื”ืึทืœืขืš, ืึท ื ื™ื™ึทืข ืคึผืึทืงืึทื˜ ืื™ื– ื‘ืืฉืืคืŸ, ืื•ืŸ ืงื™ื™ืŸ ื“ืึทื˜ืŸ ื‘ืืงื•ืžืขืŸ ื‘ืขืฉืึทืก ื“ืขื ืžืขื”ืึทืœืขืš ืื™ื– ืึทืจื™ื™ึทื ื’ืขืจืขื›ื ื˜ ืื™ืŸ ื“ื™ ืคึผืึทืงืึทื˜. ืื™ืŸ ื“ื™ ืกื•ืฃ ืคื•ืŸ ื“ื™ ืžืขื”ืึทืœืขืš, ืคึผืึทืงืึทื˜ ื’ืจืึธื•ื˜ ืกื˜ืึทืคึผืก. ื“ื™ ื’ืจื™ื™ืก ืคื•ืŸ ื“ืขื ื™ื ื˜ืขืจื•ื•ืึทืœ ืื™ื– ื‘ืืฉืœืืกืŸ ื“ื•ืจืš ืึท ืคึผืึทืจืึทืžืขื˜ืขืจ ื’ืขืจื•ืคืŸ ื“ื™ ืคึผืขืงืœ ืžืขื”ืึทืœืขืš;
  • ืึทืคึผืึทื˜ืฉื™ ืกืคึผืึทืจืง ืกืงืœ - ืงืึทืžื‘ื™ื™ื ื– ืจื™ืœื™ื™ืฉืึทื ืึทืœ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark ืคืึทื ื’ืงืฉืึทื ืึทืœ ืคึผืจืึธื’ืจืึทืžืžื™ื ื’. ืกื˜ืจืึทืงื˜ืฉืขืจื“ ื“ืึทื˜ืŸ ืžื™ื˜ืœ ื“ืึทื˜ืŸ ื•ื•ืึธืก ื”ืึธื‘ืŸ ืึท ืกื˜ืฉืขืžืึท, ื“ืึธืก ืื™ื–, ืึท ืื™ื™ืŸ ื’ืึทื ื’ ืคื•ืŸ ืคืขืœื“ืขืจ ืคึฟืึทืจ ืึทืœืข ืจืขืงืึธืจื“ืก. Spark SQL ืฉื˜ื™ืฆื˜ ืึทืจื™ื™ึทื ืฉืจื™ื™ึทื‘ ืคึฟื•ืŸ ืคืึทืจืฉื™ื“ืŸ ืกื˜ืจืึทืงื˜ืฉืขืจื“ ื“ืึทื˜ืŸ ืงื•ื•ืืœืŸ ืื•ืŸ, ื“ืึทื ืง ืฆื• ื“ื™ ืึทื•ื•ื™ื™ืœืึทื‘ื™ืœืึทื˜ื™ ืคื•ืŸ ืกื˜ืฉืขืžืึท ืื™ื ืคึฟืึธืจืžืึทืฆื™ืข, ืขืก ืงืขื ืขืŸ ื™ืคื™ืฉืึทื ื˜ืœื™ ืฆื•ืจื™ืงืงืจื™ื’ืŸ ื‘ืœื•ื™ื– ื“ื™ ืคืืจืœืื ื’ื˜ ืคืขืœื“ ืคื•ืŸ ืจืขืงืึธืจื“ืก, ืื•ืŸ ืื•ื™ืš ื’ื™ื˜ ื“ืึทื˜ืึทืคืจืึทืžืข ืึทืคึผื™ืก;
  • AWS RDS ืื™ื– ืึท ืœืขืคื™ืขืจืขืš ื‘ื™ืœื™ืง ื•ื•ืึธืœืงืŸ-ื‘ืื–ื™ืจื˜ ืจื™ืœื™ื™ืฉืึทื ืึทืœ ื“ืึทื˜ืึทื‘ื™ื™ืก, ื•ื•ืขื‘ ืกืขืจื•ื•ื™ืก ื•ื•ืึธืก ืกื™ืžืคึผืœืึทืคื™ื™ื– ืกืขื˜ืึทืคึผ, ืึธืคึผืขืจืึทืฆื™ืข ืื•ืŸ ืกืงื™ื™ืœื™ื ื’, ืื•ืŸ ืื™ื– ืึทื“ืžื™ื ืึทืกื˜ืขืจื“ ื’ืœื™ื™ึทืš ื“ื•ืจืš ืึทืžืึทื–ืึธืŸ.

ื™ื ืกื˜ืึทืœื™ืจืŸ ืื•ืŸ ืœื•ื™ืคืŸ ื“ื™ Kafka ืกืขืจื•ื•ืขืจ

ืื™ื™ื“ืขืจ ืื™ืจ ื ื•ืฆืŸ Kafka ื’ืœื™ื™ืš, ืื™ืจ ื“ืึทืจืคึฟืŸ ืฆื• ืžืึทื›ืŸ ื–ื™ื›ืขืจ ืึทื– ืื™ืจ ื”ืึธื‘ืŸ Java, ื•ื•ื™ื™ึทืœ ... JVM ืื™ื– ื’ืขื ื™ืฆื˜ ืคึฟืึทืจ ืึทืจื‘ืขื˜:

sudo apt-get update 
sudo apt-get install default-jre
java -version

ืœืึธืžื™ืจ ืฉืึทืคึฟืŸ ืึท ื ื™ื™ึทืข ื‘ืึทื ื™ืฆืขืจ ืฆื• ืึทืจื‘ืขื˜ืŸ ืžื™ื˜ Kafka:

sudo useradd kafka -m
sudo passwd kafka
sudo adduser kafka sudo

ื“ืขืจื ืึธืš ืืจืืคืงืืคื™ืข ื“ื™ ืคืึทืจืฉืคึผืจื™ื™ื˜ื•ื ื’ ืคึฟื•ืŸ ื“ืขืจ ื‘ืึทืึทืžื˜ืขืจ Apache Kafka ื•ื•ืขื‘ื–ื™ื™ื˜ืœ:

wget -P /YOUR_PATH "http://apache-mirror.rbc.ru/pub/apache/kafka/2.2.0/kafka_2.12-2.2.0.tgz"

ื•ืคืฉืœื™ืกืŸ ื“ื™ ื“ืึทื•ื ืœืึธื•ื“ื™ื“ ืึทืจืงื™ื™ื•ื•:

tar -xvzf /YOUR_PATH/kafka_2.12-2.2.0.tgz
ln -s /YOUR_PATH/kafka_2.12-2.2.0 kafka

ื“ืขืจ ื•ื•ื™ื™ึทื˜ืขืจ ืฉืจื™ื˜ ืื™ื– ืึทืคึผืฉืึทื ืึทืœ. ื“ืขืจ ืคืึทืงื˜ ืื™ื– ืึทื– ื“ื™ ืคืขืœื™ืงื™ื™ึทื˜ ืกืขื˜ื˜ื™ื ื’ืก ื˜ืึธืŸ ื ื™ื˜ ืœืึธื–ืŸ ืื™ืจ ืฆื• ื ื•ืฆืŸ ืึทืœืข ื“ื™ ืงื™ื™ืคึผืึทื‘ื™ืœืึทื˜ื™ื– ืคื•ืŸ Apache Kafka. ืคึฟืึทืจ ื‘ื™ื™ึทืฉืคึผื™ืœ, ื•ื™ืกืžืขืงืŸ ืึท ื˜ืขืžืข, ืงืึทื˜ืขื’ืึธืจื™ืข, ื’ืจื•ืคึผืข ืฆื• ื•ื•ืึธืก ืึทืจื˜ื™ืงืœืขืŸ ืงืขื ืขืŸ ื–ื™ื™ืŸ ืืจื•ื™ืก. ืฆื• ื˜ื•ื™ืฉืŸ ื“ืขื, ืœืึธื–ืŸ ืื•ื ื“ื– ืจืขื“ืึทื’ื™ืจืŸ ื“ื™ ืงืึทื ืคื™ื’ื™ืขืจื™ื™ืฉืึทืŸ ื˜ืขืงืข:

vim ~/kafka/config/server.properties

ืœื™ื™ื’ ื“ื™ ืคืืœื’ืขื ื“ืข ืฆื• ื“ื™ ืกื•ืฃ ืคื•ืŸ ื“ืขืจ ื˜ืขืงืข:

delete.topic.enable = true

ืื™ื™ื“ืขืจ ืื™ืจ ืึธื ื”ื™ื™ื‘ืŸ ื“ื™ Kafka ืกืขืจื•ื•ืขืจ, ืื™ืจ ื“ืึทืจืคึฟืŸ ืฆื• ืึธื ื”ื™ื™ื‘ืŸ ื“ืขื ZooKeeper ืกืขืจื•ื•ืขืจ; ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื“ื™ ืึทื’ื–ื™ืœื™ืขืจื™ ืฉืจื™ืคื˜ ื•ื•ืึธืก ืงื•ืžื˜ ืžื™ื˜ ื“ื™ Kafka ืคืึทืจืฉืคึผืจื™ื™ื˜ื•ื ื’:

Cd ~/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties

ื ืึธืš ื“ืขืจ ื”ืฆืœื—ื” ืคื•ืŸ ZooKeeper, ืงืึทื˜ืขืจ ื“ื™ Kafka ืกืขืจื•ื•ืขืจ ืื™ืŸ ืึท ื‘ืึทื–ื•ื ื“ืขืจ ื•ื•ืึธืงื–ืึทืœ:

bin/kafka-server-start.sh config/server.properties

ืœืึธืžื™ืจ ืฉืึทืคึฟืŸ ืึท ื ื™ื™ึทืข ื˜ืขืžืข ื’ืขืจื•ืคืŸ ื˜ืจืึทื ืกืึทืงื˜ื™ืึธืŸ:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic transaction

ืœืึธืžื™ืจ ืžืึทื›ืŸ ื–ื™ื›ืขืจ ืึทื– ืึท ื˜ืขืžืข ืžื™ื˜ ื“ื™ ืคืืจืœืื ื’ื˜ ื ื•ืžืขืจ ืคื•ืŸ ืคึผืึทืจื˜ื™ืฉืึทื ื– ืื•ืŸ ืจืขืคึผืœืึทืงื™ื™ืฉืึทืŸ ืื™ื– ื‘ืืฉืืคืŸ:

bin/kafka-topics.sh --describe --zookeeper localhost:2181

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื–ืืœ ืก ืคืึทืจืคื™ืจืŸ ื“ื™ ืžืึธื•ืžืึทื ืฅ ืคื•ืŸ ื˜ืขืกื˜ื™ื ื’ ื“ื™ ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ืื•ืŸ ืงืึทื ืกื•ืžืขืจ ืคึฟืึทืจ ื“ื™ ื ื™ื™ ื‘ืืฉืืคืŸ ื˜ืขืžืข. ืžืขืจ ื“ืขื˜ืึทื™ืœืก ื•ื•ืขื’ืŸ ื•ื•ื™ ืื™ืจ ืงืขื ืขืŸ ืคึผืจื•ื‘ื™ืจืŸ ืฉื™ืงื˜ ืื•ืŸ ืจื™ืกื™ื•ื•ื™ื ื’ ืึทืจื˜ื™ืงืœืขืŸ ื–ืขื ืขืŸ ื’ืขืฉืจื™ื‘ืŸ ืื™ืŸ ื“ืขืจ ื‘ืึทืึทืžื˜ืขืจ ื“ืึทืงื™ื•ืžืขื ื˜ื™ื™ืฉืึทืŸ - ืฉื™ืงืŸ ืขื˜ืœืขื›ืข ืึทืจื˜ื™ืงืœืขืŸ. ื ื•, ืžื™ืจ ืคืึธืจื–ืขืฆืŸ ืฆื• ืฉืจื™ื™ื‘ืŸ ืึท ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ืื™ืŸ ืคึผื™ื˜ื”ืึธืŸ ื ื™ืฆืŸ ื“ื™ KafkaProducer API.

ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ืฉืจื™ื™ื‘ืŸ

ื“ืขืจ ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ื•ื•ืขื˜ ื“ื–ืฉืขื ืขืจื™ื™ื˜ ื˜ืจืึทืค - 100 ืึทืจื˜ื™ืงืœืขืŸ ื™ืขื“ืขืจ ืจื’ืข. ืžื™ื˜ ืจืึทื ื“ืึธื ื“ืึทื˜ืŸ ืžื™ืจ ืžื™ื™ื ืขืŸ ืึท ื•ื•ืขืจื˜ืขืจื‘ื•ืš ื•ื•ืึธืก ื‘ืืฉื˜ื™ื™ื˜ ืคื•ืŸ ื“ืจื™ื™ ืคืขืœื“ืขืจ:

  • ืฆื•ื•ื™ื™ึทื’ - ื ืึธืžืขืŸ ืคื•ืŸ ื“ื™ ืงืจืขื“ื™ื˜ ื™ื ืกื˜ื™ื˜ื•ืฉืึทืŸ ืก ืคื•ื ื˜ ืคื•ืŸ ืคืึทืจืงื•ื™ืฃ;
  • ืงืจืึทื ื˜ืงื™ื™ึทื˜ - ื˜ืจืึทื ืกืึทืงื˜ื™ืึธืŸ ืงืจืึทื ื˜ืงื™ื™ึทื˜;
  • ืกื•ืžืข - ื˜ืจืึทื ืกืึทืงื˜ื™ืึธืŸ ืกื•ืžืข. ื“ื™ ืกื•ืžืข ื•ื•ืขื˜ ื–ื™ื™ืŸ ืึท positive ื ื•ืžืขืจ ืื•ื™ื‘ ืขืก ืื™ื– ืึท ืงื•ื™ืคืŸ ืคื•ืŸ ืงืจืึทื ื˜ืงื™ื™ึทื˜ ื“ื•ืจืš ื“ื™ ื‘ืึทื ืง, ืื•ืŸ ืึท ื ืขื’ืึทื˜ื™ื•ื• ื ื•ืžืขืจ ืื•ื™ื‘ ืขืก ืื™ื– ืึท ืคืึทืจืงื•ื™ืฃ.

ื“ืขืจ ืงืึธื“ ืคึฟืึทืจ ื“ื™ ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ืงื•ืงื˜ ื•ื•ื™ ื“ืึธืก:

from numpy.random import choice, randint

def get_random_value():
    new_dict = {}

    branch_list = ['Kazan', 'SPB', 'Novosibirsk', 'Surgut']
    currency_list = ['RUB', 'USD', 'EUR', 'GBP']

    new_dict['branch'] = choice(branch_list)
    new_dict['currency'] = choice(currency_list)
    new_dict['amount'] = randint(-100, 100)

    return new_dict

ื“ืขืจื ืึธืš, ืžื™ื˜ ื“ื™ ืฉื™ืงืŸ ืื•ืคึฟืŸ, ืžื™ืจ ืฉื™ืงืŸ ืึท ืึธื ื–ืึธื’ ืฆื• ื“ื™ ืกืขืจื•ื•ืขืจ, ืฆื• ื“ืขืจ ื˜ืขืžืข ื•ื•ืึธืก ืžื™ืจ ื“ืึทืจืคึฟืŸ, ืื™ืŸ JSON ืคึฟืึธืจืžืึทื˜:

from kafka import KafkaProducer    

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                             value_serializer=lambda x:dumps(x).encode('utf-8'),
                             compression_type='gzip')
my_topic = 'transaction'
data = get_random_value()

try:
    future = producer.send(topic = my_topic, value = data)
    record_metadata = future.get(timeout=10)
    
    print('--> The message has been sent to a topic: 
            {}, partition: {}, offset: {}' 
            .format(record_metadata.topic,
                record_metadata.partition,
                record_metadata.offset ))   
                             
except Exception as e:
    print('--> It seems an Error occurred: {}'.format(e))

finally:
    producer.flush()

ื•ื•ืขืŸ ืคืœื™ืกื ื“ื™ืง ื“ืขื ืฉืจื™ืคื˜, ืžื™ืจ ื‘ืึทืงื•ืžืขืŸ ื“ื™ ืคืืœื’ืขื ื“ืข ืึทืจื˜ื™ืงืœืขืŸ ืื™ืŸ ื“ื™ ื•ื•ืึธืงื–ืึทืœ:

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื“ืึธืก ืžื™ื™ื ื˜ ืึทื– ืึทืœืฅ ืึทืจื‘ืขื˜ ื•ื•ื™ ืžื™ืจ ื’ืขื•ื•ืืœื˜ - ื“ืขืจ ืคึผืจืึธื“ื•ืฆื™ืจืขืจ ื“ื–ืฉืขื ืขืจื™ื™ืฅ ืื•ืŸ ืกืขื ื“ื– ืึทืจื˜ื™ืงืœืขืŸ ืฆื• ื“ื™ ื˜ืขืžืข ื•ื•ืึธืก ืžื™ืจ ื“ืึทืจืคึฟืŸ.
ื“ืขืจ ื•ื•ื™ื™ึทื˜ืขืจ ืฉืจื™ื˜ ืื™ื– ืฆื• ื™ื ืกื˜ืึทืœื™ืจืŸ ืกืคึผืึทืจืง ืื•ืŸ ืคึผืจืึธืฆืขืก ื“ืขื ืึธื ื–ืึธื’ ื˜ื™ื™ึทืš.

ื™ื ืกื˜ืึทืœื™ืจืŸ Apache Spark

ืึทืคึผืึทื˜ืฉื™ ืกืคึผืึทืจืง ืื™ื– ืึท ื•ื ื™ื•ื•ืขืจืกืึทืœ ืื•ืŸ ื”ื•ื™ืš-ืคืึธืจืฉื˜ืขืœื•ื ื’ ืงื ื•ื™ืœ ืงืึทืžืคึผื™ื•ื˜ื™ื ื’ ืคึผืœืึทื˜ืคืึธืจืžืข.

ืกืคึผืึทืจืง ืคึผืขืจืคืึธืจืžื– ื‘ืขืกืขืจ ื•ื•ื™ ืคืึธืœืงืก ื™ืžืคึผืœืึทืžืึทื ืฅ ืคื•ืŸ ื“ื™ MapReduce ืžืึธื“ืขืœ, ื‘ืฉืขืช ืขืก ืฉื˜ื™ืฆื˜ ืึท ื‘ืจื™ื™ื˜ ืงื™ื™ื˜ ืคื•ืŸ ืงืึทืžืคึผื™ื•ื˜ื™ื ื’ ื˜ื™ื™ืคึผืก, ืึทืจื™ื™ึทื ื’ืขืจืขื›ื ื˜ ื™ื ื˜ืขืจืึทืงื˜ื™ื•ื• ืคึฟืจืื’ืŸ ืื•ืŸ ืกื˜ืจื™ื ืคึผืจืึทืกืขืกื™ื ื’. ืกืคึผื™ื“ ืคื™ืขืกืขืก ืึท ื•ื•ื™ื›ื˜ื™ืง ืจืึธืœืข ื•ื•ืขืŸ ืคึผืจืึทืกืขืกื™ื ื’ ื’ืจื•ื™ืก ืึทืžืึทื•ื ืฅ ืคื•ืŸ ื“ืึทื˜ืŸ, ื•ื•ื™ื™ึทืœ ื“ืึธืก ืื™ื– ื“ื™ ื’ื™ื›ืงื™ื™ึทื˜ ื•ื•ืึธืก ืึทืœืึทื•ื– ืื™ืจ ืฆื• ืึทืจื‘ืขื˜ืŸ ื™ื ื˜ืขืจืึทืงื˜ื™ื•ื•ืœื™ ืึธืŸ ืกืคึผืขื ื“ื™ื ื’ ืžื™ื ื•ื˜ ืึธื“ืขืจ ืฉืขื” ื•ื•ืืจื˜ืŸ. ืื™ื™ื ืขืจ ืคื•ืŸ ืกืคึผืึทืจืง ืก ื‘ื™ื’ืึทืกื˜ ืกื˜ืจืขื ื’ืงื˜ืก ื•ื•ืึธืก ืžืื›ื˜ ืขืก ืึทื–ื•ื™ ืฉื ืขืœ ืื™ื– ื“ื™ ืคื™ื™ื™ืงื™ื™ื˜ ืฆื• ื“ื•ืจื›ืคื™ืจืŸ ื–ื™ืงืึธืจืŸ ื—ืฉื‘ื•ื ื•ืช.

ื“ืขืจ ืคืจื™ื™ืžื•ื•ืขืจืง ืื™ื– ื’ืขืฉืจื™ื‘ืŸ ืื™ืŸ ืกืงืึทืœืึท, ืึทื–ื•ื™ ืื™ืจ ื“ืึทืจืคึฟืŸ ืฆื• ื™ื ืกื˜ืึทืœื™ืจืŸ ืขืก ืขืจืฉื˜ืขืจ:

sudo apt-get install scala

ืืจืืคืงืืคื™ืข ื“ื™ Spark ืคืึทืจืฉืคึผืจื™ื™ื˜ื•ื ื’ ืคื•ืŸ ื“ืขืจ ื‘ืึทืึทืžื˜ืขืจ ื•ื•ืขื‘ื–ื™ื™ื˜ืœ:

wget "http://mirror.linux-ia64.org/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz"

ืึธืคึผืœื™ื™ื’ืŸ ื“ืขื ืึทืจืงื™ื™ื•ื•:

sudo tar xvf spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz -C /usr/local/spark

ืœื™ื™ื’ ื“ืขื ื“ืจืš ืฆื• Spark ืฆื• ื“ื™ ื‘ืึทืฉ ื˜ืขืงืข:

vim ~/.bashrc

ืœื™ื™ื’ ื“ื™ ืคืืœื’ืขื ื“ืข ืฉื•ืจื•ืช ื“ื•ืจืš ื“ืขื ืจืขื“ืึทืงื˜ืึธืจ:

SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH

ืœื•ื™ืคืŸ ื“ื™ ื‘ืึทืคึฟืขืœ ืื•ื ื˜ืŸ ื ืึธืš ืžืื›ืŸ ืขื ื“ืขืจื•ื ื’ืขืŸ ืฆื• bashrc:

source ~/.bashrc

ื“ื™ืคึผืœื™ื™ื™ื ื’ AWS PostgreSQL

ืึทืœืข ื•ื•ืึธืก ื‘ืœื™ื™ื‘ื˜ ืื™ื– ืฆื• ืฆืขื•ื•ื™ืงืœืขืŸ ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก ืื™ืŸ ื•ื•ืึธืก ืžื™ืจ ื•ื•ืขืœืŸ ืฆื•ืคึฟืขืœื™ืงืขืจ ื“ื™ ืคึผืจืึทืกืขืกื˜ ืื™ื ืคึฟืึธืจืžืึทืฆื™ืข ืคึฟื•ืŸ ื“ื™ ืกื˜ืจื™ืžื–. ืคึฟืึทืจ ื“ืขื ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื“ื™ AWS RDS ื“ื™ื ืกื˜.

ื’ื™ื™ืŸ ืฆื• ื“ื™ AWS ืงืึทื ืกืึธื•ืœ -> AWS RDS -> ื“ืึทื˜ืึทื‘ื™ื™ืกื™ื– -> ืฉืึทืคึฟืŸ ื“ืึทื˜ืึทื‘ื™ื™ืก:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืื•ื™ืกืงืœื™ื™ึทื‘ืŸ PostgreSQL ืื•ืŸ ื’ื™ื˜ ื•ื•ื™ื™ึทื˜ืขืจ:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื•ื•ื™ื™ึทืœ ื“ืขืจ ื‘ื™ื™ืฉืคึผื™ืœ ืื™ื– ื‘ืœื•ื™ื– ืคึฟืึทืจ ื‘ื™ืœื“ื•ื ื’ืงืจื™ื™ื– ืฆื•ื•ืขืงืŸ; ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ืึท ืคืจื™ื™ ืกืขืจื•ื•ืขืจ "ืื™ืŸ ืžื™ื ื™ืžื•ื" (ืคืจื™ื™ ืจื™ื™):
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื“ืขืจื ืึธืš, ืžื™ืจ ืฉื˜ืขืœืŸ ืึท ื˜ื™ืงืขืŸ ืื™ืŸ ื“ื™ Free Tier ื‘ืœืึธืง, ืื•ืŸ ื“ืขืจื ืึธืš ืžื™ืจ ื•ื•ืขื˜ ื–ื™ื™ืŸ ืื•ื™ื˜ืึธืžืึทื˜ื™ืฉ ื’ืขืคึฟื™ื ื˜ ืึท ื‘ื™ื™ึทืฉืคึผื™ืœ ืคื•ืŸ ื“ื™ t2.micro ืงืœืึทืก - ื›ืึธื˜ืฉ ืฉื•ื•ืึทืš, ืขืก ืื™ื– ืคืจื™ื™ ืื•ืŸ ื’ืึทื ืฅ ืคึผืึทืกื™ืง ืคึฟืึทืจ ืื•ื ื“ื–ืขืจ ืึทืจื‘ืขื˜:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื•ื•ื™ื™ึทื˜ืขืจ ืงื•ืžืขืŸ ื–ื™ื™ืขืจ ื•ื•ื™ื›ื˜ื™ืง ื–ืื›ืŸ: ื“ื™ ื ืึธืžืขืŸ ืคื•ืŸ ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก ื‘ื™ื™ึทืฉืคึผื™ืœ, ื“ื™ ื ืึธืžืขืŸ ืคื•ืŸ ื“ื™ ื‘ืขืœ ื‘ืึทื ื™ืฆืขืจ ืื•ืŸ ื–ื™ื™ืŸ ืคึผืึทืจืึธืœ. ื–ืืœ ืก ื ืึธืžืขืŸ ื“ืขื ื‘ื™ื™ึทืฉืคึผื™ืœ: myHabrTest, ื‘ืขืœ ื‘ืึทื ื™ืฆืขืจ: habr, ืคึผืึทืจืึธืœ: habr12345 ืื•ืŸ ื’ื™ื˜ ืื•ื™ืฃ ื“ื™ ื•ื•ื™ื™ึทื˜ืขืจ ืงื ืขืคึผืœ:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืื•ื™ืฃ ื“ืขืจ ื•ื•ื™ื™ึทื˜ืขืจ ื‘ืœืึทื˜ ืขืก ื–ืขื ืขืŸ ืคึผืึทืจืึทืžืขื˜ืขืจืก ืคืึทืจืึทื ื˜ื•ื•ืึธืจื˜ืœืขืš ืคึฟืึทืจ ื“ื™ ืึทืงืกืขืกืึทื‘ื™ืœื™ื˜ื™ ืคื•ืŸ ืื•ื ื“ื–ืขืจ ื“ืึทื˜ืึทื‘ื™ื™ืก ืกืขืจื•ื•ืขืจ ืคื•ืŸ ื“ื™ ืึทืจื•ื™ืก (ืฆื™ื‘ื•ืจ ืึทืงืกืขืกืึทื‘ื™ืœื™ื˜ื™) ืื•ืŸ ืคึผืึธืจื˜ ืึทื•ื•ื™ื™ืœืึทื‘ื™ืœืึทื˜ื™:

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืœืึธืžื™ืจ ืžืึทื›ืŸ ืึท ื ื™ื™ึทืข ื‘ืึทืฉื˜ืขื˜ื™ืงืŸ ืคึฟืึทืจ ื“ื™ VPC ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืข, ื•ื•ืึธืก ื•ื•ืขื˜ ืœืึธื–ืŸ ืคื•ื ื“ืจื•ื™ืกื ื“ื™ืง ืึทืงืกืขืก ืฆื• ืื•ื ื“ื–ืขืจ ื“ืึทื˜ืึทื‘ื™ื™ืก ืกืขืจื•ื•ืขืจ ื“ื•ืจืš ืคึผืึธืจื˜ 5432 (PostgreSQL).
ืœืึธืžื™ืจ ื’ื™ื™ืŸ ืฆื• ื“ื™ AWS ืงืึทื ืกืึธื•ืœ ืื™ืŸ ืึท ื‘ืึทื–ื•ื ื“ืขืจ ื‘ืœืขื˜ืขืจืขืจ ืคึฟืขื ืฆื˜ืขืจ ืฆื• ื“ื™ VPC ื“ืึทืฉื‘ืึธืจื“ -> ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืขืก -> ืฉืึทืคึฟืŸ ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืข ืึธืคึผื˜ื™ื™ืœื•ื ื’:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืžื™ืจ ืฉื˜ืขืœืŸ ื“ืขื ื ืึธืžืขืŸ ืคึฟืึทืจ ื“ื™ ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืข - PostgreSQL, ืึท ื‘ืึทืฉืจื™ื™ึทื‘ื•ื ื’, ืึธื ื•ื•ื™ื™ึทื–ืŸ ืžื™ื˜ ื•ื•ืึธืก ื•ื•ืคึผืง ื“ื™ ื’ืจื•ืคึผืข ื–ืึธืœ ื–ื™ื™ืŸ ืคืืจื‘ื•ื ื“ืŸ ืžื™ื˜ ืื•ืŸ ื’ื™ื˜ ื“ื™ ืฉืึทืคึฟืŸ ืงื ืขืคึผืœ:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืคึผืœืึธืžื‘ื™ืจืŸ ื“ื™ ื™ื ื‘ืึทื•ื ื“ ื›ึผืœืœื™ื ืคึฟืึทืจ ืคึผืึธืจื˜ 5432 ืคึฟืึทืจ ื“ื™ ื ื™ื™ ื‘ืืฉืืคืŸ ื’ืจื•ืคึผืข, ื•ื•ื™ ื’ืขื•ื•ื™ื–ืŸ ืื™ืŸ ื“ื™ ื‘ื™ืœื“ ืื•ื ื˜ืŸ. ืื™ืจ ืงืขื ื˜ ื ื™ืฉื˜ ืกืคึผืขืฆื™ืคื™ืฆื™ืจืŸ ื“ื™ ืคึผืึธืจื˜ ืžืึทื ื™ื•ืึทืœื™, ืึธื‘ืขืจ ืกืขืœืขืงื˜ื™ืจืŸ PostgreSQL ืคื•ืŸ ื“ื™ ื˜ื™ืคึผ ื“ืจืึธืคึผ-ืึทืจืึธืคึผ ืจืฉื™ืžื”.

ืฉื˜ืจืขื ื’ ื’ืขืจืขื“ื˜, ื“ื™ ื•ื•ืขืจื˜ :: / 0 ืžื™ื˜ืœ ื“ื™ ืึทื•ื•ื™ื™ืœืึทื‘ื™ืœืึทื˜ื™ ืคื•ืŸ ื™ื ืงืึทืžื™ื ื’ ืคืึทืจืงืขืจ ืฆื• ื“ื™ ืกืขืจื•ื•ืขืจ ืคื•ืŸ ืึทืœืข ืื™ื‘ืขืจ ื“ื™ ื•ื•ืขืœื˜, ื•ื•ืึธืก ืื™ื– ืงืึทื ืึทื ืึทืงืœื™ ื ื™ืฉื˜ ืœืขื’ืึทืžืจืข ืืžืช, ืึธื‘ืขืจ ืฆื• ืึทื ืึทืœื™ื™ื– ื“ื™ ื‘ื™ื™ืฉืคึผื™ืœ, ืœืึธื–ืŸ ืื•ื ื“ื– ืœืึธื–ืŸ ื–ื™ืš ืฆื• ื ื•ืฆืŸ ื“ืขื ืฆื•ื’ืึทื ื’:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืžื™ืจ ืฆื•ืจื™ืงืงื•ืžืขืŸ ืฆื• ื“ื™ ื‘ืœืขื˜ืขืจืขืจ ื‘ืœืึทื˜, ื•ื•ื• ืžื™ืจ ื”ืึธื‘ืŸ "ืงืึธื ืคื™ื’ื•ืจืข ืึทื•ื•ืึทื ืกื™ืจื˜ืข ืกืขื˜ื˜ื™ื ื’ืก" ืขืคืขื ืขืŸ ืื•ืŸ ืกืขืœืขืงื˜ื™ืจืŸ ืื™ืŸ ื“ื™ VPC ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืขืก ืึธืคึผื˜ื™ื™ืœื•ื ื’ -> ืงืœื™ื™ึทื‘ืŸ ื™ื’ื–ื™ืกื˜ื™ื ื’ VPC ื–ื™ื›ืขืจื”ื™ื™ื˜ ื’ืจื•ืคึผืขืก -> PostgreSQL:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื•ื•ื™ื™ึทื˜ืขืจ, ืื™ืŸ ื“ื™ ื“ืึทื˜ืึทื‘ืึทืกืข ืึธืคึผืฆื™ืขืก -> ื“ืึทื˜ืึทื‘ืึทืกืข ื ืึธืžืขืŸ -> ืฉื˜ืขืœืŸ ื“ืขื ื ืึธืžืขืŸ - habrDB.

ืžื™ืจ ืงืขื ืขืŸ ืœืึธื–ืŸ ื“ื™ ืจื•ืขืŸ ืคึผืึทืจืึทืžืขื˜ืขืจืก, ืžื™ื˜ ื“ื™ ื•ื™ืกื ืขื ืคื•ืŸ ื“ื™ืกื™ื™ื‘ืึทืœื™ื ื’ ื‘ืึทืงืึทืคึผ (ื‘ืึทื”ืึทืœื˜ืŸ ืจื™ื˜ืขื ืฉืึทืŸ ืฆื™ื™ึทื˜ - 0 ื˜ืขื’), ืžืึธื ื™ื˜ืึธืจื™ื ื’ ืื•ืŸ ืคืึธืจืฉื˜ืขืœื•ื ื’ ื™ื ืกื™ื™ืฅ, ื“ื•ืจืš ืคืขืœื™ืงื™ื™ึทื˜. ื“ืจื™ืงื˜ ื“ืขื ืงื ืขืคึผืœ ืฉืึทืคึฟืŸ ื“ื™ื™ื˜ืึทื‘ื™ื™ืก:
Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืคึฟืึธื“ืขื ื”ืึทื ื“ืœืขืจ

ื“ื™ ืœืขืฆื˜ืข ื‘ื™ื ืข ื•ื•ืขื˜ ื–ื™ื™ืŸ ื“ื™ ืึทื ื˜ื•ื•ื™ืงืœื•ื ื’ ืคื•ืŸ ืึท ืกืคึผืึทืจืง ืึทืจื‘ืขื˜, ื•ื•ืึธืก ื•ื•ืขื˜ ืคึผืจืึธืฆืขืก ื ื™ื™ึทืข ื“ืึทื˜ืŸ ื•ื•ืึธืก ืงื•ืžืขืŸ ืคึฟื•ืŸ Kafka ื™ืขื“ืขืจ ืฆื•ื•ื™ื™ ืกืขืงื•ื ื“ืขืก ืื•ืŸ ืึทืจื™ื™ึทืŸ ื“ื™ ืจืขื–ื•ืœื˜ืึทื˜ ืื™ืŸ ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก.

ื•ื•ื™ ื“ืขืจืžืื ื˜ ืื•ื™ื‘ืŸ, ื˜ืฉืขืงืคึผื•ื™ื ืฅ ื–ืขื ืขืŸ ืึท ื”ืึทืจืฅ ืžืขืงืึทื ื™ื–ืึทื ืื™ืŸ SparkStreaming ื•ื•ืึธืก ืžื•ื–ืŸ ื–ื™ื™ืŸ ืงืึทื ืคื™ื’ื™ืขืจื“ ืฆื• ืขื ืฉื•ืจ ืฉื•ืœื“ ื˜ืึธืœืขืจืึทื ืฅ. ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื˜ืฉืขืงืคึผื•ื™ื ืฅ ืื•ืŸ, ืื•ื™ื‘ ื“ื™ ืคึผืจืึธืฆืขื“ื•ืจ ืคื™ื™ืœื–, ื“ื™ Spark Streaming ืžืึธื“ื•ืœืข ื•ื•ืขื˜ ื ืึธืจ ื“ืึทืจืคึฟืŸ ืฆื• ืฆื•ืจื™ืงืงื•ืžืขืŸ ืฆื• ื“ื™ ืœืขืฆื˜ืข ื˜ืฉืขืงืคึผื•ื™ื ื˜ ืื•ืŸ ื ืขืžืขื  ื–ื™ื› ื•ื•ื™ื“ืขืจ ื—ืฉื‘ื•ื ื•ืช ื“ืขืจืคื•ืŸ ืฆื• ืฆื•ืจื™ืงืงืจื™ื’ืŸ ื“ื™ ืคืึทืจืคืึทืœืŸ ื“ืึทื˜ืŸ.

ื˜ืฉืขืงืคึผื•ื™ื ื˜ ืงืขื ืขืŸ ื–ื™ื™ืŸ ืขื ื™ื™ื‘ืึทืœื“ ื“ื•ืจืš ื‘ืึทืฉื˜ืขื˜ื™ืงืŸ ืึท ื•ื•ืขื’ื•ื•ื™ื™ึทื–ืขืจ ืื•ื™ืฃ ืึท ืฉื•ืœื“-ื˜ืึธืœืขืจืึทื ื˜, ืคืึทืจืœืึธื–ืœืขืš ื˜ืขืงืข ืกื™ืกื˜ืขื (ืึทื–ืึท ื•ื•ื™ HDFS, S3, ืืื–"ื• ื•) ืื™ืŸ ื•ื•ืึธืก ื“ื™ ื˜ืฉืขืงืคึผื•ื™ื ื˜ ืื™ื ืคึฟืึธืจืžืึทืฆื™ืข ื•ื•ืขื˜ ื–ื™ื™ืŸ ืกื˜ืึธืจื“. ื“ืึธืก ืื™ื– ื’ืขื˜ืืŸ ืžื™ื˜, ืœืžืฉืœ:

streamingContext.checkpoint(checkpointDirectory)

ืื™ืŸ ืื•ื ื“ื–ืขืจ ื‘ื™ื™ึทืฉืคึผื™ืœ, ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื“ื™ ืคืืœื’ืขื ื“ืข ืฆื•ื’ืึทื ื’, ื ื™ื™ืžืœื™ ืื•ื™ื‘ ื˜ืฉืขืงืคึผื•ื™ื ื˜ื“ื™ืจืขืงื˜ืึธืจื™ ื™ื’ื–ื™ืกืฅ, ื“ืขืจ ืงืึธื ื˜ืขืงืกื˜ ื•ื•ืขื˜ ื–ื™ื™ืŸ ืจื™ืงืจื™ื™ื™ื˜ื™ื“ ืคึฟื•ืŸ ื“ื™ ื˜ืฉืขืงืคึผื•ื™ื ื˜ ื“ืึทื˜ืŸ. ืื•ื™ื‘ ื“ืขืจ ื•ื•ืขื’ื•ื•ื™ื™ึทื–ืขืจ ื˜ื•ื˜ ื ื™ืฉื˜ ืขืงืกื™ืกื˜ื™ืจืŸ (ื“"ื” ืขืงืกืึทืงื™ื•ื˜ืึทื“ ืคึฟืึทืจ ื“ื™ ืขืจืฉื˜ืขืจ ืžืึธืœ), ื“ืขืžืึธืœื˜ functionToCreateContext ืื™ื– ื’ืขืจื•ืคืŸ ืฆื• ืฉืึทืคึฟืŸ ืึท ื ื™ื™ึทืข ืงืึธื ื˜ืขืงืกื˜ ืื•ืŸ ืงืึทื ืคื™ื’ื™ืขืจ DStreams:

from pyspark.streaming import StreamingContext

context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)

ืžื™ืจ ืžืึทื›ืŸ ืึท DirectStream ื›ื™ื™ืคืขืฅ ืฆื• ืคืึทืจื‘ื™ื ื“ืŸ ืฆื• ื“ื™ "ื˜ืจืึทื ื–ืึทืงืฉืึทืŸ" ื˜ืขืžืข ื ื™ืฆืŸ ื“ื™ createDirectStream ืื•ืคึฟืŸ ืคื•ืŸ ื“ื™ KafkaUtils ื‘ื™ื‘ืœื™ืึธื˜ืขืง:

from pyspark.streaming.kafka import KafkaUtils
    
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

broker_list = 'localhost:9092'
topic = 'transaction'

directKafkaStream = KafkaUtils.createDirectStream(ssc,
                                [topic],
                                {"metadata.broker.list": broker_list})

ืคึผืึทืจืกื™ื ื’ ื™ื ืงืึทืžื™ื ื’ ื“ืึทื˜ืŸ ืื™ืŸ JSON ืคึฟืึธืจืžืึทื˜:

rowRdd = rdd.map(lambda w: Row(branch=w['branch'],
                                       currency=w['currency'],
                                       amount=w['amount']))
                                       
testDataFrame = spark.createDataFrame(rowRdd)
testDataFrame.createOrReplaceTempView("treasury_stream")

ื ื™ืฆืŸ Spark SQL, ืžื™ืจ ืžืึทื›ืŸ ืึท ืคึผืฉื•ื˜ ื’ืจื•ืคึผื™ื ื’ ืื•ืŸ ื•ื•ื™ื™ึทื–ืŸ ื“ื™ ืจืขื–ื•ืœื˜ืึทื˜ ืื™ืŸ ื“ื™ ืงืึทื ืกืึธื•ืœ:

select 
    from_unixtime(unix_timestamp()) as curr_time,
    t.branch                        as branch_name,
    t.currency                      as currency_code,
    sum(amount)                     as batch_value
from treasury_stream t
group by
    t.branch,
    t.currency

ื‘ืึทืงื•ืžืขืŸ ื“ื™ ืึธื ืคึฟืจืขื’ ื˜ืขืงืกื˜ ืื•ืŸ ืœื•ื™ืคืŸ ืขืก ื“ื•ืจืš Spark SQL:

sql_query = get_sql_query()
testResultDataFrame = spark.sql(sql_query)
testResultDataFrame.show(n=5)

ืื•ืŸ ื“ืขืžืึธืœื˜ ืžื™ืจ ืจืึทื˜ืขื•ื•ืขืŸ ื“ื™ ืจื™ื–ืึทืœื˜ื™ื ื’ ืึทื’ื’ืจืขื’ืึทื˜ืขื“ ื“ืึทื˜ืŸ ืื™ืŸ ืึท ื˜ื™ืฉ ืื™ืŸ AWS RDS. ืฆื• ืจืึทื˜ืขื•ื•ืขืŸ ื“ื™ ืึทื’ื’ืจืขื’ืึทื˜ื™ืึธืŸ ืจืขื–ื•ืœื˜ืึทื˜ืŸ ืฆื• ืึท ื“ืึทื˜ืึทื‘ืึทืกืข ื˜ื™ืฉ, ืžื™ืจ ื•ื•ืขืœืŸ ื ื•ืฆืŸ ื“ื™ ืฉืจื™ื™ึทื‘ืŸ ืื•ืคึฟืŸ ืคื•ืŸ ื“ื™ DataFrame ื›ื™ื™ืคืขืฅ:

testResultDataFrame.write 
    .format("jdbc") 
    .mode("append") 
    .option("driver", 'org.postgresql.Driver') 
    .option("url","jdbc:postgresql://myhabrtest.ciny8bykwxeg.us-east-1.rds.amazonaws.com:5432/habrDB") 
    .option("dbtable", "transaction_flow") 
    .option("user", "habr") 
    .option("password", "habr12345") 
    .save()

ืขื˜ืœืขื›ืข ื•ื•ืขืจื˜ืขืจ ื•ื•ืขื’ืŸ ื‘ืึทืฉื˜ืขื˜ื™ืงืŸ ืึท ืงืฉืจ ืฆื• AWS RDS. ืžื™ืจ ื‘ืืฉืืคืŸ ื“ืขื ื‘ืึทื ื™ืฆืขืจ ืื•ืŸ ืคึผืึทืจืึธืœ ืคึฟืึทืจ ืขืก ืื™ืŸ ื“ื™ "ื“ื™ืคึผืœื•ื™ื™ื ื’ AWS PostgreSQL" ืฉืจื™ื˜. ืื™ืจ ื–ืึธืœ ื ื•ืฆืŸ ืขื ื“ืคึผื•ื™ื ื˜ ื•ื•ื™ ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก ืกืขืจื•ื•ืขืจ URL, ื•ื•ืึธืก ืื™ื– ื’ืขื•ื•ื™ื–ืŸ ืื™ืŸ ื“ื™ ืงืึทื ืขืงื˜ื™ื•ื•ื™ื˜ื™ ืื•ืŸ ื–ื™ื›ืขืจื”ื™ื™ื˜ ืึธืคึผื˜ื™ื™ืœื•ื ื’:

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืื™ืŸ ืกื“ืจ ืฆื• ืจื™ื›ื˜ื™ืง ืคืึทืจื‘ื™ื ื“ืŸ ืกืคึผืึทืจืง ืื•ืŸ ืงืึทืคืงืึท, ืื™ืจ ื–ืึธืœ ืœื•ื™ืคืŸ ื“ื™ ืึทืจื‘ืขื˜ ื“ื•ืจืš smark-submit ื ื™ืฆืŸ ื“ื™ ืึทืจื˜ืึทืคืึทืงื˜ spark-streaming-kafka-0-8_2.11. ืึทื“ื“ื™ื˜ื™ืึธื ืึทืœืœื™, ืžื™ืจ ื•ื•ืขืœืŸ ืื•ื™ืš ื ื•ืฆืŸ ืึท ืึทืจื˜ืึทืคืึทืงื˜ ืคึฟืึทืจ ื™ื ื˜ืขืจืึทืงื˜ื™ื ื’ ืžื™ื˜ ื“ื™ PostgreSQL ื“ืึทื˜ืึทื‘ื™ื™ืก; ืžื™ืจ ื•ื•ืขืœืŸ ืึทืจื™ื‘ืขืจืคื™ืจืŸ ื–ื™ื™ ื“ื•ืจืš -- ืคึผืึทืงืึทื“ื–ืฉืึทื–.

ืคึฟืึทืจ ื“ื™ ื‘ื™ื™ื’ื™ืงื™ื™ื˜ ืคื•ืŸ ื“ื™ ืฉืจื™ืคื˜, ืžื™ืจ ื•ื•ืขืœืŸ ืื•ื™ืš ืึทืจื™ื™ึทื ื ืขืžืขืŸ ื•ื•ื™ ืึทืจื™ื™ึทื ืฉืจื™ื™ึทื‘ ืคึผืึทืจืึทืžืขื˜ืขืจืก ื“ื™ ื ืึธืžืขืŸ ืคื•ืŸ ื“ื™ ืึธื ื–ืึธื’ ืกืขืจื•ื•ืขืจ ืื•ืŸ ื“ื™ ื˜ืขืžืข ืคื•ืŸ โ€‹โ€‹ื•ื•ืึธืก ืžื™ืจ ื•ื•ื™ืœืŸ ืฆื• ื‘ืึทืงื•ืžืขืŸ ื“ืึทื˜ืŸ.

ืึทื–ื•ื™, ืขืก ืื™ื– ืฆื™ื™ื˜ ืฆื• ืงืึทื˜ืขืจ ืื•ืŸ ืงืึธื ื˜ืจืึธืœื™ืจืŸ ื“ื™ ืคืึทื ื’ืงืฉืึทื ืึทืœื™ื˜ื™ ืคื•ืŸ ื“ื™ ืกื™ืกื˜ืขื:

spark-submit 
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2,
org.postgresql:postgresql:9.4.1207 
spark_job.py localhost:9092 transaction

ืึทืœืฅ ื”ืึธื˜ ืื•ื™ืกื’ืขืึทืจื‘ืขื˜! ื•ื•ื™ ืื™ืจ ืงืขื ืขืŸ ื–ืขืŸ ืื™ืŸ ื“ื™ ื‘ื™ืœื“ ืื•ื ื˜ืŸ, ื•ื•ืขืŸ ื“ื™ ืึทืคึผืœืึทืงื™ื™ืฉืึทืŸ ืื™ื– ืคืœื™ืกื ื“ื™ืง, ื ื™ื™ึทืข ืึทื’ื’ืจืขื’ืึทื˜ื™ืึธืŸ ืจืขื–ื•ืœื˜ืึทื˜ืŸ ื–ืขื ืขืŸ ืคึผืจืึธื“ื•ืงืฆื™ืข ื™ืขื“ืขืจ 2 ืกืขืงื•ื ื“ืขืก, ื•ื•ื™ื™ึทืœ ืžื™ืจ ืฉื˜ืขืœืŸ ื“ื™ ื‘ืึทื˜ืฉื™ื ื’ ืžืขื”ืึทืœืขืš ืฆื• 2 ืกืขืงื•ื ื“ืขืก ื•ื•ืขืŸ ืžื™ืจ ื‘ืืฉืืคืŸ ื“ืขื StreamingContext ื›ื™ื™ืคืขืฅ:

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ื•ื•ื™ื™ึทื˜ืขืจ, ืžื™ืจ ืžืึทื›ืŸ ืึท ืคึผืฉื•ื˜ ืึธื ืคึฟืจืขื’ ืฆื• ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก ืฆื• ืงืึธื ื˜ืจืึธืœื™ืจืŸ ื“ื™ ื‘ื™ื™ึทื–ื™ื™ึทืŸ ืคื•ืŸ ืจืขืงืึธืจื“ืก ืื™ืŸ ื“ื™ ื˜ื™ืฉ ื˜ืจืึทื ืกืึทืงื˜ื™ืึธืŸ_ืคืœืึธื•ื•:

Apache Kafka ืื•ืŸ ืกื˜ืจื™ืžื™ื ื’ ื“ืึทื˜ืŸ ืคึผืจืึทืกืขืกื™ื ื’ ืžื™ื˜ Spark Streaming

ืกืึธืฃ

ื“ืขืจ ืึทืจื˜ื™ืงืœ ื”ืึธื˜ ื’ืขืงื•ืงื˜ ืื•ื™ืฃ ืึท ื‘ื™ื™ืฉืคึผื™ืœ ืคื•ืŸ ืกื˜ืจื™ื ืคึผืจืึทืกืขืกื™ื ื’ ืคื•ืŸ ืื™ื ืคึฟืึธืจืžืึทืฆื™ืข ื ื™ืฆืŸ Spark Streaming ืื™ืŸ ืงืึทื ื“ื–ืฉืึทื ื’ืงืฉืึทืŸ ืžื™ื˜ Apache Kafka ืื•ืŸ PostgreSQL. ืžื™ื˜ ื“ืขื ื•ื•ื•ึผืงืก ืคื•ืŸ ื“ืึทื˜ืŸ ืคึฟื•ืŸ ืคืึทืจืฉื™ื“ืŸ ืงื•ื•ืืœืŸ, ืขืก ืื™ื– ืฉื•ื•ืขืจ ืฆื• ืึธื•ื•ื•ืขืจืขืกื˜ืึทืžื™ื™ื˜ ื“ื™ ืคึผืจืึทืงื˜ื™ืฉ ื•ื•ืขืจื˜ ืคื•ืŸ Spark Streaming ืคึฟืึทืจ ืงืจื™ื™ื™ื˜ื™ื ื’ ืกื˜ืจื™ืžื™ื ื’ ืื•ืŸ ืคืึทืงื˜ื™ืฉ-ืฆื™ื™ื˜ ืึทืคึผืœืึทืงื™ื™ืฉืึทื ื–.

ืื™ืจ ืงืขื ืขืŸ ื’ืขืคึฟื™ื ืขืŸ ื“ื™ ืคื•ืœ ืžืงื•ืจ ืงืึธื“ ืื™ืŸ ืžื™ื™ืŸ ืจื™ืคึผืึทื–ืึทื˜ืึธืจื™ ื‘ื™ื™ึท ื’ื™ื˜ื”ื•ื‘.

ืื™ืš ื‘ื™ืŸ ืฆื•ืคืจื™ื“ืŸ ืฆื• ื“ื™ืกืงื•ื˜ื™ืจืŸ ื“ืขื ืึทืจื˜ื™ืงืœ, ืื™ืš ืงื•ืง ืคืึธืจื•ื™ืก ืฆื• ื“ื™ื™ืŸ ื‘ืึทืžืขืจืงื•ื ื’ืขืŸ, ืื•ืŸ ืื™ืš ืื•ื™ืš ื”ืึธืคึฟืŸ ืคึฟืึทืจ ืงืึทื ืกื˜ืจืึทืงื˜ื™ื•ื• ืงืจื™ื˜ื™ืง ืคื•ืŸ ืึทืœืข ืงืึทืจื™ื ื’ ืœื™ื™ืขื ืขืจ.

ืื™ืš ื•ื•ื™ื ื˜ืฉืŸ ืื™ืจ ื”ืฆืœื—ื”!

ืคึผืก. ื˜ื›ื™ืœืขืก, ืขืก ืื™ื– ื’ืขื•ื•ืขืŸ ืคึผืœืึทื ื ืขื“ ืฆื• ื ื•ืฆืŸ ืึท ื”ื™ื’ืข PostgreSQL ื“ืึทื˜ืึทื‘ื™ื™ืก, ืึธื‘ืขืจ ื•ื•ื™ื™ึทืœ ืคื•ืŸ ืžื™ื™ืŸ ืœื™ื‘ืข ืคึฟืึทืจ AWS, ืื™ืš ื‘ืึทืฉืœืึธืกืŸ ืฆื• ืึทืจื™ื‘ืขืจืคื™ืจืŸ ื“ื™ ื“ืึทื˜ืึทื‘ื™ื™ืก ืฆื• ื“ื™ ื•ื•ืึธืœืงืŸ. ืื™ืŸ ื“ืขืจ ื•ื•ื™ื™ึทื˜ืขืจ ืึทืจื˜ื™ืงืœ ืื•ื™ืฃ ื“ืขื ื˜ืขืžืข, ืื™ืš ื•ื•ืขืœ ื•ื•ื™ื™ึทื–ืŸ ื•ื•ื™ ืฆื• ื™ื ืกื˜ืจื•ืžืขื ื˜ ื“ื™ ื’ืื ืฆืข ืกื™ืกื˜ืขื ื“ื™ืกืงืจื™ื™ื‘ื“ ืื•ื™ื‘ืŸ ืื™ืŸ AWS ื ื™ืฆืŸ AWS Kinesis ืื•ืŸ AWS EMR. ื’ื™ื™ ื“ื™ ื ื™ื™ึทืขืก!

ืžืงื•ืจ: www.habr.com

ืœื™ื™ื’ืŸ ืึท ื‘ืึทืžืขืจืงื•ื ื’