Apache Kafka and Streaming Data Processing with Spark Streaming
Hello, Habr! Today we will build a system that processes Apache Kafka message streams with Spark Streaming and writes the processing results to an AWS RDS cloud database.
Let's imagine that a credit institution has given us the task of processing incoming transactions "on the fly" across all of its branches. This could be done to promptly calculate the open currency position for the treasury, or limits and financial results for transactions, and so on.
Read on to see how to implement this case without magic or magic spells. Let's go!
Next, using the send method, we send a message to the server, to the topic we need, in JSON format:
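from json import dumps

from kafka import KafkaProducer

# Serialize each message as UTF-8 encoded JSON and compress batches with gzip.
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: dumps(x).encode('utf-8'),
                         compression_type='gzip')
my_topic = 'transaction'
# get_random_value() is not shown in this section; it must return a JSON-serializable record.
data = get_random_value()

try:
    future = producer.send(topic=my_topic, value=data)
    # Block until the broker acknowledges the message (or 10 seconds pass).
    record_metadata = future.get(timeout=10)

    print('--> The message has been sent to a topic: '
          '{}, partition: {}, offset: {}'
          .format(record_metadata.topic,
                  record_metadata.partition,
                  record_metadata.offset))
except Exception as e:
    print('--> It seems an Error occurred: {}'.format(e))
finally:
    # Make sure any buffered messages are actually sent before exiting.
    producer.flush()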
When we run the script, we get the following messages in the terminal:
This means that everything works as we wanted: the producer generates messages and sends them to the topic we need.
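To make the value_serializer step concrete, here is a minimal sketch that needs no running broker. It applies the same lambda the producer uses to a sample record and decodes it back, the way a consumer would. The field names (branch, currency, amount) are taken from the aggregation query used later in the article; the sample values and the deserialize helper are assumptions made for illustration.

```python
from json import dumps, loads

# Same serializer as the one passed to KafkaProducer above.
serialize = lambda x: dumps(x).encode('utf-8')


def deserialize(raw: bytes) -> dict:
    """Hypothetical consumer-side counterpart: UTF-8 JSON bytes -> dict."""
    return loads(raw.decode('utf-8'))


# Illustrative record; the values are made up.
sample = {'branch': 'Kazan', 'currency': 'USD', 'amount': 100}

payload = serialize(sample)          # what actually travels to the broker
restored = deserialize(payload)      # what a consumer would get back

print(type(payload).__name__)
print(restored == sample)
```

A consumer that reads the topic would need a matching deserializer, otherwise it receives raw bytes.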
The next step is to install Spark and process this message stream.
Installing Apache Spark
Apache Spark is a general-purpose, high-performance cluster computing platform.
Spark outperforms popular implementations of the MapReduce model while supporting a wider range of computation types, including interactive queries and stream processing. Speed plays an important role when processing large amounts of data, because it is speed that lets you work interactively without spending minutes or hours waiting. One of Spark's greatest strengths, and a key reason it is so fast, is its ability to perform computations in memory.
The framework is written in Scala, so you need to install Scala first:
sudo apt-get install scala
Download the Spark distribution from the official website:
Since this example is for educational purposes only, we will use a free "minimal" server (Free Tier):
Next, we tick the box in the Free Tier block, after which we are offered an instance of the t2.micro class. It is not powerful, but it is free and quite suitable for our task:
Next come some very important settings: the name of the database instance, the name of the master user, and the master user's password. Let's name the instance myHabrTest, set the master user to habr and the password to 12345, and click the Next button:
The next page contains the parameters responsible for the accessibility of our database server from outside (Public accessibility) and for the port:
Using Spark SQL, we perform a simple grouping and output the result to the console:
select
    from_unixtime(unix_timestamp()) as curr_time,
    t.branch as branch_name,
    t.currency as currency_code,
    sum(amount) as batch_value
from treasury_stream t
group by
    t.branch,
    t.currency
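To show what this grouping produces for one batch, here is a small sketch that runs the same GROUP BY on an in-memory SQLite table instead of Spark. This is only an illustration under assumptions: the sample rows are made up, SQLite stands in for Spark SQL, and the curr_time column is left out because from_unixtime and unix_timestamp are Spark SQL functions that SQLite does not provide. An ORDER BY is added only to make the printed output deterministic.

```python
import sqlite3

# Made-up transactions for a single batch: (branch, currency, amount).
rows = [
    ('Kazan', 'USD', 100),
    ('Kazan', 'USD', 50),
    ('Kazan', 'EUR', 20),
    ('SPB', 'USD', -30),
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'create table treasury_stream (branch text, currency text, amount integer)'
)
conn.executemany('insert into treasury_stream values (?, ?, ?)', rows)

# Same grouping as the Spark SQL query, minus the Spark-specific timestamp.
result = conn.execute("""
    select
        t.branch as branch_name,
        t.currency as currency_code,
        sum(t.amount) as batch_value
    from treasury_stream t
    group by
        t.branch,
        t.currency
    order by
        t.branch,
        t.currency
""").fetchall()
conn.close()

for row in result:
    print(row)
```

Each output row is one (branch, currency) pair with the total amount for the batch, which is the shape of the data that is later written to the database.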
We get the query text and run it through Spark SQL:
Then we save the resulting aggregated data to a table in AWS RDS. To save the aggregation results to a database table, we will use the write method of the DataFrame object:
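A minimal sketch of such a write through Spark's JDBC data source might look like the following. The helper names build_jdbc_options and save_aggregates are hypothetical, and the host, database name and environment variable names are placeholders rather than the article's real settings. It also assumes that the PostgreSQL JDBC driver is available on Spark's classpath.

```python
import os


def build_jdbc_options(host, database, table, user, password, port=5432):
    """Collect the options Spark's JDBC data source expects for PostgreSQL."""
    return {
        'url': 'jdbc:postgresql://{}:{}/{}'.format(host, port, database),
        'dbtable': table,
        'user': user,
        'password': password,
        'driver': 'org.postgresql.Driver',
    }


def save_aggregates(df, options):
    """Append the rows of a Spark DataFrame to the target table over JDBC."""
    df.write.format('jdbc').options(**options).mode('append').save()


# Placeholder connection settings (assumptions, not the real endpoint).
options = build_jdbc_options(
    host=os.environ.get('RDS_HOST', 'example-host.rds.amazonaws.com'),
    database=os.environ.get('RDS_DATABASE', 'postgres'),
    table='transaction_flow',
    user='habr',
    password=os.environ.get('RDS_PASSWORD', ''),
)
print(options['url'])
```

Calling save_aggregates(aggregated_df, options) for each batch would append the new aggregates, so the table accumulates a history of results. Reading the password from the environment keeps it out of the source code.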
Everything worked! As you can see in the picture below, while the application is running, new aggregation results are output every 2 seconds, because we set the batch interval to 2 seconds when we created the StreamingContext object:
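The effect of the batch interval can be illustrated without Spark. The sketch below is a conceptual model only, not Spark's implementation: it groups timestamped records into fixed 2-second windows, the way a micro-batch engine hands records to the processing step in time slices. The split_into_batches helper and the sample events are assumptions made for illustration.

```python
from collections import defaultdict

BATCH_INTERVAL = 2  # seconds, matching the interval given to StreamingContext


def split_into_batches(events, interval=BATCH_INTERVAL):
    """Group (timestamp_seconds, amount) pairs into fixed-width time windows."""
    batches = defaultdict(list)
    for ts, amount in events:
        # Start of the window this event falls into: 0, 2, 4, ...
        batch_start = int(ts // interval) * interval
        batches[batch_start].append(amount)
    return dict(sorted(batches.items()))


# Made-up events: (arrival time in seconds, transaction amount).
events = [(0.1, 10), (1.9, 5), (2.0, 7), (3.5, 1), (4.2, 8)]

for start, amounts in split_into_batches(events).items():
    # One aggregation result is produced per window.
    print(start, sum(amounts))
```

A shorter interval gives fresher results at the cost of more frequent, smaller jobs; a longer one does the opposite.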
Next, we run a simple query against the database to check that there are records in the transaction_flow table:
Conclusion
This article looked at an example of stream processing with Spark Streaming in conjunction with Apache Kafka and PostgreSQL. With the growth of data from various sources, it is hard to overestimate the practical value of Spark Streaming for building streaming and real-time applications.
You can find the full source code in my repository on GitHub.
I am happy to discuss this article, I look forward to your comments, and I also hope for constructive criticism from all attentive readers.
I wish you success!
P.S. Initially I planned to use a local PostgreSQL database, but given my love for AWS, I decided to move the database to the cloud. In the next article on this topic, I will show how to implement the entire system described above in AWS using AWS Kinesis and AWS EMR. Stay tuned!