Open Source DataHub: LinkedIn's Metadata Search and Discovery Platform

Finding the right data quickly is critical for any company that relies on large amounts of data to make data-driven decisions. This affects not only the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but also has a direct impact on end products that depend on a high-quality machine learning (ML) pipeline. In addition, the trend toward adopting or building machine learning platforms naturally raises the question: what is your method for internal discovery of features, models, metrics, datasets, and so on?

In this article we describe how we released DataHub, our metadata search and discovery platform, under an open source license, starting from the early days of the WhereHows project. LinkedIn maintains its own version of DataHub separately from the open source version. We begin by explaining why we need two separate development environments, then discuss the early approaches to open sourcing WhereHows and compare our internal (production) version of DataHub with the version on GitHub. We also share details of our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we give instructions on how to get started with the open source DataHub and briefly discuss its architecture.

WhereHows is now DataHub!

LinkedIn's metadata team previously introduced DataHub (the successor to WhereHows), LinkedIn's metadata search and discovery platform, and shared plans to open source it. Shortly after that announcement, we released an alpha version of DataHub and shared it with the community. Since then, we have continued to contribute to the repository and to work with interested users to add the most requested features and resolve issues. We are now pleased to announce the official release of DataHub on GitHub.

Open Source Approaches

WhereHows, LinkedIn's original portal for finding data and its lineage, started as an internal project; the metadata team open sourced it in 2016. Since then, the team has maintained two separate codebases, one for open source and one for internal use at LinkedIn, because not every product feature developed for LinkedIn use cases was applicable to a broader audience. In addition, WhereHows has internal dependencies (infrastructure, libraries, and so on) that are not open source. In the years that followed, WhereHows went through many iterations and development cycles, which made keeping the two codebases in sync a major challenge. Over the years the metadata team has tried different approaches to keep internal and open source development synchronized.

First attempt: "Open source first"

We initially followed an "open source first" development model, in which most development happens in the open source repository and changes are pulled in for internal deployment. The problem with this approach is that code is always pushed to GitHub before it has been fully reviewed internally. Until changes were pulled in from the open source repository and a new internal deployment was made, we would not discover any production issues. When a deployment did go wrong, it was also hard to identify the culprit, because changes were pulled in batches.

In addition, this model reduced the team's productivity when developing new features that required rapid iteration, because it forced every change to be pushed to the open source repository first and only then to the internal repository. To shorten turnaround time, an urgent fix or change could be made in the internal repository first, but this became a serious problem when merging those changes back into the open source repository, because the two repositories were out of sync.

This model is much easier to apply to shared platforms, libraries, or infrastructure projects than to full-featured web applications. It is also ideal for projects that are open source from day one, but WhereHows was built as an entirely internal web application. It was very difficult to abstract away all of the internal dependencies, so we needed to keep an internal fork, but maintaining an internal fork while developing mostly in open source did not work out.

Second attempt: "Internal first"

As a second attempt, we moved to an "internal first" development model, in which most development happens in-house and changes are pushed to the open source code on a regular basis. Although this model suits our use case best, it has inherent problems. Pushing every diff directly to the open source repository and resolving merge conflicts afterwards is an option, but it is time-consuming. In most cases developers try to avoid doing this every time they check in their code. As a result, it happens much less often, in batches, which makes the merge conflicts harder to resolve later.

The third time worked!

The two failed attempts described above left the WhereHows GitHub repository out of date for a long time. The team kept improving the product's features and architecture, so the internal version of WhereHows at LinkedIn moved far ahead of the open source version. It even got a new name: DataHub. Based on the earlier failed attempts, the team decided to develop a scalable, long-term solution.

For any new open source project, LinkedIn's open source team recommends and supports a development model in which the project's modules are developed entirely in the open. Versioned artifacts are deployed to a public repository and then brought back into LinkedIn's internal artifact repository using an external library request (ELR). Following this development model is not only good for those who use the open source project, it also leads to a more modular, extensible, and pluggable architecture.

However, a mature back-end application such as DataHub needs a significant amount of time to reach that state. This also rules out open sourcing a fully working implementation before all internal dependencies have been abstracted away. That is why we developed tooling that lets us make open source contributions faster and with much less pain. This solution benefits both the metadata team (the developer of DataHub) and the open source community. The following sections discuss this new approach.

Open Source Publishing Automation

The metadata team's latest approach to open sourcing DataHub is to develop a tool that automatically syncs the internal codebase with the open source repository. The high-level features of this toolkit are:

  1. Syncing LinkedIn code to and from open source, similar to rsync.
  2. Generating license headers, similar to Apache Rat.
  3. Automatically generating open source commit logs from internal commit logs.
  4. Preventing internal changes that break the open source build, by means of dependency testing.

The following subsections look more closely at those of the features above that pose interesting problems.

Source code synchronization

Unlike the open source version of DataHub, which is a single GitHub repository, the LinkedIn version of DataHub is a combination of several repositories (internally called multiproducts). The DataHub frontend, the metadata model library, the metadata store backend service, and the streaming jobs live in separate repositories at LinkedIn. However, to make things easier for open source users, we have a single repository for the open source version of DataHub.

Figure 1: Synchronization between the LinkedIn DataHub repositories and the single open source DataHub repository

To support automated build, push, and pull workflows, our new tool automatically creates a file-level mapping that has an entry for each source file. However, the toolkit needs some initial configuration: users must provide a high-level module mapping, as shown below.

{
  "datahub-dao": [
    "${datahub-frontend}/datahub-dao"
  ],
  "gms/impl": [
    "${dataset-gms}/impl",
    "${user-gms}/impl"
  ],
  "metadata-dao": [
    "${metadata-models}/metadata-dao"
  ],
  "metadata-builders": [
    "${metadata-models}/metadata-builders"
  ]
}

The module-level mapping is a simple JSON document whose keys are the target modules in the open source repository and whose values are lists of source modules in the LinkedIn repositories. Any target module in the open source repository can be fed by any number of source modules. Internal repository names in the source modules are written using Bash-style string interpolation. From the module-level mapping file, the tools create a file-level mapping file by scanning all files in the associated directories.

{
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Foo.java": "metadata-builders/src/main/java/com/linkedin/Foo.java",
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Bar.java": "metadata-builders/src/main/java/com/linkedin/Bar.java",
  "${metadata-models}/metadata-builders/build.gradle": null
}
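
The generation step described above can be sketched in a few lines of Python. This is only an illustration and not LinkedIn's actual tooling: the function names, and the `repo_roots` dictionary that maps each internal repository name to a local checkout path, are assumptions made for the example.

```python
import os
import re

# Matches Bash-style placeholders such as ${metadata-models}.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(path, repo_roots):
    """Replace each ${repo} placeholder with that repository's local checkout path."""
    return PLACEHOLDER.sub(lambda match: repo_roots[match.group(1)], path)


def build_file_mapping(module_mapping, repo_roots):
    """Scan every source module and map each source file to its open source path.

    Keys keep the ${repo} placeholder so the mapping stays independent of
    where the repositories happen to be checked out (an assumption).
    """
    file_mapping = {}
    for target_module, source_modules in module_mapping.items():
        for source_module in source_modules:
            source_dir = expand(source_module, repo_roots)
            for dirpath, _dirnames, filenames in os.walk(source_dir):
                for name in sorted(filenames):
                    full_path = os.path.join(dirpath, name)
                    rel = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                    file_mapping[f"{source_module}/{rel}"] = f"{target_module}/{rel}"
    return file_mapping
```

In this sketch every scanned file gets a target path; an entry would have to be set to null by hand (or by an exclusion rule not shown here) to keep a file out of the open source repository.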

The file-level mapping is created automatically by the tools; however, it can also be updated manually by the user. It is a 1:1 mapping from a LinkedIn source file to a file in the open source repository. A few rules apply to this automatic creation of file associations:

  • When several source modules feed one target module in open source, conflicts can arise, for example the same FQCN existing in more than one source module. As a conflict resolution strategy, our tools default to "last one wins".
  • "null" means that the source file is not part of the open source repository.
  • After each push to or pull from open source, this mapping is updated automatically and a snapshot is created. This is needed to detect additions to and deletions from the source code since the last operation.
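
The conflict rule and the snapshot comparison can be illustrated with a short Python sketch. It assumes the file-level mapping is loaded as a dictionary in file order; the function names are hypothetical, and marking the losing source file as null is an assumption of this example rather than documented behavior of the tool.

```python
def resolve_conflicts(file_mapping):
    """Keep one source per target; if several sources claim a target, the last one wins."""
    winner_by_target = {}
    for source, target in file_mapping.items():
        if target is not None:
            winner_by_target[target] = source  # later entries overwrite earlier ones
    resolved = {}
    for source, target in file_mapping.items():
        is_winner = target is not None and winner_by_target[target] == source
        # Assumption: a losing source is treated like an excluded file (null).
        resolved[source] = target if is_winner else None
    return resolved


def diff_snapshots(previous, current):
    """Return (added, deleted) open source paths between two mapping snapshots."""
    before = {target for target in previous.values() if target is not None}
    after = {target for target in current.values() if target is not None}
    return sorted(after - before), sorted(before - after)
```

Comparing the snapshot taken at the previous sync with the current mapping is what lets a tool of this kind notice files that appeared or disappeared in between, which a plain copy of the current tree would miss.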

Commit log generation

Commit logs for open source commits are also generated automatically, by merging the commit logs of the internal repositories. Below is a sample commit log that shows the structure of a log generated by our tool. A commit clearly indicates which versions of the source repositories are packaged in it and provides a summary of the commit log. The open source repository contains real examples of commit logs generated by our toolkit.

metadata-models 29.0.0 -> 30.0.0
    Added aspect model foo
    Fixed issue bar

dataset-gms 2.3.0 -> 2.3.4
    Added rest.li API to serve foo aspect

MP_VERSION=dataset-gms:2.3.4
MP_VERSION=metadata-models:30.0.0
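
As a rough illustration, a log in this shape could be assembled as follows. This is a minimal sketch, not the actual tool: the `RepoUpdate` type and `render_commit_log` function are hypothetical names, and ordering the trailing `MP_VERSION` lines alphabetically is inferred from the sample above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepoUpdate:
    """Version bump of one internal repository plus its commit summaries."""
    name: str
    old_version: str
    new_version: str
    messages: List[str] = field(default_factory=list)


def render_commit_log(updates):
    """Merge per-repository logs into one open source commit message."""
    sections = []
    for update in updates:
        lines = [f"{update.name} {update.old_version} -> {update.new_version}"]
        lines += [f"    {message}" for message in update.messages]
        sections.append("\n".join(lines))
    # Trailer recording which internal versions this commit packages.
    trailer = "\n".join(
        f"MP_VERSION={update.name}:{update.new_version}"
        for update in sorted(updates, key=lambda u: u.name)
    )
    return "\n\n".join(sections + [trailer])
```

Calling `render_commit_log` with one `RepoUpdate` for metadata-models (29.0.0 to 30.0.0) and one for dataset-gms (2.3.0 to 2.3.4) reproduces the layout of the sample log.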

Dependency testing

LinkedIn has a dependency testing infrastructure that helps ensure that changes to an internal multiproduct do not break the build of multiproducts that depend on it. The open source DataHub repository is not a multiproduct and cannot be a direct dependency of any multiproduct, but with the help of a wrapper multiproduct that fetches the open source DataHub source code, we can still use this dependency testing system. Any change (that may later be exposed) to one of the multiproducts that feed the open source DataHub repository therefore triggers a build event in the wrapper multiproduct. As a result, any change that fails to build the wrapper fails the tests before the original multiproduct is committed, and is reverted.

This is a useful mechanism: it prevents any internal commit that would break the open source build and catches it at commit time. Without it, it would be quite hard to determine which internal commit caused the open source build to fail, because we sync internal changes to the DataHub open source repository in batches.
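
The gating logic can be reduced to a very small sketch. It is purely illustrative: `gate_change`, the set of feeder multiproducts, and the `build_wrapper` callback are hypothetical stand-ins for LinkedIn's internal infrastructure, whose real interfaces are not described here.

```python
def gate_change(changed_product, feeder_products, build_wrapper):
    """Decide whether an internal change may be committed.

    changed_product: name of the multiproduct being changed.
    feeder_products: multiproducts that feed the open source repository.
    build_wrapper:   callable that builds the wrapper multiproduct with the
                     change applied and returns True when the build succeeds.
    """
    if changed_product not in feeder_products:
        return True  # the change cannot affect the open source build
    # A failing wrapper build rejects the change before it is committed.
    return bool(build_wrapper())
```

The point of the design is that the failure is attributed to a single change at commit time, rather than being discovered later in a batch of synced commits.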

Differences between open source DataHub and our production version

So far we have discussed our solution for syncing the two versions of the DataHub repositories, but we have not yet explained why we need two separate development streams in the first place. In this section we list the differences between the public version of DataHub and the production version on LinkedIn's servers, and explain the reasons for them.

One source of divergence is that our production version depends on code that is not yet open source, such as LinkedIn's Offspring (LinkedIn's internal dependency injection framework). Offspring is widely used in internal codebases because it is the preferred way to manage dynamic configuration. But it is not open source, so we needed to find open source alternatives for the open source DataHub.

There are other reasons as well. When we extend the metadata model for LinkedIn's needs, those extensions are usually very specific to LinkedIn and may not apply directly to other environments. For example, we have very specific labels for member IDs and other kinds of matching metadata. For now, we have therefore excluded these extensions from DataHub's open source metadata model. As we engage with the community and understand its needs, we will work on common open source versions of these extensions where they are needed.

Ease of use and easier adoption by the open source community also motivated some of the differences between the two versions of DataHub. The difference in stream processing infrastructure is a good example. Although our internal version uses a managed stream processing framework, we chose embedded (standalone) stream processing for the open source version, because it avoids creating yet another infrastructure dependency.

Another example of a difference is having a single GMS (Generalized Metadata Store) in the open source implementation instead of multiple GMSs. GMA (Generalized Metadata Architecture) is the name of the back-end architecture of DataHub, and GMS is the metadata store in the context of GMA. GMA is a very flexible architecture: it lets you place each data construct (datasets, users, and so on) in its own metadata store, or keep several data constructs in a single metadata store, as long as the registry that holds the mapping from data constructs to GMSs is updated. For ease of use, we chose a single GMS instance that stores all the different data constructs in the open source DataHub.

A complete list of the differences between the two implementations is given in the table below.

| Product Features | LinkedIn DataHub | Open Source DataHub |
|---|---|---|
| Supported Data Constructs | 1) Datasets 2) Users 3) Metrics 4) ML Features 5) Charts 6) Dashboards | 1) Datasets 2) Users |
| Supported Metadata Sources for Datasets | 1) Ambry 2) 3) Dalids 4) Espresso 5) HDFS 6) Hive 7) Kafka 8) MongoDB 9) MySQL 10) Oracle 11) Pinot 12) Presto 13) Seas 14) Teradata 15) Vector 16) Venice | Hive, Kafka, RDBMS |
| Pub-sub | LinkedIn Kafka | Confluent Kafka |
| Stream Processing | Managed | Embedded (standalone) |
| Dependency Injection & Dynamic Configuration | LinkedIn Offspring | Spring |
| Build Tooling | Ligradle (LinkedIn's internal Gradle wrapper) | Gradlew |
| CI/CD | CRT (LinkedIn's internal CI/CD) | Travis CI and Docker Hub |
| Metadata Stores | Multiple distributed GMSs: 1) Dataset GMS 2) User GMS 3) Metric GMS 4) Feature GMS 5) Chart/Dashboard GMS | Single GMS for: 1) Datasets 2) Users |

Microservices in Docker containers

Docker simplifies the deployment and distribution of applications through containerization. Every service in the open source DataHub, including infrastructure components such as Kafka, Elasticsearch, Neo4j, and MySQL, has its own Docker image. To orchestrate the Docker containers we used Docker Compose.

Figure 2: Architecture of the open source DataHub

The figure above shows the high-level architecture of DataHub. Apart from the infrastructure components, it has four different Docker containers:

datahub-gms: the metadata store service.

datahub-frontend: a Play application that serves the DataHub interface.

datahub-mce-consumer: a Kafka Streams application that consumes the metadata change event (MCE) stream and updates the metadata store.

datahub-mae-consumer: a Kafka Streams application that consumes the metadata audit event (MAE) stream and builds the search index and the graph database.
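
The division of labor between the two consumers can be pictured with a small in-memory Python sketch. It is not DataHub code: the event fields (`urn`, `aspect`, `old`, `new`), the example URN, and the idea that applying a change directly yields the audit event are all assumptions made for the illustration, and plain dictionaries stand in for the metadata store and the search index.

```python
from collections import defaultdict

metadata_store = {}              # stands in for the store behind datahub-gms
search_index = defaultdict(set)  # term -> set of URNs; stands in for the search index


def consume_mce(event):
    """Apply a metadata change event to the store and return an audit event."""
    urn = event["urn"]
    old_aspect = metadata_store.get(urn)
    metadata_store[urn] = event["aspect"]
    # Assumption: the audit event records the value before and after the change.
    return {"urn": urn, "old": old_aspect, "new": event["aspect"]}


def consume_mae(event):
    """Update the search index from a metadata audit event."""
    name = str(event["new"].get("name", ""))
    for term in name.lower().split():
        search_index[term].add(event["urn"])


audit_event = consume_mce({"urn": "urn:example:dataset:1", "aspect": {"name": "Page Views"}})
consume_mae(audit_event)
```

Keeping the write path (MCE) separate from the indexing path (MAE) means the search index and graph can be rebuilt or extended without touching the code that updates the metadata store.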

The documentation in the open source repository and the original DataHub blog post contain more detailed information about what the various services do.

CI/CD for open source DataHub

The open source DataHub repository uses Travis CI for continuous integration and Docker Hub for continuous deployment. Both integrate well with GitHub and are easy to set up. For most open source infrastructure developed by the community or by private companies (for example, Confluent), Docker images are built and published to Docker Hub so that the community can use them easily. Any Docker image available on Docker Hub can be used with a simple docker pull command.

With every commit to the DataHub open source repository, all Docker images are built automatically and published to Docker Hub with the "latest" tag. If Docker Hub is configured with regular-expression-based naming rules, all tags in the open source repository are also released under the corresponding tag names on Docker Hub.
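
The tagging rule can be expressed as a small function. This is a sketch under stated assumptions: the `v<major>.<minor>.<patch>` pattern is a made-up example of a naming rule, since the actual Docker Hub configuration is not given, and `docker_tags` is a hypothetical helper rather than part of any Docker tooling.

```python
import re

# Hypothetical naming rule; the real Docker Hub configuration is not specified.
TAG_RULE = re.compile(r"^v\d+\.\d+\.\d+$")


def docker_tags(ref_name, is_git_tag):
    """Return the Docker image tags to publish for a pushed git ref."""
    if is_git_tag:
        # A matching git tag is released under the same name on Docker Hub.
        return [ref_name] if TAG_RULE.match(ref_name) else []
    # Every ordinary commit refreshes the "latest" image.
    return ["latest"]
```

Under these assumptions, a commit on a branch yields `["latest"]`, a git tag such as `v0.3.1` yields `["v0.3.1"]`, and a git tag that does not match the rule publishes nothing.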

Using DataHub

Setting up DataHub is very simple and takes three easy steps:

  1. Clone the open source repository and start all Docker containers with docker-compose, using the provided docker-compose script for a quick start.
  2. Ingest the sample data provided in the repository using the command-line tool that is also provided.
  3. Browse DataHub in your web browser.

An actively monitored Gitter chat is also set up for quick questions. Users can also create issues directly in the GitHub repository. Most importantly, we welcome and appreciate all feedback and suggestions!

Plans for the future

Currently, every infrastructure component and microservice of the open source DataHub is built as a Docker container, and the whole system is orchestrated with docker-compose. Given the popularity and wide adoption of Kubernetes, we would also like to provide a Kubernetes-based solution in the near future.

We also plan to provide a turnkey solution for deploying DataHub on a public cloud service such as Azure, AWS, or Google Cloud. Given the recent announcement of LinkedIn's migration to Azure, this will align with the metadata team's internal priorities.

Last but not least, thank you to all the early adopters of DataHub in the open source community who evaluated the DataHub alphas and helped us identify issues and improve the documentation.

Source: www.habr.com