Open Sourcing DataHub: LinkedIn's Metadata Search and Discovery Platform

Finding the right data quickly is essential for any company that relies on large amounts of data to make data-driven decisions. This not only affects the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but also has a direct impact on end products that depend on a quality machine learning (ML) pipeline. In addition, the trend toward adopting or building machine learning platforms naturally raises the question: what is your method for internal discovery of features, models, metrics, datasets, and so on?

In this post, we describe how we open sourced DataHub, our metadata search and discovery platform, starting from the early days of the WhereHows project. LinkedIn maintains its own version of DataHub separately from the open source version. We start by explaining why we need two separate development environments, then discuss our early approaches to open sourcing WhereHows and compare our internal (production) version of DataHub with the version on GitHub. We also share details about our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we provide instructions on how to get started with the open source DataHub and briefly discuss its architecture.

WhereHows is now DataHub!

LinkedIn's metadata team previously introduced DataHub (the successor to WhereHows), LinkedIn's metadata search and discovery platform, and shared plans to open source it. Shortly after that announcement, we released an alpha version of DataHub and shared it with the community. Since then, we have continued to contribute to the repository and to work with interested users to add the most requested features and resolve issues. We are now happy to announce the official release of DataHub on GitHub.

Open Source Approaches

WhereHows, LinkedIn's original portal for finding data and its lineage, started as an internal project; the metadata team open sourced its code in 2016. Since then, the team has always maintained two separate codebases, one for open source and one for LinkedIn's internal use, because not all product features developed for LinkedIn use cases were applicable to a wider audience. In addition, WhereHows has some internal dependencies (infrastructure, libraries, etc.) that are not open source. In the following years, WhereHows went through many iterations and development cycles, which made keeping the two codebases in sync a major challenge. The metadata team tried different approaches over the years to keep internal and open source development in sync.

First attempt: "Open source first"

We initially followed an "open source first" development model, in which most development happens in the open source repository and changes are pulled in for internal deployment. The problem with this approach is that code is always pushed to GitHub first, before it has been fully reviewed internally. Until changes were pulled from the open source repository and a new internal deployment was made, we would not find any production issues. When a deployment went badly, it was also very hard to identify the culprit, because changes were pulled in batches.

In addition, this model reduced the team's productivity when developing new features that required rapid iteration, since it forced all changes to be pushed to the open source repository first and then pulled into the internal repository. To reduce turnaround time, a required fix or change could be made in the internal repository first, but this became a major problem when it came to merging those changes back into the open source repository, because the two repositories were no longer in sync.

This model is much easier to apply to shared platforms, libraries, or infrastructure projects than to full-featured custom web applications. It is also ideal for projects that start as open source from day one, but WhereHows was built as a completely internal web application. It was very hard to fully abstract away all the internal dependencies, so we needed to keep an internal fork, and keeping an internal fork while developing mostly in open source did not quite work.

Second attempt: "Internal first"

As a second attempt, we moved to an "internal first" development model, in which most development happens in-house and changes are pushed to the open source code on a regular basis. Although this model fits our use case best, it has inherent problems. Pushing every diff directly to the open source repository and then trying to resolve merge conflicts later is an option, but it is time-consuming. In most cases, developers try not to do this every time they check in their code. As a result, it happens much less frequently, in batches, which makes resolving merge conflicts later even harder.

The third time, it worked!

The two failed attempts described above left the WhereHows GitHub repository out of date for a long time. The team kept improving the product's features and architecture, so the internal version of WhereHows at LinkedIn became more advanced than the open source version. It even got a new name: DataHub. Based on the previous failed attempts, the team decided to develop a scalable, long-term solution.

For any new open source project, LinkedIn's open source team advises and supports a development model in which the project's modules are developed entirely in open source. Versioned artifacts are deployed to a public repository and then brought back into LinkedIn's internal artifact repository using an external library request (ELR). Following this development model is not only good for those who use the open source code, but also results in a more modular, extensible, and pluggable architecture.

However, a mature back-end application such as DataHub needs a significant amount of time to reach this state. It also rules out open sourcing a fully working implementation before all internal dependencies have been fully abstracted away. That is why we developed tooling that helps us make open source contributions faster and with much less pain. This solution benefits both the metadata team (the DataHub developers) and the open source community. The following sections discuss this new approach.

Open Source Publishing Automation

The metadata team's latest approach to open sourcing DataHub is to develop a tool that automatically syncs the internal codebase and the open source repository. High-level features of this tool include:

  1. Sync LinkedIn code to/from open source, similar to rsync.
  2. Generate license headers, similar to Apache Rat.
  3. Automatically generate open source commit logs from internal commit logs.
  4. Prevent internal changes that break the open source build by means of dependency testing.

The following subsections dive into those of the above features that pose interesting challenges.
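Of the features listed, license header generation is the easiest to picture. The following is a minimal sketch only, not the LinkedIn tool: the header text and the function names are placeholders assumed for illustration. It prepends a header to a source file's text unless the header is already there.

```python
# Hypothetical sketch of a license-header step; not the actual LinkedIn tool.
# The header text below is a placeholder assumed for illustration.
LICENSE_HEADER = (
    "/*\n"
    " * Licensed under the Apache License, Version 2.0.\n"
    " */\n"
)


def has_header(source: str) -> bool:
    """Return True if the text already starts with the expected header."""
    return source.lstrip().startswith(LICENSE_HEADER.strip())


def ensure_header(source: str) -> str:
    """Prepend the license header unless it is already present."""
    if has_header(source):
        return source
    return LICENSE_HEADER + source
```

Because ensure_header leaves files that already carry the header unchanged, running it repeatedly over the same tree is safe.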

Source code sync

Unlike the open source version of DataHub, which is a single GitHub repository, the LinkedIn version of DataHub is a combination of multiple repositories (internally called multiproducts). The DataHub frontend, the metadata models library, the metadata store backend service, and the streaming jobs reside in separate repositories at LinkedIn. However, to make things easier for open source users, we have a single repository for the open source version of DataHub.

Figure 1: Syncing between the LinkedIn DataHub repositories and the single open source DataHub repository

To support automated build, push, and pull workflows, our new tool automatically creates a file-level mapping corresponding to each source file. However, the toolkit requires an initial configuration, and users must provide a high-level module mapping as shown below.

{
  "datahub-dao": [
    "${datahub-frontend}/datahub-dao"
  ],
  "gms/impl": [
    "${dataset-gms}/impl",
    "${user-gms}/impl"
  ],
  "metadata-dao": [
    "${metadata-models}/metadata-dao"
  ],
  "metadata-builders": [
    "${metadata-models}/metadata-builders"
  ]
}

The module-level mapping is a simple JSON document whose keys are the target modules in the open source repository and whose values are lists of source modules in the LinkedIn repositories. Any target module in the open source repository can be fed by any number of source modules. To indicate internal repository names in source modules, use Bash-style string interpolation. Using the module-level mapping file, the tools create a file-level mapping file by scanning all files in the associated directories.
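The scan described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual tool: the repo_roots dictionary (repository name to checkout path), the list_files hook, and all function names are hypothetical. An unknown repository name in a placeholder raises KeyError.

```python
import os
import re
from typing import Callable, Dict, List

# Matches Bash-style placeholders such as ${metadata-models}.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(path: str, repo_roots: Dict[str, str]) -> str:
    """Replace each ${repo} placeholder with that repository's checkout path."""
    return PLACEHOLDER.sub(lambda m: repo_roots[m.group(1)], path)


def walk_files(root: str) -> List[str]:
    """List files under root as sorted, '/'-separated relative paths."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def build_file_map(
    module_map: Dict[str, List[str]],
    repo_roots: Dict[str, str],
    list_files: Callable[[str], List[str]] = walk_files,
) -> Dict[str, str]:
    """Derive a file-level mapping from a module-level mapping.

    Keys keep the ${repo} placeholder; values are paths in the open source
    repository.
    """
    file_map: Dict[str, str] = {}
    for target, sources in module_map.items():
        for source in sources:
            for rel in list_files(expand(source, repo_roots)):
                file_map[f"{source}/{rel}"] = f"{target}/{rel}"
    return file_map
```

The sketch maps every scanned file; excluding a file from open source (a null target) would be a manual edit to the generated mapping.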

{
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Foo.java":
    "metadata-builders/src/main/java/com/linkedin/Foo.java",
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Bar.java":
    "metadata-builders/src/main/java/com/linkedin/Bar.java",
  "${metadata-models}/metadata-builders/build.gradle": null
}

The file-level mapping is created automatically by the tools; however, it can also be updated manually by the user. It is a 1:1 mapping from a LinkedIn source file to a file in the open source repository. Several rules apply to this automatic creation of file associations:

  • When multiple source modules feed one target module in open source, conflicts can arise, for example the same FQCN (fully qualified class name) existing in more than one source module. As a conflict resolution strategy, our tools default to the "last one wins" option.
  • "null" means that the source file is not part of the open source repository.
  • After each open source push or pull, this mapping is updated automatically and a snapshot is created. This is necessary to identify additions to and deletions from the source code since the last run.
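The three rules above can be illustrated with a small sketch. The function names and the exact treatment of a losing file are assumptions; the text only states that the last source wins, that null marks an excluded file, and that snapshots are compared between runs.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

FileMap = Dict[str, Optional[str]]


def resolve_conflicts(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Apply 'last one wins': pairs are (source, target) in module order.

    When several sources claim the same target, only the last one is kept.
    (Dropping the earlier sources is an assumption of this sketch.)
    """
    winner: Dict[str, str] = {}
    for source, target in pairs:
        winner[target] = source
    return {source: target for target, source in winner.items()}


def published_files(file_map: FileMap) -> Dict[str, str]:
    """Keep entries with a target; None (JSON null) means 'not open sourced'."""
    return {src: dst for src, dst in file_map.items() if dst is not None}


def diff_snapshots(previous: FileMap, current: FileMap) -> Tuple[Set[str], Set[str]]:
    """Return (added, deleted) source files between two mapping snapshots."""
    added = set(current) - set(previous)
    deleted = set(previous) - set(current)
    return added, deleted
```

For example, if both ${dataset-gms}/impl and ${user-gms}/impl contained a file that maps to the same path under gms/impl, the file from ${user-gms}, listed last, would be the one published.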

Commit log generation

Commit logs for open source commits are also generated automatically by merging the commit logs of the internal repositories. Below is a sample commit log that shows the structure of a commit log generated by our tool. A commit clearly indicates which versions of the source repositories are packaged in that commit and provides a summary of the commit log. A real example of a commit log generated by our toolkit can be found in the commit history of the open source repository.

metadata-models 29.0.0 -> 30.0.0
    Added aspect model foo
    Fixed issue bar

dataset-gms 2.3.0 -> 2.3.4
    Added rest.li API to serve foo aspect

MP_VERSION=dataset-gms:2.3.4
MP_VERSION=metadata-models:30.0.0
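A formatter that produces the layout of the sample above might look like the sketch below. The RepoUpdate type and the function name are hypothetical, and sorting the MP_VERSION lines by repository name is inferred from the sample rather than documented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RepoUpdate:
    """Version bump of one internal repository plus its commit summaries."""
    name: str
    old_version: str
    new_version: str
    messages: List[str]


def format_commit_log(updates: List[RepoUpdate]) -> str:
    """Merge per-repository commit logs into one open source commit message."""
    sections = []
    for update in updates:
        lines = [f"{update.name} {update.old_version} -> {update.new_version}"]
        lines += [f"    {message}" for message in update.messages]
        sections.append("\n".join(lines))
    # Trailer lines record exactly which internal versions were packaged.
    trailers = [
        f"MP_VERSION={update.name}:{update.new_version}"
        for update in sorted(updates, key=lambda u: u.name)
    ]
    return "\n\n".join(sections + ["\n".join(trailers)])


if __name__ == "__main__":
    print(format_commit_log([
        RepoUpdate("metadata-models", "29.0.0", "30.0.0",
                   ["Added aspect model foo", "Fixed issue bar"]),
        RepoUpdate("dataset-gms", "2.3.0", "2.3.4",
                   ["Added rest.li API to serve foo aspect"]),
    ]))
```

Machine-readable trailer lines make it possible to recover the last synced version of each internal repository directly from the open source history.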

Dependency testing

LinkedIn has a dependency testing infrastructure that helps ensure that changes to an internal multiproduct do not break the build of dependent multiproducts. The open source DataHub repository is not a multiproduct, and it cannot be a direct dependency of any multiproduct, but with the help of a multiproduct wrapper that fetches the open source DataHub source code, we can still use this dependency testing system. Thus, any change (which may later be exposed) to any of the multiproducts that feed the open source DataHub repository triggers a build event in the wrapper multiproduct. Therefore, any change that fails to build the wrapper product fails the tests before the original product is committed, and is reverted.

This is a useful mechanism that helps prevent any internal commit from breaking the open source build and detects the problem at commit time. Without it, it would be quite hard to determine which internal commit caused the open source repository build to fail, because we batch internal changes to the DataHub open source repository.

Differences between open source DataHub and our production version

Up to this point, we have discussed our solution for syncing the two versions of the DataHub repositories, but we have not yet outlined why we need two different development streams in the first place. In this section, we list the differences between the public version of DataHub and the production version on LinkedIn servers, and explain the reasons for these differences.

One source of divergence is that our production version has dependencies on code that is not yet open source, such as LinkedIn's Offspring (LinkedIn's internal dependency injection framework). Offspring is widely used in internal codebases because it is the preferred method for managing dynamic configuration. But it is not open source, so we needed to find open source alternatives for the open source DataHub.

There are other reasons too. As we create extensions to the metadata model for LinkedIn's needs, these extensions are typically very specific to LinkedIn and may not directly apply to other environments. For example, we have very specific labels for member IDs and other types of matching metadata. So, we have excluded these extensions from DataHub's open source metadata model. As we engage with the community and understand its needs, we will work on common open source versions of these extensions where needed.

Ease of use and easier adaptation for the open source community also inspired some of the differences between the two versions of DataHub. Differences in stream processing infrastructure are a good example. Although our internal version uses a managed stream processing framework, we chose to use embedded (standalone) stream processing for the open source version because it avoids creating another infrastructure dependency.

Another example of a difference is having a single GMS (Generalized Metadata Store) in the open source implementation rather than multiple GMSs. GMA (Generalized Metadata Architecture) is the name of the back-end architecture for DataHub, and GMS is the metadata store in the context of GMA. GMA is a very flexible architecture that allows you to distribute each data construct (e.g., datasets, users, etc.) into its own metadata store, or to store multiple data constructs in a single metadata store, as long as the registry that maps data constructs to GMSs is kept up to date. For ease of use, we chose a single GMS instance that stores all the different data constructs in the open source DataHub.
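The role of the registry can be pictured with a toy lookup. This sketch does not reflect GMA's real registry format, which the text does not describe; the dictionaries and the resolve_gms function are assumptions, with service names borrowed from elsewhere in this post.

```python
from typing import Dict

# Hypothetical registries mapping a data construct to the GMS that stores it.
DISTRIBUTED_REGISTRY: Dict[str, str] = {  # one store per construct
    "dataset": "dataset-gms",
    "user": "user-gms",
}
SINGLE_GMS_REGISTRY: Dict[str, str] = {  # every construct in one store
    "dataset": "datahub-gms",
    "user": "datahub-gms",
}


def resolve_gms(construct: str, registry: Dict[str, str]) -> str:
    """Look up which metadata store serves the given data construct."""
    try:
        return registry[construct]
    except KeyError:
        raise LookupError(
            f"no GMS registered for data construct {construct!r}"
        ) from None
```

Moving between the two layouts then only changes the registry contents, not the code that looks a construct up.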

The full list of differences between the two implementations is given in the table below.

Product feature: Supported data constructs
  LinkedIn DataHub: Datasets, Users, Metrics, ML Features, Charts, Dashboards
  Open source DataHub: Datasets, Users

Product feature: Supported metadata sources for datasets
  LinkedIn DataHub: Ambry, Couchbase, Dalids, Espresso, HDFS, Hive, Kafka, MongoDB, MySQL, Oracle, Pinot, Presto, Seas, Teradata, Vector, Venice
  Open source DataHub: Hive, Kafka, RDBMS

Product feature: Pub-sub
  LinkedIn DataHub: LinkedIn Kafka
  Open source DataHub: Confluent Kafka

Product feature: Stream processing
  LinkedIn DataHub: Managed
  Open source DataHub: Embedded (standalone)

Product feature: Dependency injection and dynamic configuration
  LinkedIn DataHub: LinkedIn Offspring
  Open source DataHub: Spring

Product feature: Build tooling
  LinkedIn DataHub: Ligradle (LinkedIn's internal Gradle wrapper)
  Open source DataHub: Gradlew

Product feature: CI/CD
  LinkedIn DataHub: CRT (LinkedIn's internal CI/CD)
  Open source DataHub: TravisCI and Docker Hub

Product feature: Metadata stores
  LinkedIn DataHub: Distributed multiple GMSs (Dataset GMS, User GMS, Metric GMS, Feature GMS, Chart/Dashboard GMS)
  Open source DataHub: Single GMS for Datasets and Users

Microservices in Docker containers

Docker simplifies application deployment and distribution through containerization. Every part of the service in open source DataHub, including infrastructure components such as Kafka, Elasticsearch, Neo4j, and MySQL, has its own Docker image. To orchestrate the Docker containers, we used Docker Compose.

Figure 2: Architecture of open source DataHub

You can see the high-level architecture of DataHub in the figure above. Apart from the infrastructure components, it has four different Docker containers:

datahub-gms: the metadata store service.

datahub-frontend: a Play application that serves the DataHub interface.

datahub-mce-consumer: a Kafka Streams application that consumes the metadata change event (MCE) stream and updates the metadata store.

datahub-mae-consumer: a Kafka Streams application that consumes the metadata audit event (MAE) stream and builds the search index and graph database.

The open source repository documentation and the original DataHub blog post contain more detailed information about the functions of the various services.
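The interplay between the two consumers listed above can be pictured with a toy, in-memory model. It uses no Kafka, and the event fields ("urn", "aspect", "name"), the sample identifier, and the class itself are all hypothetical. It also assumes that an audit event is emitted whenever the metadata store is updated, which is how the two streams relate in DataHub's published design but is not spelled out in this post.

```python
from collections import deque
from typing import Deque, Dict


class ToyPipeline:
    """In-memory stand-in for the MCE -> store -> MAE -> index flow."""

    def __init__(self) -> None:
        self.metadata_store: Dict[str, dict] = {}  # role of datahub-gms
        self.search_index: Dict[str, str] = {}     # stand-in for the search index
        self.audit_stream: Deque[dict] = deque()   # stand-in for the MAE stream

    def consume_mce(self, event: dict) -> None:
        """Role of datahub-mce-consumer: apply a proposed change to the store."""
        urn, aspect = event["urn"], event["aspect"]
        old = self.metadata_store.get(urn)
        self.metadata_store[urn] = aspect
        # Assumption: each committed change is recorded as an audit event.
        self.audit_stream.append({"urn": urn, "old": old, "new": aspect})

    def consume_mae(self) -> None:
        """Role of datahub-mae-consumer: rebuild the index from audit events."""
        while self.audit_stream:
            event = self.audit_stream.popleft()
            self.search_index[event["urn"]] = event["new"].get("name", "")


if __name__ == "__main__":
    pipeline = ToyPipeline()
    pipeline.consume_mce({"urn": "example:dataset:foo", "aspect": {"name": "foo"}})
    pipeline.consume_mae()
    print(pipeline.search_index)
```

Keeping the index update behind a separate stream means the search index only ever reflects changes that were actually committed to the store.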

CI/CD for open source DataHub

The open source DataHub repository uses TravisCI for continuous integration and Docker Hub for continuous deployment. Both have good GitHub integration and are easy to set up. For most open source infrastructure developed by the community or by private companies (e.g., Confluent), Docker images are created and deployed to Docker Hub for ease of use by the community. Any Docker image found on Docker Hub can be used with a simple docker pull command.

With every commit to the DataHub open source repository, all Docker images are automatically built and deployed to Docker Hub with the "latest" tag. If Docker Hub is configured with regular-expression naming rules, all tags in the open source repository are also released with corresponding tag names on Docker Hub.
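The tagging rule can be sketched as a small function. The "master" branch name, the "v1.2.3" tag pattern, and the function itself are assumptions chosen for illustration; the actual Docker Hub build rules of the repository are not given here.

```python
import re
from typing import Optional

# Assumed naming rule: a git tag like "v0.3.1" publishes Docker tag "0.3.1".
TAG_PATTERN = re.compile(r"^v(\d+\.\d+\.\d+)$")


def docker_tag_for(ref_type: str, ref_name: str) -> Optional[str]:
    """Map a git ref to the Docker tag to publish, or None to skip it."""
    # Assumption: commits land on a branch called "master".
    if ref_type == "branch" and ref_name == "master":
        return "latest"
    if ref_type == "tag":
        match = TAG_PATTERN.match(ref_name)
        if match:
            return match.group(1)
    return None
```

Refs that match neither rule, such as feature branches, produce no image in this sketch.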

Using DataHub

Setting up DataHub is very simple and consists of three easy steps:

  1. Clone the open source repository and start all Docker containers with docker-compose, using the provided docker-compose script for a quick start.
  2. Load the sample data provided in the repository using the command-line tool that is also provided.
  3. Browse DataHub in your browser.

An actively monitored Gitter chat is also set up for quick questions. Users can also create issues directly in the GitHub repository. Most importantly, we welcome and appreciate all feedback and suggestions!

Future plans

Currently, every infrastructure component or microservice for open source DataHub is built as a Docker container, and the whole system is orchestrated using docker-compose. Given the popularity and widespread adoption of Kubernetes, we would also like to provide a Kubernetes-based solution in the near future.

We also plan to provide a turnkey solution for deploying DataHub on a public cloud service such as Azure, AWS, or Google Cloud. Given the recent announcement of LinkedIn's migration to Azure, this will be in line with the metadata team's internal priorities.

Last but not least, thank you to all the early adopters of DataHub in the open source community who evaluated the DataHub alphas and helped us identify issues and improve the documentation.

Source: www.habr.com
