Diving Into Delta Lake: Schema Enforcement & Evolution

Hello, Habr! Here is a translation of the article "Diving Into Delta Lake: Schema Enforcement & Evolution" by Burak Yavuz, Brenner Heintz and Denny Lee, prepared ahead of the start of the "Data Engineer" course from OTUS.


Data, like our experience, is always accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, which defines how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so does the structure of your data. Delta Lake makes it easy to incorporate new dimensions as the data changes. Users have access to simple semantics for controlling the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and schema evolution, which lets them automatically add new columns of rich data when those columns belong. In this article, we'll dive into how to use these tools.

Understanding Table Schemas

Every DataFrame in Apache Spark contains a schema that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table's schema is saved in JSON format inside the transaction log.
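To give a rough idea of what such a stored schema looks like, the sketch below parses a schema string written in the JSON layout Spark uses for struct types. The string is hand-written for illustration and modelled on the loans example used later in this article; it is an assumption, not a copy of a real transaction log entry, and the helper name describe_schema is hypothetical.

```python
import json

# Hand-written example in Spark's struct-type JSON layout (an assumption for
# illustration, not taken from a real Delta transaction log).
schema_string = """
{
  "type": "struct",
  "fields": [
    {"name": "addr_state", "type": "string", "nullable": true, "metadata": {}},
    {"name": "count", "type": "long", "nullable": true, "metadata": {}}
  ]
}
"""


def describe_schema(raw):
    """Return (name, type, nullable) tuples for each top-level field."""
    parsed = json.loads(raw)
    return [(f["name"], f["type"], f["nullable"]) for f in parsed["fields"]]


columns = describe_schema(schema_string)
for name, dtype, nullable in columns:
    # Mimic the layout of DataFrame.printSchema() for readability.
    print(f"|-- {name}: {dtype} (nullable = {str(nullable).lower()})")
```

Because the schema is plain JSON, it can be inspected with ordinary tooling, without starting a Spark session.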

What Is Schema Enforcement?

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the front desk manager at a popular reservation-only restaurant, it checks whether each column of the data being inserted into the table is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any write with columns that are not on the list.

How Does Schema Enforcement Work?

Delta Lake uses schema validation on write, which means that every new write to a table is checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch.
To determine whether a write to a table is compatible, Delta Lake uses the following rules. The DataFrame to be written:

  • cannot contain any additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table: those columns are simply assigned null values.
  • cannot have column data types that differ from the column data types in the target table. If a column in the target table contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write operation from taking place.
  • cannot contain column names that differ only by case. This means you cannot have columns such as 'Foo' and 'foo' defined in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (which we have personally experienced at Databricks), we decided to add this restriction.
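The three rules above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not Delta Lake's actual implementation: schemas are assumed to be flat dictionaries mapping column names to type names, and the function name find_violations is hypothetical.

```python
def find_violations(table_schema, data_schema):
    """Compare two {column_name: type_name} dicts and list rule violations."""
    violations = []
    # Column lookup is case-insensitive, mirroring how the schema is stored.
    table_by_lower = {name.lower(): name for name in table_schema}

    # Rule 3: column names in the incoming data must not differ only by case.
    seen = {}
    for name in data_schema:
        key = name.lower()
        if key in seen:
            violations.append(
                f"columns '{seen[key]}' and '{name}' differ only by case"
            )
        seen.setdefault(key, name)

    for name, dtype in data_schema.items():
        target = table_by_lower.get(name.lower())
        if target is None:
            # Rule 1: no columns that the target table does not have.
            violations.append(f"extra column '{name}'")
        elif table_schema[target] != dtype:
            # Rule 2: data types must match the target column.
            violations.append(
                f"column '{name}' is {dtype}, table expects {table_schema[target]}"
            )
    return violations


table = {"addr_state": "string", "count": "long"}

# A missing column is fine: it would simply be filled with nulls.
ok = find_violations(table, {"addr_state": "string"})
# An extra column and a mismatched type are both rejected.
extra = find_violations(
    table, {"addr_state": "string", "count": "long", "amount": "double"}
)
wrong_type = find_violations(table, {"addr_state": "string", "count": "string"})
```

In this model a write would go ahead only when the returned list is empty; any violation cancels the whole write, matching the all-or-nothing behaviour described above.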

To illustrate, let's look at what happens in the code below when we try to append some newly calculated columns to a Delta Lake table that is not yet set up to accept them.

# Generate a DataFrame of loans that we'll append to our Delta Lake table
loans = sql("""
            SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
            CAST(rand(10) * 10000 * count AS double) AS amount
            FROM loan_by_state_delta
            """)

# ВывСсти ΠΈΡΡ…ΠΎΠ΄Π½ΡƒΡŽ схСму DataFrame
original_loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
 
# ВывСсти Π½ΠΎΠ²ΡƒΡŽ схСму DataFrame
loans.printSchema()
 
root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
  |-- amount: double (nullable = true) # new column
 
# Attempt to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
           .mode("append") \
           .save(DELTALAKE_PATH)

Returns:

A schema mismatch detected when writing to the Delta table.
 
To enable schema migration, please set:
'.option("mergeSchema", "true")'
 
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
 
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
 
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write from occurring. To help identify which column (or set of columns) caused the mismatch, Spark prints both schemas in the stack trace for comparison.

How Is Schema Enforcement Useful?

Because it is such a stringent check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It is typically enforced on tables that directly feed:

  • Machine learning algorithms
  • BI dashboards
  • Data analytics and visualization tools
  • Any production system requiring highly structured, strongly typed, semantic schemas.

To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, take a look at the article "Productionizing Machine Learning With Delta Lake".

Of course, schema enforcement can be used anywhere in your pipeline, but keep in mind that it can be frustrating to have a streaming write to a table fail because, for example, you forgot that you added a single column to the incoming data.

Preventing Data Dilution

At this point, you might be asking yourself what all the fuss is about. After all, an unexpected "schema mismatch" error can sometimes trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to, so that you can write your DataFrame no matter what?

As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you do not enforce your schema, data type compatibility issues will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other scary things that go bump in the night. A much better approach is to stop these enemies at the gates, using schema enforcement, and deal with them in the daylight rather than later on, when they are lurking in the dark recesses of your production code.

Schema enforcement gives you the assurance that your table's schema will not change unless you explicitly approve the change. This prevents data dilution, which can occur when new columns are appended so frequently that formerly rich, concise tables lose their meaning and usefulness in the resulting data deluge. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement does exactly what it was designed to do: it keeps you honest and your tables clean.

If, on further reflection, you decide that you really did mean to add that new column, no problem: there is a one-line fix, described below. The solution is schema evolution!

What Is Schema Evolution?

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. Most commonly, it is used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

How Does Schema Evolution Work?

Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your Spark .write or .writeStream command.

# Add the mergeSchema option
loans.write.format("delta") \
           .option("mergeSchema", "true") \
           .mode("append") \
           .save(DELTALAKE_SILVER_PATH)

To view the plot, run the following Spark SQL query:

%sql
-- Create a plot with the new column to confirm the write was successful
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`)
DESC LIMIT 10

Alternatively, you can set this option for the entire Spark session by adding spark.databricks.delta.schema.autoMerge = True to your Spark configuration. Use this with caution, though, because schema enforcement will no longer warn you about unintended schema mismatches.

When the mergeSchema option is included in your query, any columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and they are likewise added to the end of their respective struct columns.
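The append-to-the-end behaviour described above can be modelled with a small recursive merge. This is a simplified sketch, not Delta Lake's implementation: schemas are assumed to be ordered dictionaries in which a nested dictionary stands for a struct column, and both the function name merge_schemas and the borrower struct column are hypothetical.

```python
def merge_schemas(table_schema, data_schema):
    """Return a new schema: table columns first, new data columns appended."""
    merged = dict(table_schema)  # dicts preserve insertion order
    for name, dtype in data_schema.items():
        if name not in merged:
            # New top-level column (or nested field): goes to the end.
            merged[name] = dtype
        elif isinstance(merged[name], dict) and isinstance(dtype, dict):
            # Struct column on both sides: merge its fields recursively.
            merged[name] = merge_schemas(merged[name], dtype)
        # Existing non-struct columns are left untouched in this sketch.
    return merged


# 'borrower' is a made-up struct column used only for this illustration.
table = {"addr_state": "string", "count": "long", "borrower": {"id": "long"}}
data = {
    "amount": "double",
    "addr_state": "string",
    "borrower": {"id": "long", "grade": "string"},
}

merged = merge_schemas(table, data)
print(list(merged))              # top-level column order after the merge
print(list(merged["borrower"]))  # field order inside the struct
```

Note that the order of columns in the incoming data does not affect the existing columns: amount comes first in the data here, but still ends up last in the merged schema.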

Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column of this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

  • Adding new columns (this is the most common scenario)
  • Changing data types from NullType -> any other type, or upcasting from ByteType -> ShortType -> IntegerType
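As a rough sketch, the type-change rule in the list above could be expressed as a small predicate. It covers only the transitions named in this article and assumes that skipping a step in the chain (ByteType straight to IntegerType) is also allowed; the names UPCAST_CHAIN and is_eligible_type_change are hypothetical.

```python
# Upcast chain named in the list above, from narrowest to widest.
UPCAST_CHAIN = ["ByteType", "ShortType", "IntegerType"]


def is_eligible_type_change(old_type, new_type):
    """Return True if old_type -> new_type fits the schema evolution rules."""
    if old_type == new_type:
        return True  # nothing changes
    if old_type == "NullType":
        return True  # NullType may become any other type
    if old_type in UPCAST_CHAIN and new_type in UPCAST_CHAIN:
        # Only widening is allowed, never narrowing.
        return UPCAST_CHAIN.index(old_type) < UPCAST_CHAIN.index(new_type)
    return False


checks = {
    ("NullType", "StringType"): is_eligible_type_change("NullType", "StringType"),
    ("ByteType", "IntegerType"): is_eligible_type_change("ByteType", "IntegerType"),
    ("IntegerType", "ShortType"): is_eligible_type_change("IntegerType", "ShortType"),
    ("IntegerType", "StringType"): is_eligible_type_change("IntegerType", "StringType"),
}
```

Any transition for which this predicate returns False falls outside schema evolution and would need the schema and data to be overwritten instead.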

Other changes, which are not eligible for schema evolution, require that the schema and data be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer and the new schema makes it a string, all of the Parquet (data) files would need to be rewritten. Such changes include:

  • dropping a column
  • changing the data type of an existing column (in place)
  • renaming columns whose names differ only by case (for example, "Foo" and "foo").

Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:

  • adding columns
  • changing column comments
  • setting table properties that define the behavior of the table, such as the length of time the transaction log is retained.
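For orientation, the sketch below builds one statement for each of the three actions listed above. It assumes Spark 3.x SQL syntax and reuses the loan_by_state_delta table from the earlier examples; the helper functions, the comment text, and the 30-day retention value are illustrative assumptions, and the helpers do no quoting or escaping, so they are only suitable for trusted input.

```python
TABLE = "loan_by_state_delta"  # table name taken from the examples above


def add_column_sql(table, column, dtype):
    """Statement that adds one new column to the table."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {dtype})"


def column_comment_sql(table, column, comment):
    """Statement that changes the comment on an existing column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} COMMENT '{comment}'"


def table_property_sql(table, key, value):
    """Statement that sets one table property."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('{key}' = '{value}')"


statements = [
    add_column_sql(TABLE, "amount", "DOUBLE"),
    column_comment_sql(TABLE, "amount", "Total loan amount"),
    # Assumed example: keep the transaction log history for 30 days.
    table_property_sql(TABLE, "delta.logRetentionDuration", "interval 30 days"),
]

for statement in statements:
    print(statement)
```

In a notebook with an active session, each statement could then be passed to spark.sql(...); unlike mergeSchema, this route spells out every change explicitly.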

How Is Schema Evolution Useful?

Schema evolution can be used whenever you intend to change the schema of your table (as opposed to when you have accidentally added columns to your DataFrame that should not be there). It is the easiest way to migrate your schema because it automatically adds the correct column names and data types, without you having to declare them explicitly.

Summary

Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and reason about it clearly, allowing them to make better business decisions.

On the other hand, schema evolution complements enforcement by making it easy for intended schema changes to take place automatically. After all, it should not be hard to add a column.

Schema enforcement is the yang to schema evolution's yin. Used together, these features make it easier than ever to suppress the noise and tune in to the signal.

We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.

Other articles in this series:

Diving Into Delta Lake: Unpacking the Transaction Log

Related articles

Productionizing Machine Learning With Delta Lake

What Is a Data Lake?

Learn more about the course

Source: www.habr.com
