Diving Into Delta Lake: Schema Enforcement & Evolution

Hello, Habr! Here is a translation of the article "Diving Into Delta Lake: Schema Enforcement & Evolution" by Burak Yavuz, Brenner Heintz and Denny Lee, prepared ahead of the start of the Data Engineer course from OTUS.


Data, like our experience, is always accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, which defines how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so does the structure of your data. Delta Lake makes it easy to introduce new dimensions as the data changes. Users have access to simple semantics to control the schemas of their tables. These tools include Schema Enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and Schema Evolution, which lets new columns of valuable data be added automatically where they belong. In this article, we take a closer look at how to use these tools.

Understanding Table Schemas

Every DataFrame in Apache Spark contains a schema: a blueprint that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table's schema is saved in JSON format inside the transaction log.
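To make "saved in JSON format" more concrete, here is a minimal sketch that parses a schema string in the layout Spark uses when it serializes a schema. The sample string is hand-written for the loans table used later in this article (an assumption, not copied from a real transaction log), and `column_types` is a hypothetical helper.

```python
import json

# Hand-written sample in the JSON layout Spark uses for a struct schema.
# Assumption: the table metadata in a Delta transaction log carries a
# string like this one; the values here are illustrative only.
schema_string = """
{
  "type": "struct",
  "fields": [
    {"name": "addr_state", "type": "string", "nullable": true, "metadata": {}},
    {"name": "count", "type": "integer", "nullable": true, "metadata": {}}
  ]
}
"""

def column_types(schema_json):
    """Return {column name: type name} for the top-level fields (hypothetical helper)."""
    schema = json.loads(schema_json)
    return {field["name"]: field["type"] for field in schema["fields"]}

print(column_types(schema_string))
```

Because the schema is plain JSON, it can be inspected with ordinary tooling, without starting a Spark session.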

What Is Schema Enforcement?

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the front-desk manager of a busy, reservations-only restaurant, it checks whether each column in the data being inserted into the table is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any write that contains columns not on the list.

How Does Schema Enforcement Work?

Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch.
To determine whether a write to a table is compatible, Delta Lake uses the following rules. The DataFrame to be written:

  • cannot contain any additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table: those columns are simply assigned null values.
  • cannot have column data types that differ from the column data types in the target table. If a target table's column contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write from taking place.
  • cannot contain column names that differ only by case. This means that you cannot have columns named 'Foo' and 'foo' defined in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (which we have personally experienced at Databricks), we decided to add this restriction.
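The three rules above can be modeled in a few lines of plain Python. This is only a simplified sketch under stated assumptions: schemas are represented as dicts of column name to type name, and `check_write` is a hypothetical function, not Delta Lake's actual implementation.

```python
def check_write(table_schema, data_schema):
    """Return a list of violations; an empty list means the write is allowed.

    Both schemas are plain dicts of {column name: type name}. This is a
    simplified model of the three rules, not Delta Lake's actual code.
    """
    problems = []

    # Rule 3: column names may not differ only by case.
    seen = {}
    for name in data_schema:
        key = name.lower()
        if key in seen:
            problems.append(f"columns differ only by case: {seen[key]!r} / {name!r}")
        seen[key] = name

    # Compare against the table case-insensitively, as the schema is stored.
    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}
    for name, dtype in data_schema.items():
        expected = table_by_lower.get(name.lower())
        if expected is None:
            # Rule 1: no columns that the table does not already have.
            problems.append(f"extra column: {name!r}")
        elif expected != dtype:
            # Rule 2: data types must match.
            problems.append(f"type mismatch for {name!r}: table has {expected}, data has {dtype}")

    # Columns missing from the data are fine: they would be filled with nulls.
    return problems


table = {"addr_state": "string", "count": "long"}
print(check_write(table, {"addr_state": "string"}))  # a missing column is allowed
print(check_write(table, {"addr_state": "string", "count": "long", "amount": "double"}))
print(check_write(table, {"addr_state": "string", "count": "string"}))
```

The second call mirrors the loans example in this article: the extra `amount` column is what makes the append fail.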

To illustrate, take a look at what happens in the code below when we try to append some newly calculated columns to a Delta Lake table that is not yet set up to accept them.

# Generate a DataFrame of loans that we will append to our Delta Lake table
loans = sql("""
            SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
            CAST(rand(10) * 10000 * count AS double) AS amount
            FROM loan_by_state_delta
            """)

# Print the original DataFrame schema
original_loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
 
# Print the new DataFrame schema
loans.printSchema()
 
root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
  |-- amount: double (nullable = true) # new column
 
# Attempt to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
           .mode("append") \
           .save(DELTALAKE_PATH)

Returns:

A schema mismatch detected when writing to the Delta table.
 
To enable schema migration, please set:
'.option("mergeSchema", "true")'
 
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
 
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
 
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write from occurring. To help identify which column (or columns) caused the mismatch, Spark prints both schemas in the stack trace for comparison.

How Is Schema Enforcement Useful?

Because it is such a strict check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It is typically enforced on tables that directly feed:

  • Machine learning algorithms
  • BI dashboards
  • Data analytics and visualization tools
  • Any production system that requires highly structured, strongly typed, semantic schemas.

To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, take a look at the article Productionizing Machine Learning With Delta Lake.

Of course, schema enforcement can be used anywhere in your pipeline, but be aware that it can be frustrating to have a streaming write to a table fail because, for example, you forgot that you added one more column to the incoming data.

Preventing Data Dilution

At this point you might be asking yourself: what is all the fuss about? After all, an unexpected "schema mismatch" error can sometimes trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to, so that I can write my DataFrame no matter what?

As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you do not enforce your schema, data type compatibility issues will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other scary things that go bump in the night. A much better approach is to stop these enemies at the gates, using schema enforcement, and deal with them in the daylight rather than later, when they are lurking in the dark corners of your production code.

Schema enforcement gives you confidence that your table's schema will not change unless you explicitly approve the change. This prevents data dilution, which can occur when new columns are added so frequently that formerly valuable, concise tables lose their meaning and usefulness in the flood of data. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement does exactly what it was designed to do: keep you honest and your tables clean.

If, on further reflection, you decide that you really did mean to add that new column, no problem: the fix is a single line, described below. The solution is schema evolution!

What Is Schema Evolution?

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. It is most commonly used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

How Does Schema Evolution Work?

Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected because of a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.

# Add the mergeSchema option
loans.write.format("delta") \
           .option("mergeSchema", "true") \
           .mode("append") \
           .save(DELTALAKE_SILVER_PATH)

To view the plot, run the following Spark SQL query.

# Create a plot with the new column to confirm the write was successful
%sql
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`)
DESC LIMIT 10

Alternatively, you can set this option for the entire Spark session by setting spark.databricks.delta.schema.autoMerge.enabled to true in your Spark configuration. Use this with caution, though, because schema enforcement will no longer warn you about unintended schema mismatches.

When the mergeSchema option is included in the query, any columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and these are likewise added to the end of their respective struct columns.
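The "appended to the end" behavior can be sketched in plain Python. This is a simplified model under stated assumptions: a schema is a list of (name, type) pairs, a struct is a nested list of such pairs, `merge_schema` is a hypothetical function, and the nested `borrower` column is invented purely for illustration.

```python
def merge_schema(table_fields, data_fields):
    """Merge two schemas given as lists of (name, type) pairs.

    A type is either a type-name string or a nested list of pairs (a struct).
    Simplified model: existing columns keep their position, new ones go last.
    Type conflicts are out of scope here; the table's type is kept as-is.
    """
    merged = list(table_fields)
    positions = {name: i for i, (name, _) in enumerate(merged)}
    for name, dtype in data_fields:
        if name not in positions:
            # New column: append to the end of the schema (or of the struct).
            positions[name] = len(merged)
            merged.append((name, dtype))
        else:
            existing = merged[positions[name]][1]
            if isinstance(existing, list) and isinstance(dtype, list):
                # Both are structs: merge their fields recursively.
                merged[positions[name]] = (name, merge_schema(existing, dtype))
    return merged


table = [("addr_state", "string"), ("count", "long"), ("borrower", [("id", "long")])]
data = [("amount", "double"), ("addr_state", "string"),
        ("borrower", [("id", "long"), ("age", "integer")])]

for field in merge_schema(table, data):
    print(field)
```

Note that the order of columns in the incoming DataFrame does not reorder the table: `amount` ends up last even though it comes first in `data`.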

Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column with this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

  • Adding new columns (this is the most common scenario)
  • Changing data types from NullType -> any other type, or upcasting from ByteType -> ShortType -> IntegerType
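The type rule in the list above can be expressed as a small check. This is a sketch only: `can_evolve` is a hypothetical function, type names are plain strings, and it assumes that the ByteType -> ShortType -> IntegerType chain also permits skipping a step (ByteType directly to IntegerType).

```python
# Widening order named in the list above; NullType may become any type.
UPCAST_ORDER = ["ByteType", "ShortType", "IntegerType"]

def can_evolve(current_type, new_type):
    """Return True if schema evolution alone could change current_type to new_type."""
    if current_type == new_type:
        return True  # nothing to change
    if current_type == "NullType":
        return True  # NullType -> any other type
    if current_type in UPCAST_ORDER and new_type in UPCAST_ORDER:
        # Only widening is allowed, never narrowing.
        return UPCAST_ORDER.index(current_type) < UPCAST_ORDER.index(new_type)
    return False

print(can_evolve("ByteType", "IntegerType"))    # widening
print(can_evolve("IntegerType", "ShortType"))   # narrowing
print(can_evolve("IntegerType", "StringType"))  # unrelated types
```

Any change for which this check returns False falls into the category that needs a full overwrite of schema and data.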

Other changes, which are not eligible for schema evolution, require that the schema and data be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer and the new schema makes it a string, then all of the Parquet (data) files would need to be rewritten. These changes include:

  • dropping a column
  • changing an existing column's data type (in place)
  • renaming columns that differ only by case (for example, "Foo" and "foo")
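As a hedged sketch of how such an overwrite might be written, the helper below chains the standard DataFrameWriter calls together with the overwriteSchema option named above. The function name `overwrite_with_new_schema` is hypothetical, and `df` is assumed to be a Spark DataFrame that already has the desired new schema.

```python
def overwrite_with_new_schema(df, path):
    """Replace both the data and the schema of the Delta table at `path`.

    `df` is expected to be a Spark DataFrame that already has the desired
    schema; the function only chains standard DataFrameWriter calls.
    """
    (df.write
       .format("delta")
       .mode("overwrite")                  # rewrite the table's data ...
       .option("overwriteSchema", "true")  # ... and replace its schema as well
       .save(path))
```

Unlike mergeSchema, this rewrites the whole table, so it is best reserved for deliberate, reviewed changes.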

Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:

  • adding columns
  • changing column comments
  • setting table properties that define the behavior of the table, such as the retention duration of the transaction log.
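For orientation, the three actions above might look roughly like the statements built below. This is a sketch under assumptions: the exact ALTER TABLE syntax depends on the Spark and Delta Lake versions in use, the column comments are invented, and `alter_table_examples` is a hypothetical helper that only builds SQL strings.

```python
def alter_table_examples(table="loan_by_state_delta"):
    """Build illustrative ALTER TABLE statements (strings only, nothing is executed)."""
    return [
        # Add a column, with an optional comment.
        f"ALTER TABLE {table} ADD COLUMNS (amount DOUBLE COMMENT 'Loan amount')",
        # Change the comment of an existing column.
        f"ALTER TABLE {table} CHANGE COLUMN amount amount DOUBLE COMMENT 'Total loan amount'",
        # Set a table property: how long transaction log history is retained.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')",
    ]

for statement in alter_table_examples():
    print(statement)  # with a live session, each could be passed to spark.sql(...)
```

Explicit DDL of this kind documents the intent of a schema change in a way that an implicit merge on write does not.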

How Is Schema Evolution Useful?

Schema evolution can be used whenever you intend to change the schema of your table (as opposed to when you accidentally added columns to your DataFrame that should not be there). It is the easiest way to migrate your schema, because it automatically adds the correct column names and data types without your having to declare them explicitly.

Conclusion

Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly, which allows them to make better business decisions.

On the other hand, schema evolution complements enforcement by making it easy for intended schema changes to happen automatically. After all, it should not be hard to add a column.

Schema enforcement is the yang to schema evolution's yin. Used together, these features make it easier than ever to block out the noise and tune in to the signal.

We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.

Other articles in this series:

Diving Into Delta Lake: Unpacking the Transaction Log

Related articles

Productionizing Machine Learning With Delta Lake

What Is a Data Lake?

Learn more about the course

Source: www.hab.com
