Diving into Delta Lake: Schema Enforcement & Evolution

Hello, Habr! Here is a translation of the article "Diving into Delta Lake: Schema Enforcement & Evolution" by Burak Yavuz, Brenner Heintz and Denny Lee, prepared ahead of the start of the Data Engineer course from OTUS.

Data, like our experience, is always accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table schema, which defines how we categorize and process new information.

This brings us to schema management. As business problems and requirements change over time, so does the structure of your data. Delta Lake makes it easy to introduce new dimensions as the data changes. Users have access to simple semantics for managing the schemas of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and schema evolution, which lets new columns of valuable data be added automatically where they belong. In this article, we dive into how to use these tools.

Understanding Table Schemas

Every DataFrame in Apache Spark contains a schema: a blueprint that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table schema is stored in JSON format inside the transaction log.
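To make the "schema stored as JSON" idea concrete, here is a minimal sketch that reads a schema string and lists its columns. The JSON below is hand-written in the shape Spark uses when serializing a struct type; the column names are borrowed from the loan example later in the article, and the helper name `column_types` is hypothetical. It is an illustration only and does not read a real transaction log.

```python
import json

# Hypothetical schema string for illustration, shaped like Spark's JSON
# serialization of a struct type. It is written by hand here, not read
# from an actual Delta Lake transaction log.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "addr_state", "type": "string", "nullable": True, "metadata": {}},
        {"name": "count", "type": "long", "nullable": True, "metadata": {}},
    ],
})


def column_types(schema_json: str) -> dict:
    """Map each top-level column name to the type recorded in the schema."""
    schema = json.loads(schema_json)
    return {field["name"]: field["type"] for field in schema["fields"]}


columns = column_types(schema_string)
print(columns)
```

Because the schema is plain JSON, it can be inspected or compared with ordinary tools, which is what makes the checks described in the following sections possible.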

What Is Schema Enforcement?

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the host at the front desk of a popular reservation-only restaurant, it checks whether each column in the data being inserted into the table is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any writes with columns that are not on the list.

How Does Schema Enforcement Work?

Delta Lake uses schema validation on write, which means that every new write to a table is checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to tell the user about the mismatch.

Delta Lake uses the following rules to determine whether a write is compatible with the table. The DataFrame to be written:

  • cannot contain additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table: those columns are simply assigned null values.
  • cannot have column data types that differ from the column data types in the target table. If a target table column contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write from taking place.
  • cannot contain column names that differ only by case. This means you cannot have columns named 'Foo' and 'foo' in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (which we have personally experienced at Databricks), we decided to add this restriction.
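The three rules above can be sketched in plain Python. This is a simplified illustration, not Delta Lake's actual implementation: schemas are modeled as dictionaries mapping column names to type names, and `SchemaMismatch`, `check_write` and `is_compatible` are hypothetical names introduced for the example.

```python
class SchemaMismatch(Exception):
    """Raised by this sketch when a write would violate the table schema."""


def check_write(table_schema: dict, data_schema: dict) -> None:
    """Raise SchemaMismatch if data_schema breaks one of the three rules."""
    # Rule 3: no two incoming column names may differ only by case.
    lowered = [name.lower() for name in data_schema]
    if len(lowered) != len(set(lowered)):
        raise SchemaMismatch("column names differ only by case")

    # Table columns are looked up case-insensitively.
    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}
    for name, dtype in data_schema.items():
        expected = table_by_lower.get(name.lower())
        # Rule 1: no columns that the target table does not have.
        if expected is None:
            raise SchemaMismatch(f"unexpected column: {name}")
        # Rule 2: data types must match the target table.
        if expected != dtype:
            raise SchemaMismatch(f"type mismatch for {name}: {dtype} vs {expected}")
    # Table columns missing from the data are allowed: they would be null.


def is_compatible(table_schema: dict, data_schema: dict) -> bool:
    """Boolean wrapper around check_write for convenience."""
    try:
        check_write(table_schema, data_schema)
    except SchemaMismatch:
        return False
    return True


table = {"addr_state": "string", "count": "long"}
print(is_compatible(table, {"addr_state": "string"}))
print(is_compatible(table, {"addr_state": "string", "amount": "double"}))
```

The check raises before anything is "written", mirroring the all-or-nothing behavior described above: a write either passes every rule or is rejected as a whole.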

To illustrate, let's look at what happens in the code below when we try to append some newly generated columns to a Delta Lake table that is not yet set up to accept them.

# Generate a DataFrame of loans that we will append to our Delta Lake table
loans = sql("""
            SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
            CAST(rand(10) * 10000 * count AS double) AS amount
            FROM loan_by_state_delta
            """)

# Print the original DataFrame's schema
original_loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
 
# Print the new DataFrame's schema
loans.printSchema()
 
root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
  |-- amount: double (nullable = true) # new column
 
# Try to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
           .mode("append") \
           .save(DELTALAKE_PATH)

Returns:

A schema mismatch detected when writing to the Delta table.
 
To enable schema migration, please set:
'.option("mergeSchema", "true")'
 
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
 
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
 
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write. To help identify which column (or set of columns) caused the mismatch, Spark prints both schemas in the stack trace for comparison.

How Is Schema Enforcement Useful?

Because schema enforcement is such a strict check, it is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It is typically applied to tables that directly feed:

  • Machine learning algorithms
  • BI dashboards
  • Data analytics and visualization tools
  • Any production system that requires highly structured, strongly typed semantic schemas.

To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, see the article Productionizing Machine Learning with Delta Lake.

Of course, schema enforcement can be used anywhere in your pipeline, but keep in mind that it can be frustrating to have a streaming write to a table fail because, for example, you forgot that you added one more column to the incoming data.

Preventing Data Dilution

At this point you might be asking yourself: what is all the fuss about? After all, an unexpected "schema mismatch" error can sometimes trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to, so that I can write my DataFrame no matter what?

As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you do not take care to enforce your schema, data type compatibility issues will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other scary things you would only meet in nightmares. The better approach is to stop these enemies at the gate, with schema enforcement, and deal with them in the daylight rather than later, when they start lurking in the dark depths of your production code.

Schema enforcement gives you the assurance that your table's schema will not change unless you approve the change. This prevents data dilution, which can occur when new columns are added so often that formerly valuable, concise tables lose their meaning and usefulness in the flood of data. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement does exactly what it was designed to do: it keeps you honest and your tables clean.

If, on further review, you decide that you really do need to add that new column, no problem: it is a one-line fix, described below. The solution is schema evolution!

What Is Schema Evolution?

Schema evolution is a feature that lets users easily change a table's current schema to accommodate data that changes over time. It is most commonly used during an append or overwrite operation to automatically adapt the schema to include one or more new columns.

How Does Schema Evolution Work?

Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected because of a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.

# Add the mergeSchema option
loans.write.format("delta") \
           .option("mergeSchema", "true") \
           .mode("append") \
           .save(DELTALAKE_SILVER_PATH)

To view the plot, run the following Spark SQL query:

%sql
-- Plot the new column to confirm that the write succeeded
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`) DESC
LIMIT 10

Alternatively, you can set this option for the entire Spark session by adding spark.databricks.delta.schema.autoMerge = True to your Spark configuration. Use this with caution, though, because schema enforcement will no longer warn you about unintended schema mismatches.

When you include the mergeSchema option in your query, every column that is present in the DataFrame but not in the target table is automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and they are appended to the end of their respective struct columns.
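The append-at-the-end behavior described here can be sketched with ordinary dictionaries. This is an illustration under simplifying assumptions, not Delta Lake's merge logic: a schema is a dict from column name to type name, a struct column is a nested dict, type widening is ignored, and `merge_schema` plus the `borrower` column are hypothetical names made up for the example.

```python
def merge_schema(table: dict, data: dict) -> dict:
    """Return the table schema with new fields from `data` appended at the end."""
    # Python dicts preserve insertion order, so existing columns stay first.
    merged = dict(table)
    for name, dtype in data.items():
        if name not in merged:
            # New column: appended after all existing columns.
            merged[name] = dtype
        elif isinstance(merged[name], dict) and isinstance(dtype, dict):
            # Struct column on both sides: merge its nested fields recursively.
            merged[name] = merge_schema(merged[name], dtype)
        elif merged[name] != dtype:
            # This sketch does not model type changes at all.
            raise ValueError(f"incompatible types for column {name}")
    return merged


table = {"addr_state": "string", "count": "long", "borrower": {"id": "long"}}
data = {
    "addr_state": "string",
    "amount": "double",                              # new top-level column
    "borrower": {"id": "long", "grade": "string"},   # new nested field
}

merged = merge_schema(table, data)
print(list(merged))
print(list(merged["borrower"]))
```

Existing columns keep their positions, so readers that select the old columns by name are unaffected by the merge.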

Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column with this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.

The following types of schema changes are allowed as part of schema evolution during a table append or overwrite:

  • Adding new columns (this is the most common scenario)
  • Changing data types from NullType -> any other type, or upcasting from ByteType -> ShortType -> IntegerType
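The type-change rule in the list above can be expressed as a small predicate. This is a sketch of the rule as stated in this article only; the lowercase type names and the function name `can_evolve` are assumptions for illustration, and real Delta Lake versions may accept a different set of conversions.

```python
# Upcast chain as stated in the article: ByteType -> ShortType -> IntegerType.
UPCAST_ORDER = ["byte", "short", "integer"]


def can_evolve(old_type: str, new_type: str) -> bool:
    """Return True if old_type -> new_type is allowed by the rule above."""
    if old_type == new_type:
        return True  # nothing changes
    if old_type == "null":
        return True  # NullType may become any other type
    if old_type in UPCAST_ORDER and new_type in UPCAST_ORDER:
        # Only widening along the chain is allowed, never narrowing.
        return UPCAST_ORDER.index(old_type) < UPCAST_ORDER.index(new_type)
    return False


print(can_evolve("byte", "integer"))
print(can_evolve("integer", "string"))
```

Any change for which this predicate returns False falls into the category that needs the data to be rewritten.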

Other changes, which are not allowed under schema evolution, require the schema and the data to be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer and the new schema makes it a string, all of the Parquet (data) files need to be rewritten. Such changes include:

  • dropping a column
  • changing the data type of an existing column (in place)
  • renaming columns whose names differ only by case (for example, "Foo" and "foo")

Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:

  • adding columns
  • changing column comments
  • setting table properties that control the table's behavior, such as the retention duration of the transaction log.

How Is Schema Evolution Useful?

Schema evolution can be used whenever you intend to change the schema of your table (as opposed to when you accidentally added columns to your DataFrame that should not be there). It is the easiest way to migrate your schema because it automatically adds the correct column names and data types without you having to declare them explicitly.

Conclusion

Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly, which lets them make better business decisions.

On the other hand, schema evolution complements enforcement by making it easy for intended schema changes to happen automatically. After all, adding a column should not be hard.

Schema enforcement is the yang to schema evolution's yin. Used together, these features make it easier than ever to block out the noise and tune in to the signal.

We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.

Other articles in this series:

Diving into Delta Lake: Unpacking the Transaction Log

Related articles

Productionizing Machine Learning with Delta Lake

What Is a Data Lake?

Source: www.habr.com
