Hello, Habr! I present to you a translation of an article by Burak Yavuz, Brenner Heintz and Denny Lee, prepared ahead of the launch of a course from OTUS.

Data, like our experience, is constantly accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, which defines how we categorize and process new information.
This brings us to schema management. As business problems and requirements evolve over time, so does the structure of your data. Delta Lake makes it easy to introduce new dimensions as the data changes. Users have access to simple semantics for controlling the schema of their tables. These tools include Schema Enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and Schema Evolution, which automatically adds new columns of valuable data where they belong. In this article, we take a closer look at using these tools.
Understanding table schemas
Every DataFrame in Apache Spark contains a schema that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table's schema is stored in JSON format inside the transaction log.
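To give a feel for what such a stored schema might look like, here is a minimal sketch in plain Python. It assumes the JSON layout Spark produces when it serializes a StructType (a "struct" with a list of "fields"). The column names are borrowed from the loan example later in this article, and the helper name column_types is hypothetical.

```python
import json

# Assumed layout: the JSON form Spark uses when it serializes a StructType.
# The columns mirror the loan_by_state_delta example used later in the article.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "addr_state", "type": "string", "nullable": True, "metadata": {}},
        {"name": "count", "type": "long", "nullable": True, "metadata": {}},
    ],
})


def column_types(schema_json: str) -> dict:
    """Return a {column name: type name} mapping parsed from a schema JSON string."""
    schema = json.loads(schema_json)
    return {field["name"]: field["type"] for field in schema["fields"]}


print(column_types(schema_string))
```

Because the schema travels with the log as ordinary JSON, any tool that can read JSON can inspect the column names and types without touching the data files.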
What is schema enforcement?
Schema Enforcement, also known as Schema Validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the host at a popular restaurant that accepts reservations only, it checks whether each column in the data being inserted is on its list of expected columns (in other words, whether each one has a "reservation") and rejects any write containing columns that are not on the list.
How does schema enforcement work?
Delta Lake uses schema validation on write, which means that every new write to a table is checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to tell the user about the mismatch.
Delta Lake uses the following rules to determine whether a write is compatible with a table. The DataFrame being written:
- cannot contain any additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table; those columns are simply assigned null values.
- cannot have column data types that differ from the column data types in the target table. If a column in the target table contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write from taking place.
- cannot contain column names that differ only by case. This means you cannot have columns named 'Foo' and 'foo' in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (which we have personally experienced at Databricks), we decided to add this restriction.
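The three rules above can be sketched as a small check in plain Python. This is only a toy model under stated assumptions, not Delta Lake's implementation: schemas are represented as simple {column name: type name} dicts, and the function name find_violations is hypothetical.

```python
def find_violations(table_schema: dict, data_schema: dict) -> list:
    """Toy model of the three rules; schemas are {column name: type name} dicts."""
    violations = []

    # Rule 3: column names in the incoming data must not differ only by case.
    seen = {}
    for name in data_schema:
        key = name.lower()
        if key in seen:
            violations.append(f"columns {seen[key]!r} and {name!r} differ only by case")
        else:
            seen[key] = name

    # Columns are matched case-insensitively, as described for Delta Lake above.
    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}
    for name, dtype in data_schema.items():
        expected = table_by_lower.get(name.lower())
        if expected is None:
            # Rule 1: no columns that the target table does not have.
            violations.append(f"extra column {name!r}")
        elif expected != dtype:
            # Rule 2: data types must match the target table.
            violations.append(f"column {name!r} is {dtype}, table expects {expected}")

    # Columns missing from the incoming data are fine: they would be filled with nulls.
    return violations


table = {"addr_state": "string", "count": "long"}
print(find_violations(table, {"addr_state": "string"}))
print(find_violations(table, {"addr_state": "string", "count": "long", "amount": "double"}))
print(find_violations(table, {"addr_state": "integer"}))
```

In this sketch a write would go ahead only when the returned list is empty; a non-empty list corresponds to the case where the whole transaction is cancelled and nothing is written.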
To illustrate, let's look at what happens in the code below when we try to append some newly calculated columns to a Delta Lake table that is not yet set up to accept them.
# Generate a DataFrame of loans that we will append to our Delta Lake table
loans = sql("""
SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
CAST(rand(10) * 10000 * count AS double) AS amount
FROM loan_by_state_delta
""")
# Print the original DataFrame's schema
original_loans.printSchema()
root
|-- addr_state: string (nullable = true)
|-- count: integer (nullable = true)
# Print the new DataFrame's schema
loans.printSchema()
root
|-- addr_state: string (nullable = true)
|-- count: integer (nullable = true)
|-- amount: double (nullable = true) # new column
# Attempt to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
    .mode("append") \
    .save(DELTALAKE_PATH)
Returns:
A schema mismatch detected when writing to the Delta table.
To enable schema migration, please set:
'.option("mergeSchema", "true")'
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.
Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write. To help identify which column (or set of columns) caused the mismatch, Spark prints both schemas in the stack trace for comparison.
How is schema enforcement useful?
Because it is such a strict check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed dataset that is ready for production or consumption. It is typically enforced on tables that directly feed:
- Machine learning algorithms
- BI dashboards
- Data analytics and visualization tools
- Any production system that requires highly structured, strongly typed, semantic schemas.
To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, you can read the article
Of course, schema enforcement can be used anywhere in your pipeline, but keep in mind that a failed write to a table can be frustrating when, for example, you forgot that you added another column to the incoming data.
Preventing data dilution
At this point, you might be asking yourself what all the fuss is about. After all, an unexpected "schema mismatch" error can sometimes trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to, so that I can write my DataFrame no matter what?
As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you do not enforce your schema, data type compatibility issues will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other nightmares. A much better approach is to stop these enemies at the gates, using schema enforcement, and deal with them in the open rather than later, when they are lurking in the dark depths of your production code.
Schema enforcement gives you assurance that your table's schema will not change unless you explicitly approve the change. This prevents data dilution, which can occur when new columns are appended so frequently that formerly valuable, concise tables lose their meaning and usefulness under a flood of data. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement does exactly what it was designed to do: it keeps you honest and your tables clean.
If, on further reflection, you decide that you really do need to add a new column, that is not a problem; it is a one-line fix, described below. The solution is schema evolution!
What is schema evolution?
Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. It is most commonly used during an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
How does schema evolution work?
Following the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected because of a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.
# Add the mergeSchema option
loans.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(DELTALAKE_SILVER_PATH)
To view the plot, run the following Spark SQL query.
# Create a plot with the new column to confirm that the write succeeded
%sql
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`) DESC
LIMIT 10
Alternatively, you can set this option for the entire Spark session by adding spark.databricks.delta.schema.autoMerge = True to your Spark configuration. Use this with caution, however, because schema enforcement will no longer warn you about unintended schema mismatches.
When the mergeSchema option is included in the query, all columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and they are likewise appended to the end of their respective struct columns.
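The "append to the end" behavior described above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions rather than Delta Lake's actual merge logic: a schema is a dict of {name: type}, a nested struct is itself a dict, type conflicts are ignored, and the names merge_schema and borrower are hypothetical.

```python
def merge_schema(table_schema: dict, data_schema: dict) -> dict:
    """Toy model: return the table schema with new fields appended at the end.

    Schemas are {name: type} dicts; a nested struct is represented as a dict.
    Existing columns keep their position and type (type conflicts are not handled).
    """
    merged = dict(table_schema)  # Python dicts preserve insertion order
    for name, dtype in data_schema.items():
        if name not in merged:
            # New column: appended after all existing columns.
            merged[name] = dtype
        elif isinstance(merged[name], dict) and isinstance(dtype, dict):
            # Nested struct: new fields go to the end of that struct.
            merged[name] = merge_schema(merged[name], dtype)
    return merged


table = {"addr_state": "string", "borrower": {"id": "long"}, "count": "long"}
data = {"amount": "double", "addr_state": "string",
        "borrower": {"id": "long", "age": "integer"}}

merged = merge_schema(table, data)
print(list(merged))
print(list(merged["borrower"]))
```

Note that amount ends up last even though it comes first in the incoming data: in this model the existing column order is never disturbed, which is why readers that rely on the old columns keep working.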
Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column with this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.
The following types of schema changes are eligible for schema evolution during table appends or overwrites:
- Adding new columns (this is the most common scenario)
- Changing data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
Other changes, which are not eligible for schema evolution, require the schema and the data to be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer and the new schema makes it a string, then all of the Parquet (data) files would need to be rewritten. Such changes include:
- dropping a column
- changing an existing column's data type (in place)
- renaming columns that differ only by case (e.g. "Foo" and "foo")
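The two lists above amount to a decision rule, which can be sketched in plain Python. This is a simplified model under stated assumptions, not Delta Lake's own logic: schemas are {column name: type name} dicts, new_schema is the schema you want the table to have after the change, skipping a step in the upcast chain is assumed to be allowed, and the helper names are hypothetical.

```python
# Upcast chain named in the article: ByteType -> ShortType -> IntegerType.
UPCAST_ORDER = ["ByteType", "ShortType", "IntegerType"]


def type_change_allowed(old: str, new: str) -> bool:
    """True if the type change is one of those listed as eligible for evolution."""
    if old == new or old == "NullType":
        return True
    if old in UPCAST_ORDER and new in UPCAST_ORDER:
        return UPCAST_ORDER.index(old) < UPCAST_ORDER.index(new)
    return False


def required_option(table_schema: dict, new_schema: dict) -> str:
    """Return 'none', 'mergeSchema' or 'overwriteSchema' for the desired change."""
    table_by_lower = {n.lower(): (n, t) for n, t in table_schema.items()}
    new_lower = {n.lower() for n in new_schema}

    # Dropping a column is not eligible for schema evolution.
    if any(n.lower() not in new_lower for n in table_schema):
        return "overwriteSchema"

    needs_merge = False
    for name, dtype in new_schema.items():
        match = table_by_lower.get(name.lower())
        if match is None:
            needs_merge = True  # brand-new column
            continue
        old_name, old_type = match
        if old_name != name:
            return "overwriteSchema"  # rename that differs only by case
        if old_type != dtype:
            if not type_change_allowed(old_type, dtype):
                return "overwriteSchema"  # in-place type change
            needs_merge = True
    return "mergeSchema" if needs_merge else "none"


table = {"Foo": "IntegerType", "n": "ByteType"}
print(required_option(table, {"Foo": "IntegerType", "n": "ByteType", "amount": "DoubleType"}))
print(required_option(table, {"Foo": "StringType", "n": "ByteType"}))
```

The first call models adding a column, which evolution can handle; the second models the "Foo" integer-to-string example, which needs a full overwrite because the existing Parquet files would have to be rewritten.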
Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:
- adding columns
- changing column comments
- setting table properties that define the behavior of the table, such as the retention period of the transaction log.
How is schema evolution useful?
Schema evolution can be used whenever you intend to change the schema of your table (as opposed to cases where you accidentally added columns to your DataFrame that should not be there). It is the easiest way to migrate your schema because it automatically adds the correct column names and data types, without you having to declare them explicitly.
Conclusion
Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly and concisely, which lets them make better business decisions.
On the other hand, schema evolution complements enforcement by making intended schema changes easy to apply automatically. After all, adding a column should not be that hard.
Schema enforcement is the yin to schema evolution's yang. Used together, these features make it easier than ever to suppress the noise and tune in to the signal.
We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.
Source: www.habr.com
