Diving Into Delta Lake: Schema Enforcement and Evolution

Hi, Habr! Here is a translation of the article "Diving Into Delta Lake: Schema Enforcement & Evolution" by Burak Yavuz, Brenner Heintz and Denny Lee, prepared ahead of the start of the Data Engineer course from OTUS.

Data, like our experience, is constantly accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, which defines how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so does the structure of your data. Delta Lake makes it easy to incorporate new dimensions as the data changes. Users have access to simple semantics for controlling the schema of their tables. These tools include Schema Enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and Schema Evolution, which allows new columns of valuable data to be added automatically when appropriate. In this article, we take a deeper look at using these tools.

Understanding Table Schemas

Every DataFrame in Apache Spark contains a schema, which defines the shape of the data: data types, columns, and metadata. With Delta Lake, the table's schema is saved in JSON format inside the transaction log.
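
As a rough illustration of what a schema serialized as JSON can look like, the sketch below parses a schema string in the shape Spark uses for a struct type and lists its columns. The JSON literal is a hand-written example based on the loans table used later in this article; it is an assumption for illustration, not an excerpt from a real transaction log.

```python
import json

# Hand-written example in the shape of Spark's struct-type JSON;
# not copied from a real Delta Lake transaction log.
schema_string = """
{
  "type": "struct",
  "fields": [
    {"name": "addr_state", "type": "string", "nullable": true, "metadata": {}},
    {"name": "count", "type": "long", "nullable": true, "metadata": {}}
  ]
}
"""

schema = json.loads(schema_string)

# Collect (column name, data type) pairs in schema order.
columns = [(field["name"], field["type"]) for field in schema["fields"]]

for name, data_type in columns:
    print(f"{name}: {data_type}")
```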

What is schema enforcement?

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes that do not match the table's schema. Like the host at the front desk of a reservations-only restaurant, it checks whether each column in the data being inserted into the table is on its list of expected columns (in other words, whether there is a "reservation" for each one), and rejects any write containing columns that are not on the list.

How does schema enforcement work?

Delta Lake uses schema validation on write, which means that every new write to a table is checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch.
Delta Lake uses the following rules to determine whether a write is compatible with a table. The DataFrame to be written:

  • cannot contain any additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table; those columns are simply assigned null values.
  • cannot have column data types that differ from the column data types in the target table. If a column in the target table contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write from taking place.
  • cannot contain column names that differ only by case. This means that you cannot have columns named 'Foo' and 'foo' defined in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (something we ran into ourselves at Databricks), we decided to add this restriction.
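
The three rules above can be sketched in plain Python. This is a simplified model for illustration, not Delta Lake's implementation: schemas are represented as hypothetical dicts of column name to type name, and nullability, metadata, and nested types are ignored.

```python
def check_write(table_schema, data_schema):
    """Return a list of rule violations; an empty list means the write is accepted.

    Both arguments are dicts mapping column name -> type name
    (a hypothetical simplification of a real schema).
    """
    violations = []

    # Column names in the incoming data must not differ only by case.
    seen = {}
    for name in data_schema:
        key = name.lower()
        if key in seen:
            violations.append(f"columns '{seen[key]}' and '{name}' differ only by case")
        else:
            seen[key] = name

    # Match incoming columns against the table schema case-insensitively.
    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}

    for name, dtype in data_schema.items():
        expected = table_by_lower.get(name.lower())
        if expected is None:
            # Extra columns are rejected.
            violations.append(f"column '{name}' is not in the table schema")
        elif expected != dtype:
            # Data types must match the table's column types.
            violations.append(f"column '{name}' has type {dtype}, table expects {expected}")

    # Columns missing from the incoming data are allowed (they become null).
    return violations


table = {"addr_state": "string", "count": "long"}

# A subset of the table's columns is accepted.
print(check_write(table, {"addr_state": "string"}))

# An extra column is rejected.
print(check_write(table, {"addr_state": "string", "count": "long", "amount": "double"}))
```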

To illustrate, let's look at what happens in the code below when we try to append some newly calculated columns to a Delta Lake table that is not yet set up to accept them.

# Generate a DataFrame of loans that we'll append to our Delta Lake table
loans = sql("""
            SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
            CAST(rand(10) * 10000 * count AS double) AS amount
            FROM loan_by_state_delta
            """)

# Show the original DataFrame's schema
original_loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
 
# Show the new DataFrame's schema
loans.printSchema()
 
root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)
  |-- amount: double (nullable = true) # new column
 
# Attempt to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
           .mode("append") \
           .save(DELTALAKE_PATH)

Returns:

A schema mismatch detected when writing to the Delta table.
 
To enable schema migration, please set:
'.option("mergeSchema", "true")'
 
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
 
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
 
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write from occurring. To help identify which column (or set of columns) caused the mismatch, Spark prints out both schemas from the stack trace for comparison.
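
When the two printed schemas are long, comparing them by eye gets tedious. The helper below is a small sketch that parses the "Table schema" and "Data schema" sections of a message like the one above and reports the columns present only in the data. The function name `parse_schemas` is hypothetical, and the parsing assumes the line layout shown in this article.

```python
# Excerpt of the error message shown above.
error_message = """\
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)

Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
"""


def parse_schemas(message):
    """Parse the 'Table schema' and 'Data schema' sections into {column: type} dicts."""
    schemas = {}
    current = None
    for raw_line in message.splitlines():
        # Tolerate both "-- col" and " |-- col" prefixes.
        line = raw_line.strip().lstrip("| ")
        if line.endswith("schema:"):
            current = line[:-1].lower()  # "table schema" or "data schema"
            schemas[current] = {}
        elif line.startswith("--") and current is not None:
            name, rest = line[2:].split(":", 1)
            schemas[current][name.strip()] = rest.split("(")[0].strip()
    return schemas


schemas = parse_schemas(error_message)
table_columns = schemas["table schema"]
data_columns = schemas["data schema"]

# Columns that the incoming data has but the table does not.
extra_columns = [name for name in data_columns if name not in table_columns]
print(extra_columns)
```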

How is schema enforcement useful?

Because it is such a strict check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It is typically enforced on tables that directly feed:

  • Machine learning algorithms
  • BI dashboards
  • Data analytics and visualization tools
  • Any production system that requires highly structured, strongly typed semantic schemas.

To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, see the article Productionizing Machine Learning With Delta Lake.

Of course, schema enforcement can be used anywhere in your pipeline, but keep in mind that streaming writes to a table can become frustrating in this case because, for example, you forgot that you added an extra column to the incoming data.

Preventing data dilution

By now you may be wondering what all the fuss is about. After all, an unexpected "schema mismatch" error can sometimes trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to, so that I can write my DataFrame no matter what?

As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you do not enforce your schema, data type compatibility issues will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other scary things that go bump in the night. A much better approach is to stop these enemies at the gates with schema enforcement and deal with them in the daylight, rather than later, when they are lurking in the dark corners of your production code.

Schema enforcement gives you the assurance that your table's schema will not change unless you approve the change. This prevents data dilution, which can occur when new columns are appended so frequently that formerly valuable, concise tables lose their meaning and usefulness under the flood of data. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement does exactly what it was designed to do: it keeps you honest and your tables clean.

If, on further consideration, you decide that you really do need to add that new column, no problem: the fix is a one-liner, shown below. The solution is schema evolution!

What is schema evolution?

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. Most commonly, it is used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

How does schema evolution work?

Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected because of a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your Spark .write or .writeStream command.

# Add the mergeSchema option
loans.write.format("delta") \
           .option("mergeSchema", "true") \
           .mode("append") \
           .save(DELTALAKE_SILVER_PATH)

To view the plot, run the following Spark SQL query:

# Create a plot with the new column to confirm that the write was successful
%sql
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`)
DESC LIMIT 10

Alternatively, you can set this option for the entire Spark session by adding spark.databricks.delta.schema.autoMerge = True to your Spark configuration. Use it with caution, though, because schema enforcement will then no longer warn you about unintended schema mismatches.

When the mergeSchema option is included in your query, any columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of the write transaction. Nested fields can also be added, and these are appended to the end of their respective struct columns.
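
The "append to the end" behavior can be modeled with a short sketch. This is an illustrative simplification rather than Delta Lake's merge logic: schemas are hypothetical dicts of column name to type name, a nested dict stands in for a struct column, and the function name `merge_schemas` is made up for this example.

```python
def merge_schemas(table_schema, data_schema):
    """Return a new schema with columns from data_schema appended to table_schema.

    Dicts preserve insertion order, so existing columns keep their position
    and new columns land at the end.
    """
    merged = dict(table_schema)
    for name, dtype in data_schema.items():
        if name not in merged:
            # A new top-level column goes last.
            merged[name] = dtype
        elif isinstance(merged[name], dict) and isinstance(dtype, dict):
            # New nested fields go last within their struct.
            merged[name] = merge_schemas(merged[name], dtype)
        elif merged[name] != dtype:
            raise ValueError(f"incompatible types for column '{name}'")
    return merged


table = {"addr_state": "string", "count": "long"}
data = {"addr_state": "string", "count": "long", "amount": "double"}

merged = merge_schemas(table, data)
print(list(merged))
```

Returning a new dict instead of mutating `table` keeps the original schema available for comparison after the merge.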

Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column of this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

  • Adding new columns (this is the most common scenario)
  • Changing data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
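
The type-change rule in the list above can be written as a small predicate. This sketch only encodes the transitions named in the text, using type names as plain strings; the function name `can_evolve_type` is hypothetical and the check is not taken from Delta Lake's source.

```python
# Widening order named in the text: ByteType -> ShortType -> IntegerType.
INTEGRAL_WIDENING = ["ByteType", "ShortType", "IntegerType"]


def can_evolve_type(current, new):
    """Return True if changing a column from `current` to `new` fits the listed rules."""
    if current == new:
        return True
    if current == "NullType":
        # NullType may become any other type.
        return True
    if current in INTEGRAL_WIDENING and new in INTEGRAL_WIDENING:
        # Only widening (moving right in the list) is allowed.
        return INTEGRAL_WIDENING.index(current) < INTEGRAL_WIDENING.index(new)
    return False


print(can_evolve_type("ByteType", "IntegerType"))
print(can_evolve_type("IntegerType", "ByteType"))
```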

Other changes, which are not eligible for schema evolution, require that the schema and data be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer and the new schema makes it a string, then all of the Parquet (data) files would need to be rewritten. Such changes include:

  • dropping a column
  • changing an existing column's data type (in place)
  • renaming columns whose names differ only by case (for example, "Foo" and "foo")

Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:

  • adding columns
  • changing column comments
  • setting table properties that define the behavior of the table, such as the retention duration of the transaction log.

How is schema evolution useful?

Schema evolution can be used whenever you intend to change the schema of your table (as opposed to when you have accidentally added columns to your DataFrame that should not be there). It is the easiest way to migrate your schema because it automatically adds the correct column names and data types, without your having to declare them explicitly.

Conclusion

Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly and concisely, which lets them make better business decisions.

On the flip side, schema evolution complements enforcement by making intended schema changes easy to apply automatically. After all, it should not be hard to add a column.

Schema enforcement is the yang to schema evolution's yin. Used together, these features make it easier than ever to suppress the noise and tune in to the signal.

We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.

Other articles in this series:

Diving Into Delta Lake: Unpacking the Transaction Log

Related articles

Productionizing Machine Learning With Delta Lake

What is a data lake?

Learn more about the course

Source: www.habr.com
