Diving Into Delta Lake: Schema Enforcement and Evolution

Hi, Habr! Here is a translation of the article "Diving Into Delta Lake: Schema Enforcement & Evolution" by Burak Yavuz, Brenner Heintz, and Denny Lee, prepared ahead of the start of the Data Engineer course from OTUS.

Data, like our experience, is always accumulating and evolving. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions: new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, which defines how we categorize and process new information.

That brings us to schema management. As business problems and requirements evolve over time, so does the structure of your data. Delta Lake makes it easy to incorporate new dimensions as the data changes. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and schema evolution, which lets them automatically add new columns of rich data where those columns belong. In this article, we dive deeper into using these tools.

Understanding Table Schemas

Every DataFrame in Apache Spark contains a schema that defines the shape of the data, such as data types, columns, and metadata. With Delta Lake, the table's schema is saved in JSON format inside the transaction log.
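To make the JSON storage concrete, here is a minimal sketch in plain Python that pulls the column list out of one transaction log entry. The sample entry is hand-written for illustration and is not copied from a real table. It assumes the layout used by Delta commit files, where a `metaData` action carries the schema as a JSON string in its `schemaString` field; `schema_fields` is a hypothetical helper name.

```python
import json

# Hand-written sample of a single `metaData` action line, shaped like the
# entries in a Delta table's _delta_log/*.json commit files (assumption:
# the values below are illustrative, not taken from a real table).
SAMPLE_COMMIT_LINE = json.dumps({
    "metaData": {
        "id": "00000000-0000-0000-0000-000000000000",
        "format": {"provider": "parquet", "options": {}},
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [
                {"name": "addr_state", "type": "string", "nullable": True, "metadata": {}},
                {"name": "count", "type": "long", "nullable": True, "metadata": {}},
            ],
        }),
        "partitionColumns": [],
    }
})


def schema_fields(commit_line):
    """Return (name, type, nullable) tuples from a metaData action line."""
    action = json.loads(commit_line)
    # The schema is itself a JSON document stored as a string, so decode twice.
    schema = json.loads(action["metaData"]["schemaString"])
    return [(f["name"], f["type"], f["nullable"]) for f in schema["fields"]]


for name, dtype, nullable in schema_fields(SAMPLE_COMMIT_LINE):
    print(f"{name}: {dtype} (nullable = {str(nullable).lower()})")
```

Against a real table you would read the lines of a commit file instead of the sample string; the double decoding step stays the same.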

What Is Schema Enforcement?

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks whether each column in the data being inserted is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any write with columns that are not on the list.

How Does Schema Enforcement Work?

Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch.

Delta Lake uses the following rules to determine whether a write from a DataFrame to a table is compatible. The DataFrame to be written:

  • Cannot contain any additional columns that are not present in the target table's schema. Conversely, it is fine if the incoming data does not contain every column in the table; those columns are simply assigned null values.
  • Cannot have column data types that differ from the column data types in the target table. If a target table column contains StringType data but the corresponding column in the DataFrame contains IntegerType data, schema enforcement raises an exception and prevents the write from taking place.
  • Cannot contain column names that differ only by case. This means you cannot have columns named 'Foo' and 'foo' defined in the same table. While Spark can be used in case-sensitive or case-insensitive (the default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema. Parquet is case-sensitive when storing and returning column information. To avoid potential mistakes, data corruption, or data loss (which we have personally experienced at Databricks), we decided to add this restriction.
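The three rules can be modeled in a few lines of plain Python. This is only a sketch of the logic described above, not Delta Lake's implementation: schemas are simplified to dicts of column name to type name, and `check_write` is a hypothetical helper.

```python
def check_write(table_schema, data_schema):
    """Return a list of rule violations; an empty list means the write is allowed.

    Both schemas are plain dicts mapping column name -> type name. This is a
    simplified model of the three rules, not Delta Lake's own code.
    """
    problems = []

    # Rule 3: column names may not differ only by case.
    seen = {}
    for name in data_schema:
        key = name.lower()
        if key in seen:
            problems.append(f"columns {seen[key]!r} and {name!r} differ only by case")
        seen[key] = name

    # Names are compared case-insensitively, mirroring how the schema is stored.
    table_types = {name.lower(): dtype for name, dtype in table_schema.items()}
    for name, dtype in data_schema.items():
        expected = table_types.get(name.lower())
        if expected is None:
            # Rule 1: no columns beyond those in the target table.
            problems.append(f"extra column {name!r}")
        elif expected != dtype:
            # Rule 2: data types must match the target table.
            problems.append(f"column {name!r} is {dtype}, table expects {expected}")
    return problems


table = {"addr_state": "string", "count": "long"}
# A DataFrame that omits a table column is fine: the gap is filled with nulls.
print(check_write(table, {"addr_state": "string"}))
# An extra column is reported as a violation.
print(check_write(table, {"addr_state": "string", "count": "long", "amount": "double"}))
```

Returning a list of problems rather than raising on the first one is a choice made for this sketch, so that all mismatches can be seen at once.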

To illustrate, take a look at what happens in the code below when we attempt to append some newly calculated columns to a Delta Lake table that is not yet set up to accept them.

# Generate a DataFrame of loans that we'll append to our Delta Lake table
loans = spark.sql("""
            SELECT addr_state, CAST(rand(10)*count as bigint) AS count,
            CAST(rand(10) * 10000 * count AS double) AS amount
            FROM loan_by_state_delta
            """)

# Print the original DataFrame schema
original_loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: integer (nullable = true)

# Print the new DataFrame schema
loans.printSchema()

root
  |-- addr_state: string (nullable = true)
  |-- count: long (nullable = true)
  |-- amount: double (nullable = true) # new column

# Attempt to append the new DataFrame (with the new column) to the existing table
loans.write.format("delta") \
           .mode("append") \
           .save(DELTALAKE_PATH)

Returns:

A schema mismatch detected when writing to the Delta table.
 
To enable schema migration, please set:
'.option("mergeSchema", "true")'
 
Table schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
 
Data schema:
root
-- addr_state: string (nullable = true)
-- count: long (nullable = true)
-- amount: double (nullable = true)
 
If Table ACLs are enabled, these options will be ignored. Please use the ALTER TABLE command for changing the schema.

Rather than automatically adding the new columns, Delta Lake enforces the schema and stops the write from occurring. To help identify which column (or columns) caused the mismatch, Spark prints out both schemas in the stack trace for comparison.

How Is Schema Enforcement Useful?

Because it is such a stringent check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It is typically enforced on tables that directly feed:

  • Machine learning algorithms
  • BI dashboards
  • Data analytics and visualization tools
  • Any production system that requires highly structured, strongly typed semantic schemas

To prepare their data for this final hurdle, many users employ a simple "multi-hop" architecture that progressively adds structure to their tables. To learn more, see the article Productionizing Machine Learning with Delta Lake.

Of course, schema enforcement can be used anywhere in your pipeline, but keep in mind that it can be frustrating to have a streaming write to a table fail because, for example, you forgot that you added a single column to the incoming data.

Preventing Data Dilution

At this point, you might be asking yourself, what's all the fuss about? After all, sometimes an unexpected "schema mismatch" error can trip you up in your workflow, especially if you are new to Delta Lake. Why not just let the schema change however it needs to so that I can write my DataFrame no matter what?

As the old saying goes, "an ounce of prevention is worth a pound of cure." At some point, if you don't enforce your schema, issues with data type compatibility will rear their ugly heads: seemingly homogeneous sources of raw data can contain edge cases, corrupted columns, malformed mappings, or other scary things that go bump in the night. A much better approach is to stop these enemies at the gates, using schema enforcement, and deal with them in the daylight rather than later on, when they are lurking in the shadowy recesses of your production code.

Schema enforcement provides peace of mind that your table's schema will not change unless you make the affirmative choice to change it. It prevents data dilution, which can occur when new columns are appended so frequently that formerly rich, concise tables lose their meaning and usefulness due to the data deluge. By encouraging you to be intentional, set high standards, and expect high quality, schema enforcement is doing exactly what it was designed to do: keeping you honest and your tables clean.

If, upon further review, you decide that you really did mean to add that new column, it is an easy, one-line fix, as discussed below. The solution is schema evolution!

What Is Schema Evolution?

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. Most commonly, it is used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

How Does Schema Evolution Work?

Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.

# Add the mergeSchema option
loans.write.format("delta") \
           .option("mergeSchema", "true") \
           .mode("append") \
           .save(DELTALAKE_SILVER_PATH)

To view the plot, execute the following Spark SQL statement.

%sql
-- Create a plot with the new column to confirm the write was successful
SELECT addr_state, sum(`amount`) AS amount
FROM loan_by_state_delta
GROUP BY addr_state
ORDER BY sum(`amount`) DESC
LIMIT 10

Alternatively, you can set this option for the entire Spark session by adding spark.databricks.delta.schema.autoMerge = True to your Spark configuration (recent Delta Lake documentation lists this setting as spark.databricks.delta.schema.autoMerge.enabled). Use this with caution, though, because schema enforcement will no longer warn you about unintended schema mismatches.

By including the mergeSchema option in your query, any columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of a write transaction. Nested fields can also be added, and these are likewise added to the end of their respective struct columns.
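The "append at the end" behavior, including the nested case, can be sketched in plain Python. This is a simplified model under stated assumptions, not Delta Lake's merge code: a schema is an ordered list of (name, type) pairs, a struct type is a nested list of such pairs, and type changes are simply rejected. `merge_schema` is a hypothetical helper.

```python
def merge_schema(table_fields, data_fields):
    """Merge two schemas given as ordered lists of (name, type) pairs.

    A struct type is represented as a nested list of (name, type) pairs.
    Simplified model: existing columns keep their position, new ones are
    appended at the end, and nested structs are merged the same way.
    """
    merged = list(table_fields)
    positions = {name: i for i, (name, _) in enumerate(merged)}
    for name, dtype in data_fields:
        if name not in positions:
            positions[name] = len(merged)
            merged.append((name, dtype))  # new column goes last
            continue
        i = positions[name]
        current = merged[i][1]
        if isinstance(current, list) and isinstance(dtype, list):
            # Both sides are structs: merge their fields recursively.
            merged[i] = (name, merge_schema(current, dtype))
        elif current != dtype:
            raise ValueError(f"cannot merge {name!r}: {current} vs {dtype}")
    return merged


table = [("addr_state", "string"), ("count", "long"), ("borrower", [("id", "long")])]
data = [("amount", "double"), ("addr_state", "string"),
        ("borrower", [("id", "long"), ("age", "integer")])]
for field in merge_schema(table, data):
    print(field)
```

Note that the position of `amount` in the incoming data does not matter: it still lands after the existing table columns, and `age` lands at the end of the `borrower` struct.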

Data engineers and data scientists can use this option to add new columns (perhaps a newly tracked metric, or a column of this month's sales figures) to their existing machine learning production tables without breaking existing models that rely on the old columns.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

  • Adding new columns (this is the most common scenario)
  • Changing data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
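As a rough illustration of the type rule in the list above, the check below encodes only the two cases the text names: NullType may become anything, and the chain ByteType -> ShortType -> IntegerType may only widen. It is a sketch with type names as plain strings, and `can_evolve` is a hypothetical helper rather than a Delta Lake function.

```python
# Widening chain named in the text; a type may be upcast to any type to its right.
UPCAST_CHAIN = ["ByteType", "ShortType", "IntegerType"]


def can_evolve(current, new):
    """Return True if changing a column from `current` to `new` is eligible."""
    if current == new:
        return True  # nothing to change
    if current == "NullType":
        return True  # NullType may become any other type
    if current in UPCAST_CHAIN and new in UPCAST_CHAIN:
        # Only widening (left to right in the chain) is allowed.
        return UPCAST_CHAIN.index(current) < UPCAST_CHAIN.index(new)
    return False


print(can_evolve("ByteType", "IntegerType"))
print(can_evolve("IntegerType", "ShortType"))
print(can_evolve("IntegerType", "StringType"))
```

Anything this check rejects falls into the group of changes that need the schema and data to be overwritten.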

Other changes, which are not eligible for schema evolution, require that the schema and data be overwritten by adding .option("overwriteSchema", "true"). For example, if the column "Foo" was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be rewritten. Those changes include:

  • Dropping a column
  • Changing an existing column's data type (in place)
  • Renaming column names that differ only by case (e.g., "Foo" and "foo")

Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:

  • Adding columns
  • Changing column comments
  • Setting table properties that define the behavior of the table, such as setting the retention duration of the transaction log

How Is Schema Evolution Useful?

Schema evolution can be used anytime you intend to change the schema of your table (as opposed to when you accidentally added columns to your DataFrame that shouldn't be there). It is the easiest way to migrate your schema because it automatically adds the correct column names and data types, without your having to declare them explicitly.

Conclusion

Schema enforcement rejects any new columns or other schema changes that are not compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest level of integrity and can reason about it clearly, allowing them to make better business decisions.

On the flip side, schema evolution complements enforcement by making it easy for intended schema changes to take place automatically. After all, it shouldn't be hard to add a column.

Schema enforcement is the yang to schema evolution's yin. When used together, these features make it easier than ever to block out the noise and tune in to the signal.

We would also like to thank Mukul Murthy and Pranav Anand for their contributions to this article.

Other articles in this series:

Diving Into Delta Lake: Unpacking the Transaction Log

Related articles

Productionizing Machine Learning with Delta Lake

What Is a Data Lake?

Source: www.habr.com
