Spark schemaEvolution in practice

Dear readers, good day!

In this article, the lead consultant of Neoflex's Big Data Solutions business area describes in detail the options for building data marts with a variable structure using Apache Spark.

In data analysis projects, the task of building data marts on top of loosely structured data comes up often.

Usually these are logs, or responses from various systems, saved as JSON or XML. The data is uploaded to Hadoop, and a data mart then has to be built from it. Access to the resulting data mart can be provided, for example, through Impala.

In this case, the schema of the target data mart is not known in advance. Moreover, the schema cannot be drawn up ahead of time either, because it depends on the data, and we are dealing with very loosely structured data.

For example, today the following response is logged:

{source: "app1", error_code: ""}

And tomorrow the following response comes from the same system:

{source: "app1", error_code: "error", description: "Network error"}

As a result, one more field, description, has to be added to the data mart, and nobody knows whether it will arrive or not.
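The effect of such drift on the set of data mart columns can be sketched in plain Python, without Spark. This is only an illustration: the helper name merged_fields is hypothetical, and the records are written as valid JSON with quoted keys, unlike the shorthand above.

```python
import json

def merged_fields(json_lines):
    """Return the sorted union of top-level keys seen across all records."""
    fields = set()
    for line in json_lines:
        fields.update(json.loads(line).keys())
    return sorted(fields)

# Today's and tomorrow's responses from the example above, as valid JSON.
day1 = ['{"source": "app1", "error_code": ""}']
day2 = ['{"source": "app1", "error_code": "error", "description": "Network error"}']

print(merged_fields(day1))          # columns known after the first day
print(merged_fields(day1 + day2))   # columns needed once the second day arrives
```

The second call returns one more column than the first, which is exactly the change the data mart schema has to absorb.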

Building a data mart on such data is a fairly standard task, and Spark has a number of tools for it. For parsing the source data there is support for both JSON and XML, and for a schema that is not known in advance there is schemaEvolution support.

At first glance, the solution looks simple. You take a folder with JSON and read it into a dataframe. Spark builds the schema and turns nested data into structures. Then everything is saved as Parquet, which Impala also supports, and the data mart is registered in the Hive metastore.

Everything seems simple.

However, the short examples in the documentation do not make it clear what to do about a number of problems that arise in practice.

The documentation describes an approach not for building a data mart, but for reading JSON or XML into a dataframe.

Namely, it simply shows how to read and parse JSON:

df = spark.read.json(path...)

This is enough to make the data available to Spark.

In practice, the scenario is much more complicated than just reading JSON files from a folder and creating a dataframe. The situation looks like this: a data mart already exists, new data arrives every day, and it has to be added to the data mart, keeping in mind that the schema may differ.

The usual scheme for building a data mart is as follows:

Step 1. The data is loaded into Hadoop, then reloaded daily and added to a new partition. The result is a folder with the source data, partitioned by day.

Step 2. During the initial load, this folder is read and parsed by Spark. The resulting dataframe is saved in a format suitable for analysis, for example Parquet, which can then be imported into Impala. This creates the target data mart with all the data accumulated up to that point.

Step 3. A load job is created that updates the data mart every day.
This raises the question of incremental loading, the need to partition the data mart, and the question of maintaining the overall schema of the data mart.

Let's take an example. Suppose the first step of building the repository has been implemented, and JSON files are uploaded to a folder.

Creating a dataframe from them and saving it as a data mart is not a problem. This is the very first step, and it is easy to find in the Spark documentation:

df = spark.read.option("mergeSchema", True).json(".../*") 
df.printSchema()

root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)
 |-- c: struct (nullable = true)
 |    |-- d: long (nullable = true)

Everything seems fine.

We have read and parsed the JSON. Next we save the dataframe as Parquet, registering it in Hive in any convenient way:

df.write.format("parquet").option('path','<External Table Path>').saveAsTable('<Table Name>')

We get a data mart.

But the next day, new data from the source is added. We have a folder with JSON and a data mart built from that folder. After the next batch of data is loaded from the source, the data mart is missing one day's worth of data.

The logical solution is to partition the data mart by day, so that a new partition can be added each following day. The mechanism for this is also well known: Spark lets you write partitions separately.

First, we do the initial load, saving the data as described above and adding only the partitioning. This action is called data mart initialization and is done only once:

df.write.partitionBy("date_load").mode("overwrite").parquet(dbpath + "/" + db + "/" + destTable)

The next day, we load only the new partition:

df.coalesce(1).write.mode("overwrite").parquet(dbpath + "/" + db + "/" + destTable +"/date_load=" + date_load + "/")

All that remains is to re-register the table in Hive to update the schema.
However, this is where the problems begin.

First problem. Sooner or later, the resulting Parquet will become unreadable. This is because Parquet and JSON treat empty fields differently.

Let's consider a typical situation. For example, yesterday this JSON arrived:

Day 1: {"a": {"b": 1}},

And today the same JSON looks like this:

Day 2: {"a": null}

Suppose we have two different partitions, each with one row.
When we read the entire source data, Spark is able to determine the type and understands that "a" is a field of type "struct" with a nested field "b" of type INT. But if each partition was saved separately, we get a Parquet with incompatible partition schemas:

df1 (a: <struct<"b": INT>>)
df2 (a: STRING NULLABLE)
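The mismatch can be reproduced with a toy type-inference routine in plain Python. This is a simplified sketch, not Spark's actual inference algorithm: the function names are hypothetical, and the rule that a null value falls back to "string" is an assumption taken from the df2 schema shown above.

```python
import json

def infer_type(value):
    """Toy type inference for a single JSON value (not Spark's real algorithm)."""
    if value is None:
        return "string"   # assumption: mirrors the STRING fallback shown above
    if isinstance(value, bool):
        return "boolean"  # checked before int, because bool is a subclass of int
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        inner = ",".join(f"{k}:{infer_type(v)}" for k, v in value.items())
        return f"struct<{inner}>"
    if isinstance(value, list):
        return f"array<{infer_type(value[0])}>" if value else "array<string>"
    return "string"

def partition_schema(json_line):
    """Infer a schema from a single-row partition."""
    return {k: infer_type(v) for k, v in json.loads(json_line).items()}

day1 = partition_schema('{"a": {"b": 1}}')
day2 = partition_schema('{"a": null}')

# Columns present in both partitions whose inferred types disagree.
conflicts = {k for k in day1.keys() & day2.keys() if day1[k] != day2[k]}
print(day1, day2, conflicts)
```

Inferred separately, the two partitions disagree on the type of "a", which is the conflict a reader hits when both partitions are opened as one table.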

This situation is well known, so an option was added specifically for it: when parsing the source data, drop empty fields:

df = spark.read.json("...", dropFieldIfAllNull=True)

In this case, the Parquet will consist of partitions that can be read together.
Those who have done this in practice will smile bitterly here, though. Why? Because two more situations are likely to arise. Or three. Or four. The first, which will almost certainly occur, is that numeric types will look different in different JSON files. For example, {intField: 1} and {intField: 1.1}. If such fields occur within one partition, schema merging reads everything correctly and ends up with the most precise type. But if they occur in different partitions, one will have intField: int and the other intField: double.

There is the following flag to handle this situation:

df = spark.read.json("...", dropFieldIfAllNull=True, primitivesAsString=True)
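The effect of casting every primitive to a string can be illustrated in plain Python. The sketch below only mimics the outcome on already-parsed records; the function name is hypothetical and this is not how Spark implements the option internally.

```python
import json

def primitives_as_strings(value):
    """Recursively turn numbers and booleans into strings; keep nulls and strings."""
    if isinstance(value, dict):
        return {k: primitives_as_strings(v) for k, v in value.items()}
    if isinstance(value, list):
        return [primitives_as_strings(v) for v in value]
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value)  # 1 -> "1", 1.1 -> "1.1", True -> "true"

rec1 = primitives_as_strings(json.loads('{"intField": 1}'))
rec2 = primitives_as_strings(json.loads('{"intField": 1.1}'))
print(rec1, rec2)
```

Both records now carry intField as a string, so partitions written on different days no longer disagree on the numeric type. The price is that numeric columns have to be cast back when they are queried.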

Now we have a folder with partitions that can be read into a single dataframe, and a valid Parquet for the whole data mart. Yes? No.

We must remember that we registered the table in Hive. Hive is not case-sensitive in field names, while Parquet is. Therefore, partitions with the schemas field1: int and Field1: int are the same for Hive, but not for Spark. Do not forget to convert field names to lowercase.
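A minimal sketch of the lowercase conversion, written in plain Python so that it also reports names that collide once case is ignored. The function name is hypothetical. For top-level columns of a dataframe, the same renaming could be applied with df.toDF(*[c.lower() for c in df.columns]); nested struct fields would need separate handling.

```python
from collections import defaultdict

def lowercase_columns(columns):
    """Lowercase column names and report names that differ only by case."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    collisions = {low: names for low, names in groups.items() if len(names) > 1}
    return list(groups), collisions

lowered, collisions = lowercase_columns(["field1", "Field1", "date_load"])
print(lowered)     # distinct lowercase names, in first-seen order
print(collisions)  # originals that Hive would treat as one column
```

A non-empty collisions dictionary means two source fields would map to the same Hive column and have to be reconciled before writing.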

After that, everything seems fine.

However, not everything is so simple. There is a second, also well-known problem. Since each new partition is saved separately, the partition folder will contain Spark service files, for example the _SUCCESS flag that marks a successful operation. This causes an error when trying to read the Parquet. To avoid this, configure Spark so that it does not add service files to the folder:

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("parquet.enable.summary-metadata", "false")
hadoopConf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
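To check whether a partition folder still contains such files, a small stand-alone helper can list them. This is a sketch under the assumption that service files follow the common convention of names starting with "_" or "."; the file names below are made up for the demonstration.

```python
import os
import tempfile

def service_files(partition_dir):
    """Return names of non-data files in a folder, e.g. _SUCCESS or .crc files."""
    return sorted(
        name for name in os.listdir(partition_dir)
        if name.startswith("_") or name.startswith(".")
    )

# Build a throwaway folder that imitates a freshly written partition.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("part-00000.snappy.parquet", "_SUCCESS", "._SUCCESS.crc"):
        open(os.path.join(tmp, name), "w").close()
    found = service_files(tmp)

print(found)
```

An empty result for a real partition folder would indicate that the two settings above took effect.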

It seems that now a new Parquet partition is added to the target data mart folder every day, containing the parsed data for that day. We have made sure in advance that there are no partitions with conflicting data types.

But we have a third problem. The overall schema is now unknown, and moreover the table in Hive has an incorrect schema, since each new partition has most likely distorted it.

The table needs to be re-registered. This can be done simply: read the Parquet of the data mart again, take the schema, and generate a DDL from it with which to re-register the folder in Hive as an external table, updating the schema of the target data mart.

We have a fourth problem. When we registered the table for the first time, we relied on Spark. Now we do it ourselves, and we have to remember that Parquet fields can begin with characters that are not allowed in Hive. For example, Spark puts rows that it could not parse into the "_corrupt_record" field. Such a field cannot be registered in Hive without being escaped.

Knowing this, we get the schema:

f_def = ""
for f in pf.dtypes:
  if f[0] != "date_load":
    f_def = f_def + "," + f[0].replace("_corrupt_record", "`_corrupt_record`") + " " + f[1].replace(":", "`:").replace("<", "<`").replace(",", ",`").replace("array<`", "array<") 
table_define = "CREATE EXTERNAL TABLE jsonevolvtable (" + f_def[1:] + " ) "
table_define = table_define + "PARTITIONED BY (date_load string) STORED AS PARQUET LOCATION '/user/admin/testJson/testSchemaEvolution/pq/'"
hc.sql("drop table if exists jsonevolvtable")
hc.sql(table_define)

The expression f[0].replace("_corrupt_record", "`_corrupt_record`") + " " + f[1].replace(":", "`:").replace("<", "<`").replace(",", ",`").replace("array<`", "array<") produces a safe DDL, that is, instead of:

create table tname (_field1 string, 1field string)

with field names like "_field1, 1field", a safe DDL is generated in which the field names are escaped: create table `tname` (`_field1` string, `1field` string).
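The same escaping can be packaged as two small functions that work on a plain list of (name, type) pairs, which is the shape of dtypes on a dataframe. This is a sketch with hypothetical function, table, and path names. Unlike the snippet above, it backtick-quotes every top-level column name, not just _corrupt_record.

```python
def quote_type(hive_type):
    """Backtick-quote struct field names inside a type string (same replace chain as above)."""
    return (hive_type.replace(":", "`:")
                     .replace("<", "<`")
                     .replace(",", ",`")
                     .replace("array<`", "array<"))

def build_ddl(table, dtypes, partition_col, location):
    """Build a CREATE EXTERNAL TABLE statement with escaped column names."""
    cols = ", ".join(
        f"`{name}` {quote_type(col_type)}"
        for name, col_type in dtypes
        if name != partition_col  # the partition column goes into PARTITIONED BY
    )
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"PARTITIONED BY ({partition_col} string) "
            f"STORED AS PARQUET LOCATION '{location}'")

# Hypothetical schema, table name and location, for illustration only.
dtypes = [("_corrupt_record", "string"),
          ("a", "struct<b:bigint,c:array<string>>"),
          ("date_load", "string")]
ddl = build_ddl("demo_table", dtypes, "date_load", "/tmp/demo_table")
print(ddl)
```

Note that the replace chain only understands struct and array types. A type such as map<string,int> would come out with stray backticks, so it would need its own handling.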

The question arises: how do we correctly get a dataframe with the full schema (pf in the code)? How do we get this pf? This is the fifth problem. Re-read the schema of all partitions from the folder with the Parquet files of the target data mart? This method is the safest, but it is heavy.

The schema is already in Hive. A new schema can be obtained by combining the schema of the whole table with that of the new partition. So we need to take the table schema from Hive and merge it with the schema of the new partition. This can be done by reading test metadata from Hive, saving it to a temporary folder, and using Spark to read both partitions at once.

In fact, we have everything we need: the original table schema in Hive and the new partition. We also have the data. All that remains is to obtain a new schema that combines the data mart schema with the new fields from the partition just created:

from pyspark.sql import HiveContext
from pyspark.sql.functions import lit
hc = HiveContext(spark)
df = spark.read.json("...", dropFieldIfAllNull=True)
df.write.mode("overwrite").parquet(".../date_load=12-12-2019")
pe = hc.sql("select * from jsonevolvtable limit 1")
pe.write.mode("overwrite").parquet(".../fakePartiton/")
pf = spark.read.option("mergeSchema", True).parquet(".../date_load=12-12-2019/*", ".../fakePartiton/*")
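What this merge is expected to produce can be sketched on plain dictionaries that map column names to type strings. This is an illustration of the idea only, with a hypothetical function name and made-up schemas. It is not Spark's mergeSchema implementation, which may fail outright on a type conflict instead of reporting it.

```python
def merge_schemas(table_schema, partition_schema):
    """Union two name->type mappings; report columns whose types disagree."""
    merged = dict(table_schema)
    conflicts = {}
    for name, col_type in partition_schema.items():
        if name not in merged:
            merged[name] = col_type           # new field from the partition
        elif merged[name] != col_type:
            conflicts[name] = (merged[name], col_type)
    return merged, conflicts

table_schema = {"source": "string", "error_code": "string"}
new_partition = {"source": "string", "error_code": "bigint", "description": "string"}

merged, conflicts = merge_schemas(table_schema, new_partition)
print(merged)
print(conflicts)
```

New fields extend the table schema, while a conflict such as error_code shows why the earlier steps (dropping all-null fields and casting primitives to strings) have to be applied before the partition is written.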

Next, we generate the table registration DDL, as in the previous snippet.
If the whole chain works correctly, that is, there was an initial load and the table was created correctly in Hive, we get an updated table schema.

And the last problem is that you cannot simply add a partition to a Hive table, because the table will be broken. You need to force Hive to repair its partition structure:

from pyspark.sql import HiveContext
hc = HiveContext(spark) 
hc.sql("MSCK REPAIR TABLE " + db + "." + destTable)
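As an alternative to repairing the whole table, Hive also accepts an explicit ALTER TABLE ... ADD PARTITION statement for a single partition. The sketch below only builds that statement as a string; the function name and the sample values are hypothetical, and the path layout is assumed to match the one used when writing partitions above.

```python
def add_partition_sql(db, table, date_load, base_path):
    """Build an ALTER TABLE statement that registers one date_load partition."""
    location = f"{base_path}/{db}/{table}/date_load={date_load}"
    return (f"ALTER TABLE {db}.{table} ADD IF NOT EXISTS "
            f"PARTITION (date_load='{date_load}') LOCATION '{location}'")

# Hypothetical values, for illustration only.
sql = add_partition_sql("demo_db", "demo_table", "12-12-2019", "/data/marts")
print(sql)
```

The resulting string could be passed to hc.sql(...) in the same way as the MSCK REPAIR TABLE command. Whether this is preferable depends on how many partitions change per run.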

The simple task of reading JSON and building a data mart on top of it turns into overcoming a number of implicit difficulties, and the solutions to them have to be found one by one. Although these solutions are simple, finding them takes a lot of time.

To implement the construction of the data mart, I had to:

  • Add partitions to the data mart, getting rid of service files
  • Deal with empty fields in the source data that Spark has typed
  • Cast simple types to string
  • Convert field names to lowercase
  • Separate the data upload from the table registration in Hive (DDL generation)
  • Remember to escape field names that may be incompatible with Hive
  • Learn how to update the table registration in Hive

Summing up, we note that the decision to build data marts is fraught with many pitfalls. Therefore, if difficulties arise during implementation, it is better to contact an experienced partner with a successful track record.

Thank you for reading this article. We hope you find the information useful.

Source: www.habr.com
