Why do we need different file formats?
A major performance bottleneck for HDFS-enabled applications such as MapReduce and Spark is the time spent finding, reading, and writing data. These problems get worse when large datasets are hard to manage, for example when the schema evolves rather than stays fixed, or when storage is constrained.
Processing big data increases the load on the storage subsystem, because Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the processors, the network, and the I/O system are loaded as well. As the volume of data grows, so does the cost of processing and storing it.
The various file formats were created to address these problems. A suitable format can provide:
- Faster read times.
- Faster write times.
- Splittable files.
- Schema evolution support.
- Advanced compression support.
Some file formats are intended for general use, others for more specialized use cases, and some are designed around particular data characteristics. So the range of choices is really quite wide.
Avro file format
Avro is widely used for data serialization. It is a row-based format for storing data in Hadoop. The schema is stored as JSON, so any program can easily read and interpret it. The data itself is stored in a binary format, which keeps it compact and efficient.
Avro's serialization system is language-neutral. Files can be processed in many languages (currently C, C++, C#, Java, Python, and Ruby).
A key feature of Avro is its robust support for data schemas that change over time, that is, evolve. Avro understands schema changes such as removed, added, and modified fields.
Avro supports a rich set of data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
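As a minimal sketch of what such a schema might look like, the snippet below builds an Avro-style record schema as a Python dictionary and prints it as JSON. The record and field names (`Employee`, `skills`, `status`, `address`) are hypothetical and chosen only for illustration. The snippet uses only the standard library and does not write an actual Avro file.

```python
import json

# Hypothetical Avro record schema: all names here are illustrative only.
employee_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        # An array of strings.
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # An enumerated type.
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["ACTIVE", "ON_LEAVE"]}},
        # A nested sub-record.
        {"name": "address", "type": {"type": "record", "name": "Address",
                                     "fields": [{"name": "city", "type": "string"}]}},
    ],
}

# Avro stores the schema as JSON text alongside the binary data.
schema_json = json.dumps(employee_schema, indent=2)
print(schema_json)
```

Because the schema is plain JSON, any program can parse it without Avro-specific tooling.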
This makes the format ideal for writing to the landing (transition) zone of a data lake, for the following reasons:
- Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in that case.
- Downstream systems can easily retrieve table schemas from the files, so there is no need to store schemas separately in an external metastore.
- Any change to the source schema is handled easily (schema evolution).
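The schema evolution mentioned above can be pictured with a simplified sketch. The function `resolve` below is a hypothetical helper, not part of any Avro library. It mimics how a reader's schema might be applied to a record written with an older schema: fields the reader no longer declares are dropped, and newly added fields are filled from their defaults. Real Avro schema resolution covers more cases than this.

```python
def resolve(record, reader_fields):
    """Project a decoded record onto the reader's field list.

    reader_fields is a list of dicts with a "name" and an optional "default".
    This is an illustrative simplification, not Avro's actual algorithm.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Written with an old schema that still had a "fax" field.
old_record = {"id": 1, "name": "emp1", "fax": "n/a"}

# The reader's newer schema drops "fax" and adds "department".
reader_fields = [
    {"name": "id"},
    {"name": "name"},
    {"name": "department", "default": "unknown"},
]

new_view = resolve(old_record, reader_fields)
print(new_view)
```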
Parquet file format
Parquet is an open-source file format for Hadoop that stores nested data structures in a flat columnar format.
Compared with the traditional row-oriented approach, Parquet is more efficient in terms of both storage and performance.
It is especially useful for queries that read particular columns from a wide table (one with many columns). Thanks to the file format, only the required columns are read, so I/O is kept to a minimum.
A short digression and explanation: to understand the Parquet file format in Hadoop better, let's look at what a column-based (columnar) format is. Such a format stores the values of each column together. Take the following table as an example:
| ID | Name | Department |
|----|------|------------|
| 1  | emp1 | d1         |
| 2  | emp2 | d2         |
| 3  | emp3 | d3         |
In a row-based format, the data is stored as follows:
| 1 | emp1 | d1 | 2 | emp2 | d2 | 3 | emp3 | d3 |
In a columnar file format, the same data is stored like this:
| 1 | 2 | 3 | emp1 | emp2 | emp3 | d1 | d2 | d3 |
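The two layouts can be reproduced with a few lines of Python. This is only an illustration of the ordering of values. Real formats add encoding, compression, and metadata on top.

```python
# The example table from the text, one tuple per row.
rows = [
    (1, "emp1", "d1"),
    (2, "emp2", "d2"),
    (3, "emp3", "d3"),
]

# Row-based layout: values are written record by record.
row_layout = [value for row in rows for value in row]

# Columnar layout: all values of a column are written together.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)     # [1, 'emp1', 'd1', 2, 'emp2', 'd2', 3, 'emp3', 'd3']
print(column_layout)  # [1, 2, 3, 'emp1', 'emp2', 'emp3', 'd1', 'd2', 'd3']
```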
A columnar format is more efficient when you need to query only a few columns of a table. Because the values of a column are stored next to each other, only the required columns are read, which keeps I/O operations to a minimum.
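For example, suppose you need only the NAME column. In a row-based format, every record has to be read and parsed to extract the name. In a columnar format, the reader can go straight to the NAME column, because all of its values are stored together.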
A column-oriented format therefore improves query performance: less seek time is needed to reach the required columns, and fewer I/O operations are performed because only the required columns are read.
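A rough way to see the difference is to look at where the NAME values sit in each layout. The sketch below assumes fixed-size values addressed by list position, which is a simplification. Real files work with byte offsets and variable-length values.

```python
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]
n_rows, n_cols = len(rows), len(rows[0])
name_index = 1  # position of the NAME column within a row

row_layout = [v for row in rows for v in row]
column_layout = [v for col in zip(*rows) for v in col]

# Row-based layout: the NAME values are scattered, one per record.
row_positions = [r * n_cols + name_index for r in range(n_rows)]
names_from_rows = [row_layout[p] for p in row_positions]

# Columnar layout: the NAME values form one contiguous range.
start = name_index * n_rows
names_from_columns = column_layout[start:start + n_rows]

print(row_positions)                   # [1, 4, 7]
print((start, start + n_rows))         # (3, 6)
print(names_from_rows == names_from_columns)  # True
```

In the row-based case a reader has to walk across every record to collect three values, while in the columnar case a single contiguous read is enough.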
One of Parquet's unique features is that it can store data with nested structures in this columnar fashion as well.
To understand the Parquet file format in Hadoop, you need to know the following terms:
- Row group: a logical horizontal partitioning of the data into rows. A row group consists of one column chunk for each column in the dataset.
- Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
- Page: column chunks are divided into pages that are written one after another. Pages share a common header, so a reader can skip the pages it does not need.
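The relationship between these three terms can be sketched as nested Python structures. The function `layout` and its parameters `rows_per_group` and `values_per_page` are hypothetical and exist only for this illustration. Actual Parquet writers size row groups and pages in bytes rather than by value counts.

```python
def chunked(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def layout(columns, rows_per_group, values_per_page):
    """Arrange columns as row groups -> column chunks -> pages.

    columns maps a column name to its list of values (all equal length).
    """
    n_rows = len(next(iter(columns.values())))
    row_groups = []
    for start in range(0, n_rows, rows_per_group):
        group = {}
        for name, values in columns.items():
            # One column chunk per column inside each row group.
            chunk = values[start:start + rows_per_group]
            # Each column chunk is split into pages written back to back.
            group[name] = chunked(chunk, values_per_page)
        row_groups.append(group)
    return row_groups


table = {
    "id": [1, 2, 3, 4, 5],
    "name": ["emp1", "emp2", "emp3", "emp4", "emp5"],
}
groups = layout(table, rows_per_group=4, values_per_page=2)
for index, group in enumerate(groups):
    print(index, group)
```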
The file header contains only the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.
The footer contains the following:
- File metadata, which includes the start locations of each column's metadata. A reader should first read the file metadata to find all the column chunks it is interested in, and then read those column chunks sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs.
- The length of the metadata (4 bytes).
- The magic number PAR1 (4 bytes).
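The header and footer layout described above can be sketched as follows. The helper names `build_fake_file` and `read_footer` are hypothetical, and the metadata here is a placeholder byte string. In a real Parquet file the metadata is a serialized structure that needs a proper Parquet library to decode. The sketch assumes the 4-byte length is stored as a little-endian unsigned integer.

```python
import struct

MAGIC = b"PAR1"


def build_fake_file(body, metadata):
    """Assemble bytes shaped like a Parquet file: header, body, footer."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_footer(data):
    """Return the raw metadata bytes located via the last 8 bytes of the file."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("magic number PAR1 not found")
    # The 4 bytes before the trailing magic number hold the metadata length.
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    meta_start = len(data) - 8 - meta_len
    if meta_start < len(MAGIC):
        raise ValueError("metadata length exceeds file size")
    return data[meta_start:len(data) - 8]


blob = build_fake_file(b"column chunks go here", b"placeholder-metadata")
footer = read_footer(blob)
print(footer)
```

Reading from the end of the file is what lets a reader locate the column chunks before touching any of the data.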
ORC file format
ORC stands for Optimized Row Columnar, an optimized row-columnar file format.
Advantages of the ORC format:
- Each task outputs a single file, which reduces the load on the NameNode.
- Support for Hive data types, including DateTime, decimal, and complex types (struct, list, map, and union).
- Concurrent reads of the same file by separate RecordReader processes.
- The ability to split files without scanning for markers.
- An upper bound on heap memory allocation for the reader and writer can be estimated from the information in the file footer.
- Metadata is stored in the Protocol Buffers binary serialization format, which allows fields to be added and removed.
ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.
An ORC file contains groups of rows called stripes, plus auxiliary information in the file footer. A postscript at the end of the file holds the compression parameters and the size of the compressed footer.
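The default stripe size is 250 MB (the figure given in the Hive documentation; the effective default depends on the Hive and ORC version). Such large stripes let reads from HDFS proceed more efficiently, in large contiguous blocks.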
The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. The count, min, max, and sum aggregates for each column are written there as well.
The stripe footer contains a directory of stream locations.
Row data is used when scanning the table.
Index data includes the minimum and maximum values of each column and the row positions within each column. ORC indexes are used only to select stripes and row groups, not to answer queries.
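The way min/max statistics help a reader skip data can be sketched as below. The functions `column_stats` and `stripes_to_read` are hypothetical names for this illustration, and each stripe is modeled as a plain list of values for a single numeric column. This is not how an ORC reader is implemented, only the idea behind stripe selection.

```python
def column_stats(values):
    """Aggregates of the kind kept per column: count, min, max, sum."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


def stripes_to_read(stats, low, high):
    """Indices of stripes whose [min, max] range overlaps [low, high]."""
    return [
        index for index, s in enumerate(stats)
        if s["max"] >= low and s["min"] <= high
    ]


# Three toy "stripes" holding one numeric column each.
stripes = [[3, 7, 9], [12, 15, 18], [21, 25, 30]]
stats = [column_stats(values) for values in stripes]

# Query: values between 10 and 20. Only the second stripe can match.
selected = stripes_to_read(stats, 10, 20)

# The statistics only narrow down what to read.
# The filter itself is still applied to the actual values.
matches = [v for i in selected for v in stripes[i] if 10 <= v <= 20]

print(selected)  # [1]
print(matches)   # [12, 15, 18]
```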
Comparison of the file formats
Avro vs Parquet
- Avro is a row-based storage format, whereas Parquet stores data in columns.
- Parquet is better suited to analytical queries: read operations and data queries are far more efficient than writes.
- Write operations are performed more efficiently in Avro than in Parquet.
- Avro handles schema evolution more maturely. Parquet supports only appending to the schema, whereas Avro supports richer evolution, that is, adding or modifying columns.
- Parquet is ideal for querying a subset of columns in a multi-column table. Avro is well suited to ETL operations in which all columns are queried.
ORC vs Parquet
- Parquet is better at storing nested data.
- ORC is better suited to predicate pushdown.
- ORC supports ACID properties.
- ORC compresses data more efficiently.
Source: habr.com