Every day, over a hundred million people come to Twitter to find out and talk about what is happening in the world. Every Tweet and every other user action generates an event that is available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.
We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data and to better understand and use the power of Twitter. This is how we democratize data analysis at Twitter.
As our internal data analysis tools and capabilities have improved, so have Twitter's services. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without consistent access to it.
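Last year we announced a collaboration with Google to move parts of our data infrastructure to Google Cloud Platform (GCP). Two Google Cloud tools are central to our effort to democratize data analysis:
- BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed and simplicity, with support for machine learning.
- Data Studio: a big data visualization tool with collaboration features similar to Google Docs.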
In this article, you will learn about our experience with these tools: what we have done, what we have learned, and what we will do next. Here we focus on batch and interactive analytics. Real-time analytics will be covered in the next article.
History of data warehousing at Twitter
Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, Twitter data analysis ran on Vertica and Hadoop, and we used Pig to create MapReduce Hadoop jobs. In 2012, we replaced Pig with Scalding, whose Scala API offered benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable with SQL, the learning curve was quite steep. Around 2016, we began using Presto as a SQL front end to Hadoop data. Spark offered a Python interface, which made it a good fit for ad hoc data science and machine learning.
Since 2018, we have used the following tools for data analysis and visualization:
- Scalding for production pipelines
- Scalding and Spark for ad hoc data analysis and machine learning
- Vertica and Presto for ad hoc and interactive SQL analysis
- Druid for interactive, exploratory, low-latency access to time-series metrics
- Tableau, Zeppelin, and Pivot for data visualization
We found that although these tools offer very powerful capabilities, it was difficult to make those capabilities available to a broader audience at Twitter. By expanding our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.
Google's BigQuery data warehouse
Several teams at Twitter have already incorporated BigQuery into some of their production pipelines. Drawing on their experience, we began evaluating BigQuery's potential for all Twitter use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to build systems for resource allocation, monitoring, and chargeback so that teams could use BigQuery effectively.
In November 2018, we released an alpha of BigQuery and Data Studio to the whole company. We offered Twitter staff some of our most frequently used tables with personal data scrubbed. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8,000 queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for working with data at Twitter.
Here is a high-level architecture diagram of our Google BigQuery data warehouse.
We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using an internal tool, Cloud Replicator. We then use Apache Airflow to build pipelines that load the data from GCS into BigQuery.
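As a rough illustration of the GCS-to-BigQuery load step, the sketch below assembles a `bq load` command line for Parquet files stored in GCS. It is only an assumption about what such a step might look like, not the actual Twitter pipeline: the function name, dataset, table, and bucket are hypothetical, and an orchestrator task (for example in Airflow) would be responsible for actually running the command.

```python
from typing import List


def build_bq_load_command(table: str, gcs_uri: str, replace: bool = False) -> List[str]:
    """Build the argv for loading Parquet files from GCS into a BigQuery table."""
    cmd = ["bq", "load", "--source_format=PARQUET"]
    if replace:
        # --replace overwrites the destination table instead of appending to it.
        cmd.append("--replace")
    cmd += [table, gcs_uri]
    return cmd


# Hypothetical dataset, table, and bucket names, for illustration only.
command = build_bq_load_command(
    table="analytics.user_events",
    gcs_uri="gs://example-bucket/user_events/2018-11-01/*.parquet",
    replace=True,
)
print(" ".join(command))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting issues.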
In the following sections, we describe our approach and what we have learned in the areas of ease of use, performance, data management, system health, and cost.
䜿ãããã
We found that users could get started with BigQuery easily, since it requires no software installation and is accessible through an intuitive web interface. However, users did need to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed training materials and tutorials to help users get started. With that basic understanding, users found it easy to navigate datasets, view schemas and table data, run simple queries, and visualize the results in Data Studio.
Our goal for data ingestion into BigQuery was to make it possible to load HDFS or GCS datasets seamlessly with a single click.
To transform data in BigQuery, users build simple SQL data pipelines with scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer.
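To make "multi-stage pipelines with dependencies" concrete, here is a minimal sketch of the ordering problem an orchestrator has to solve: each stage may run only after the stages it depends on. It uses Python's standard `graphlib` module (Python 3.9+) and does not touch Airflow or Cloud Composer. The stage names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the stages ordered so each runs after all stages it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical stages: each key maps to the set of stages it depends on.
pipeline = {
    "load_events": set(),
    "load_users": set(),
    "join_events_users": {"load_events", "load_users"},
    "daily_aggregates": {"join_events_users"},
}

order = execution_order(pipeline)
print(order)
```

A single scheduled query has no notion of upstream stages, which is why pipelines shaped like this one call for an orchestrator.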
Performance
BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required of a transactional database, or for the low-latency time-series analysis that Apache Druid provides.
That said, we analyzed more than 800 queries, each processing about 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across projects and tasks. To maintain performance for production use cases and interactive analysis, we had to clearly separate our production slot reservations from our ad hoc slot reservations. This strongly influenced our design for slot reservations and project hierarchy.
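The separation of production and ad hoc slot reservations can be pictured with a small model. This is a simplified sketch under stated assumptions, not BigQuery's scheduler and not Twitter's actual configuration: the project names and slot counts are hypothetical, and any sharing of idle slots between reservations is deliberately ignored.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Reservation:
    name: str
    slots: int


# Hypothetical slot counts and project names, for illustration only.
RESERVATIONS = {
    "production": Reservation("production", 2000),
    "adhoc": Reservation("adhoc", 1000),
}
PROJECT_TO_RESERVATION = {
    "prod-pipelines": "production",
    "analyst-sandbox": "adhoc",
    "marketing-reports": "adhoc",
}


def slots_available(project: str, in_use: Dict[str, int]) -> int:
    """Slots still free in the reservation that serves `project`."""
    reservation = RESERVATIONS[PROJECT_TO_RESERVATION[project]]
    return max(reservation.slots - in_use.get(reservation.name, 0), 0)


# Ad hoc queries saturate their own reservation while production stays unaffected.
usage = {"adhoc": 1000, "production": 500}
print(slots_available("analyst-sandbox", usage))  # 0
print(slots_available("prod-pipelines", usage))   # 1500
```

Because each project maps to exactly one reservation, a burst of exploratory queries can exhaust only the ad hoc pool.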
We will cover data management, system functionality, and cost in the next part of this translation, which will follow in a few days.
Source: habr.com