How to read this article: I apologize that the text turned out so long and chaotic. To save you time, I begin each chapter with a "What I learned" introduction that sums up the essence of the chapter in one or two sentences.
"Just show me the solution!" If you only want to see where I ended up, skip ahead to the chapter "Getting more inventive" — but I think the chapters about failure are more interesting and useful.
I was recently tasked with setting up a process for handling a large volume of raw DNA sequences (technically, a SNP chip). What we needed was to quickly pull the data for a given genetic position (called a SNP) for subsequent modeling and other tasks. Using R and AWK, I was able to clean up and organize the data in a natural way, dramatically speeding up query processing. This did not come easily and took many iterations. This article should help you avoid some of my mistakes, and shows what I finally ended up with.
First, some introductory explanations.
The data
Our university's genetic information processing center provided the data as 25 TB of TSVs. I received it split into 5 packages, compressed with Gzip, each containing roughly 240 files of about 2.5 GB. Each row held the data for one SNP from one person. In total, we were sent data on roughly 2.5 million SNPs from roughly 60,000 people. Besides the SNP information, the files contained numerous columns of numbers reflecting various characteristics, such as read intensity and the frequency of different alleles — around 30 columns with unique values in all.
The goal
As with any data management project, the most important thing was deciding how the data would be used. In this case, we would mostly be selecting models and workflows based on SNPs, which means we would only need data for one SNP at a time. I had to learn how to retrieve all the records pertaining to one of the 2.5 million SNPs as simply, quickly, and cheaply as possible.
How not to do it
To quote a fitting cliché:
I didn't fail a thousand times. I just discovered a thousand ways to avoid parsing a mass of data into an easily queried format.
First attempt
What I learned: there is no cheap way to parse 25 TB at a time.
Having taken a course on advanced big data processing methods at Vanderbilt University, I was sure this was in the bag. Setting up a Hive server to run over all the data and report the result would take a few hours at most. Since my data was stored in AWS S3, I used Athena, a service that lets you run Hive SQL queries against data in S3.
After pointing Athena at the data and its format, I ran a few tests with queries like:
select * from intensityData limit 10;
And quickly got back well-structured results. I was ready.
Until we tried to use the data for real work...
To test a model, I was asked to pull all the SNP information for a single SNP. I ran the query:
select * from intensityData
where snp = 'rs123456';
...and started waiting. After four minutes of waiting and several terabytes of data scanned, the result came back. Athena bills $5 for every terabyte of data scanned, so this single request cost $20 and four minutes of waiting. To run the model over all the data, we would have had to wait 38 years and pay $50 million. That obviously wasn't going to work for us.
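The arithmetic behind those figures is worth a quick sanity check. A minimal sketch, assuming Athena's advertised $5 per scanned terabyte, roughly 4 TB scanned per single-SNP query, and one query for each of the ~2.5 million SNPs (all three figures are assumptions for illustration):

```shell
# Back-of-the-envelope Athena cost math (assumed: $5 per scanned TB,
# ~4 TB scanned per single-SNP query, ~2.5 million SNPs in total).
per_tb_usd=5
scanned_tb=4
query_cost=$(awk -v p="$per_tb_usd" -v t="$scanned_tb" 'BEGIN { print p * t }')

num_snps=2500000
total_cost=$(awk -v c="$query_cost" -v n="$num_snps" 'BEGIN { printf "%d", c * n }')

echo "one query: \$$query_cost, all SNPs: \$$total_cost"
```

One query at $20 extrapolates to $50 million across the full set of SNPs, which is the order of magnitude quoted above.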
We need to use Parquet...
What I learned: be careful with the size of your Parquet files and their organization.
I first tried to fix the situation by converting all the TSVs to Parquet files.
I used AWS Glue to carry out this simple task.
Interestingly, snappy, the default (and recommended) compression type for Parquet, is not splittable. As a result, each executor was stuck with the task of unpacking and downloading the complete 3.5 GB dataset.
Understanding the problem
What I learned: sorting is hard, especially when the data is distributed.
It seemed to me that I now understood the essence of the problem. All I needed to do was sort the data by SNP rather than by person. Then several SNPs would be stored in one chunk of data, and Parquet's smart "open only if the value is in range" feature would finally be able to shine. Unfortunately, sorting billions of rows scattered across a cluster turned out to be a difficult task.
Me, fresh from a university algorithms course: "Relax, nobody actually cares about the computational complexity of all those sorting algorithms."
Me, trying to sort 20 TB of rows
Spark UI: "Why is this taking so long?" #datascience — Nick Strayer (@NicholasStrayer)
March 11, 2019
AWS absolutely refused to issue a refund on the grounds of "I am an absent-minded student." After I launched the sort on Amazon Glue, it ran for 2 days and then crashed.
What about partitioning?
What I learned: partitions in Spark must be balanced.
Then I hit upon the idea of partitioning the data by chromosome. There are 23 of them (plus several more, counting mitochondrial DNA and unmapped regions).
This would let me split the data into smaller chunks. Adding just one line to the Spark export function in the Glue script, partition_by = "chr",
was all it took to put the data into those buckets.
The genome is made up of numerous fragments called chromosomes.
Unfortunately, it didn't work out. Chromosomes differ in size, which means they hold different amounts of information. That means the tasks Spark sent to its workers were unbalanced and completed slowly, because some nodes finished early and sat idle. The tasks did finish, though. But when requesting a single SNP, the imbalance caused problems again. The cost of processing SNPs on the larger chromosomes (that is, exactly where we want to get data from) only dropped by about a factor of ten. A lot, but not enough.
What if we split into even smaller partitions?
What I learned: never, ever try to make 2.5 million partitions.
I decided to go all out and partition by each individual SNP. That guaranteed equal-sized partitions. IT WAS A BAD IDEA. I used Glue and innocently added the line partition_by = 'snp'.
The task started and began executing. Three days later I checked and saw that nothing had still been written to S3, so I killed the task. It turned out Glue had been writing intermediate files to a hidden location in S3 — a lot of files, perhaps millions. As a result, my mistake cost more than a thousand dollars and did not please my advisor.
Partitioning + sorting
What I learned: sorting is still hard, and so is tuning Spark.
In my last attempt at partitioning, I partitioned by chromosome and then sorted each partition. In theory, this would speed up each query, because the desired SNP data would have to sit within a few Parquet chunks inside a given range. Unfortunately, sorting even partitioned data turned out to be a difficult task. As a result, I switched to EMR for a custom cluster, using powerful C5.4xl instances and Sparklyr to build a more flexible workflow...
# Sparklyr snippet to partition by chr and sort w/in partition
raw_data %>%
  group_by(chr) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr')
  )
...but the task still never finished. I tried everything: increased the memory allocated to each query executor, used nodes with large amounts of memory, used broadcast variables — but every time these turned out to be half measures, and gradually the executors began to fail, one after another, until everything ground to a halt.
Update! It's running!
pic.twitter.com/agY4GU2ru5 — Nick Strayer (@NicholasStrayer)
May 15, 2019
Getting more inventive
What I learned: sometimes special data needs special solutions.
Every SNP has a position value — a number corresponding to how many bases along its chromosome it lies. This is a nice, natural way to organize the data. At first I wanted to partition by regions of each chromosome, for example positions 1-2000, 2001-4000, and so on. The problem is that SNPs are not evenly distributed along the chromosomes, so the group sizes would vary wildly.
Instead, I arrived at a breakdown by rank. Using data I had already downloaded, I ran a request for a list of the unique SNPs, their positions, and their chromosomes. I then sorted the data within each chromosome and collected the SNPs into groups (bins) of a given size — say, 1000 SNPs each. That gave me a mapping from SNP to group, per chromosome.
In the end I made groups (bins) of 75 SNPs; I'll explain why below.
snp_to_bin <- unique_snps %>%
  group_by(chr) %>%
  arrange(position) %>%
  mutate(
    rank = 1:n(),
    bin = floor(rank/snps_per_bin)
  ) %>%
  ungroup()
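The binning logic itself is tiny: rank the SNPs within each chromosome, then floor-divide the rank by the bin size. A minimal awk sketch of the same arithmetic, on hypothetical toy rows already sorted by chromosome and position (snps_per_bin = 2 here):

```shell
# bin = floor(rank / snps_per_bin) within each chromosome,
# mirroring the dplyr logic above (toy input, already sorted).
bins=$(printf 'chr1 100 rsA\nchr1 250 rsB\nchr1 900 rsC\nchr2 50 rsD\n' |
awk -v per_bin=2 '
{
  rank = ($1 == chr) ? rank + 1 : 1   # 1-based rank, restarting per chromosome
  chr = $1
  print $3, $1, int(rank / per_bin)   # floor(rank / snps_per_bin)
}')
echo "$bins"
```

Ranks 1 and 2 on chr1 land in bins 0 and 1, and the rank resets when chr2 begins, just as `group_by(chr)` does above.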
Trying Spark first
What I learned: joins in Spark are fast, but partitioning is still expensive.
I wanted to load this small (2.5 million row) data frame into Spark, join it with the raw data, and then partition by the newly added bin column.
# Join the raw data with the snp bins
data_w_bin <- raw_data %>%
  left_join(sdf_broadcast(snp_to_bin), by = 'snp_name') %>%
  group_by(chr_bin) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr_bin')
  )
I used sdf_broadcast(), which lets Spark know it should send the data frame to all the nodes. This is useful when the data is small and is needed by every task. Otherwise Spark tries to be clever and distributes the data only as needed, which can cause slowdowns.
And once again, my idea didn't work: the tasks ran for a while, completed the join, and then, like the executors launched by the earlier partitioning, started to fail.
Adding AWK
What I learned: don't sleep while they're teaching you the basics. Someone probably already solved your problem back in the 1980s.
Up to this point, the cause of all my failures with Spark was the jumble of data in the cluster. Maybe preprocessing could improve the situation. I decided to try splitting the raw text data by chromosome, hoping to hand Spark data that was already "pre-partitioned."
I searched StackOverflow for how to split a file by column values and found a great answer. With AWK you can split a file by the values of a column by doing the writing in the script itself, rather than sending the results to stdout.
I wrote a Bash script to try it out: download one of the packaged TSVs, decompress it with gzip,
and pipe it to awk.
gzip -dc path/to/chunk/file.gz |
awk -F '\t' \
'{print $1","..."$30 > "chunked/"$chr"_chr"$15".csv"}'
It worked!
Filling the cores
What I learned: gnu parallel is a magical thing; everyone should use it.
The splitting was fairly slow, and when I started htop to check on my powerful (and expensive) EC2 instance, it turned out I was using only one core and about 200 MB of memory. To solve the problem without losing a pile of money, I had to figure out how to parallelize the work. Fortunately, in a truly wonderful book on data science at the command line, I found a chapter on parallelization and learned about gnu parallel, a very flexible way of implementing multithreading in Unix.
When I started partitioning with the new process, everything was fine, but a bottleneck remained: downloading the S3 objects to disk was not very fast and was not fully parallelized. To fix this, I did the following:
- Found out that the S3 download stage could be implemented directly in the pipeline, completely eliminating intermediate storage on disk. This meant I could avoid writing raw data to disk and use smaller — and therefore cheaper — storage on AWS.
- Ran aws configure set default.s3.max_concurrent_requests 50
, which greatly increases the number of threads the AWS CLI uses (by default there are 10).
- Switched to an EC2 instance optimized for network speed, with the letter n in the name. I found that the loss of processing power when using n-instances is more than compensated by the increase in download speed. For most tasks I used c5n.4xl.
- Changed gzip to pigz, a gzip tool that can do wonderfully parallel things with the initially non-parallelized task of decompressing files (this helped the least).
# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50

for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do
    aws s3 cp s3://$batch_loc$chunk_file - |
    pigz -dc |
    parallel --block 100M --pipe \
    "awk -F '\t' '{print \$1\",...\"\$30 > \"chunked/{#}_chr\"\$15\".csv\"}'"

    # Combine all the parallel process chunks to single files
    ls chunked/ |
    cut -d '_' -f 2 |
    sort -u |
    parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'

    # Clean up intermediate data
    rm chunked/*
done
Combining all of these steps with each other made everything work very fast. By increasing the download speed and eliminating writes to disk, I could now process a 5-terabyte package in just a few hours.
Nothing makes me happier than paying for time on AWS and seeing all the cores in use. Thanks to gnu-parallel, I can decompress and split a 19-gig CSV about as fast as I can download it. Fireworks should go off for this.
#datascience #Linux pic.twitter.com/Nqyba2zqEk — Nick Strayer (@NicholasStrayer)
May 17, 2019
This tweet should have said "TSV". Alas.
Using the freshly parsed data
What I learned: Spark likes uncompressed data and does not like combining partitions.
Now the data was stored in S3 in an unpacked (read: splittable) and semi-ordered format, and I could go back to Spark. A surprise awaited me: I again failed to achieve what I wanted! It was very difficult to tell Spark exactly how the data was partitioned. And even when I managed it, there turned out to be too many partitions (95 thousand), and when I used coalesce
to reduce their number to reasonable limits, it destroyed my partitioning. I'm sure this can be fixed, but after a couple of days of searching I couldn't find a solution. In the end I got all the tasks in Spark to finish, although it took a while, and my split Parquet files were not very small (~200 KB). Still, the data was where it needed to be.
Small and uneven — just lovely!
Testing local Spark queries
What I learned: Spark has too much overhead for solving simple problems.
By downloading the data in a clever format, I was able to test the speed. I set up an R script to run a local Spark server, and then loaded a Spark data frame from the specified Parquet group (bin) storage. I tried loading all the data, but couldn't get Sparklyr to recognize the partitioning.
sc <- spark_connect(master = "local")

desired_snp <- 'rs34771739'

# Start a timer
start_time <- Sys.time()

# Load the desired bin into Spark
intensity_data <- sc %>%
  spark_read_parquet(
    name = 'intensity_data',
    path = get_snp_location(desired_snp),
    memory = FALSE )

# Subset bin to snp and then collect to local
test_subset <- intensity_data %>%
  filter(SNP_Name == desired_snp) %>%
  collect()

print(Sys.time() - start_time)
Execution took 29.415 seconds. Much better, but still not good enough for mass testing of anything. On top of that, I couldn't speed things up with caching, because when I tried caching the data frame in memory, Spark always crashed — even when I allocated more than 50 GB of memory to a dataset that weighed less than 15 GB.
Back to AWK
What I learned: AWK's associative arrays are very efficient.
I realized that I could get at the data much faster, and that AWK's associative arrays would do it brilliantly.
To use them, I turned to the BEGIN block in an AWK script
— code that is executed before the first line of data is passed to the body of the script.
join_data.awk
BEGIN {
  FS=",";
  batch_num=substr(chunk,7,1);
  chunk_id=substr(chunk,15,2);
  while(getline < "snp_to_bin.csv") {bin[$1] = $2}
}
{
  print $0 > "chunked/chr_"chr"_bin_"bin[$1]"_"batch_num"_"chunk_id".csv"
}
The while(getline...)
command loads all the rows from the CSV of groups (bins), sets the first column (the SNP name) as the key of the associative array bin
, and the second value (the group) as the value. Then, in the { }
block, which is executed against every line of the main file, each line is sent to an output file that gets a unique name depending on its group (bin): ..._bin_"bin[$1]"_...
.
The batch_num
and chunk_id
variables matched the data provided by the pipeline, which avoided a race condition: each execution thread started by parallel
wrote to its own unique file.
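The whole pattern can be exercised end to end on toy data (hypothetical SNP names and bins; the mapping file mirrors snp_to_bin.csv, minus the batch/chunk suffixes):

```shell
# Toy run of the BEGIN-block lookup pattern: a SNP-to-bin CSV is loaded
# into an associative array, then data rows are routed to per-bin files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir chunked
printf 'rs1,0\nrs2,0\nrs3,1\n' > snp_to_bin.csv
printf 'rs1,0.9\nrs3,0.2\nrs2,0.4\n' > data.csv

awk '
BEGIN {
  FS=","
  while ((getline < "snp_to_bin.csv") > 0) { bin[$1] = $2 }
}
{ print $0 > ("chunked/bin_" bin[$1] ".csv") }
' data.csv
```

The lookup file is read once into memory in BEGIN, so the main block costs only a hash lookup per row — which is why this scales to millions of SNPs.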
Since I had scattered all the raw data into per-chromosome folders left over from my earlier experiment with AWK, I could now write another Bash script to process one chromosome at a time and send the more deeply partitioned data to S3.
DESIRED_CHR='13'
# Download chromosome data from s3 and split into bins
aws s3 ls $DATA_LOC |
awk '{print $4}' |
grep 'chr'$DESIRED_CHR'.csv' |
parallel "echo 'reading {}'; aws s3 cp "$DATA_LOC"{} - | awk -v chr=""$DESIRED_CHR"" -v chunk="{}" -f split_on_chr_bin.awk"
# Combine all the parallel process chunks to single files and upload to rds using R
ls chunked/ |
cut -d '_' -f 4 |
sort -u |
parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"
rm chunked/*
The script has two parallel
sections.
In the first section, data is read from every file containing information on the desired chromosome; this data is then distributed across threads, which scatter the files into the appropriate groups (bins). To avoid a race condition when several threads write to one file, AWK passes the file names for writing the data to different places, e.g. chr_10_bin_52_batch_2_aa.csv
. As a result, a multitude of small files are created on disk (for this I used terabyte EBS volumes).
The conveyor in the second parallel
section goes through the groups (bins) and combines each one's individual files into a common CSV with cat
, and then sends them off for export.
Piping into R!
What I learned: you can talk to stdin
and stdout
from an R script, and therefore use it in a pipeline.
You may have noticed this line in the Bash script: ...cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R...
. It pipes all the concatenated group files (bins) into the R script below. {}
is a special parallel
technique that inserts the data it sends to the specified stream directly into the command itself. The {#}
option provides a unique thread ID, and {%}
represents the job slot number (repeated, but never at the same time). A list of all the options can be found in the documentation.
#!/usr/bin/env Rscript
library(readr)
library(aws.s3)
# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]
data_cols <- list(SNP_Name = 'c', ...)
s3saveRDS(
read_csv(
file("stdin"),
col_names = names(data_cols),
col_types = data_cols
),
object = data_destination
)
When the file("stdin")
variable is passed to readr::read_csv
, the data piped into the R script is loaded into a data frame, which is then written as an .rds
file directly to S3 using aws.s3
.
RDS is something like a junior version of Parquet, without the frills of columnar storage.
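The stdin plumbing here is not specific to R: any executable that reads stdin can sit at the end of the pipe. A minimal sketch with a stand-in shell script playing the role of upload_as_rds.R (the script name and its row-counting behavior are invented for the demo):

```shell
# A pipeline stage that reads rows from stdin and writes a result to the
# destination given as its first argument, like upload_as_rds.R does.
workdir=$(mktemp -d)
cat > "$workdir/consume_stdin.sh" <<'EOF'
#!/bin/sh
dest=$1
# Rows arrive on stdin; here we just count them into the destination file.
wc -l | tr -d ' \t' > "$dest"
EOF
chmod +x "$workdir/consume_stdin.sh"

printf 'rs1,0.9\nrs2,0.4\n' | "$workdir/consume_stdin.sh" "$workdir/out.txt"
```

Swapping the row count for read_csv-plus-s3saveRDS gives exactly the shape of the R stage above: data in on stdin, destination as an argument, nothing touching local disk in between.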
When the Bash script finished, I had a bundle of .rds
files sitting in S3, which let me use efficient compression and built-in types.
Even though slow old R was in the middle of all this, everything worked very fast. Not surprisingly, the fragments of R responsible for reading and writing data are well optimized. After testing on one medium-sized chromosome, the job finished on a C5n.4xl instance in a couple of hours.
S3 limits
What I learned: thanks to its smart path implementation, S3 can handle a lot of files.
I worried whether S3 could cope with the multitude of files being transferred to it. I could make the file names meaningful, but how would S3 look them up?
Folders in S3 are just for show; in fact, the system is not interested in the /
symbol.
S3 apparently represents the path to a particular file as a simple key in a kind of hash table or document-based database. A bucket can be thought of as a table, and files as records in that table.
Since speed and efficiency matter for turning a profit at Amazon, it's no surprise that this key-as-a-file-path system is insanely optimized. I tried to find a balance: avoiding a large number of get requests while still having requests execute quickly. It turned out that it's best to make about 200 thousand bin files. I think further optimization could squeeze out more speed (for example, making a dedicated bucket just for the data, shrinking the size of the lookup table). But there was no time or money left for more experiments.
What about cross-compatibility?
What I learned: the number-one cause of wasted time is optimizing your storage method prematurely.
At this point it is very important to ask yourself: "Why use a proprietary file format?" The reason comes down to load speed (gzip-compressed CSV files took 7 times longer to load) and compatibility with our workflows. I may reconsider if R can easily load Parquet (or Arrow) files without the Spark overhead. Everyone in our lab uses R, and if I ever need to convert the data to another format, I still have the original text data, so I can just run the pipeline again.
Dividing up the work
What I learned: don't try to optimize jobs by hand; let the computer do it.
With the workflow debugged on one chromosome, I now needed to process all the other data.
I wanted to raise several EC2 instances for the conversion, but at the same time I was afraid of ending up with a wildly unbalanced load across the different processing jobs (just as Spark suffered from unbalanced partitions). Besides, I wasn't interested in raising one instance per chromosome, since AWS accounts have a default limit of 10 instances.
Then I decided to write a script in R to optimize the processing jobs.
First, I asked S3 to calculate how much storage space each chromosome occupied.
library(aws.s3)
library(tidyverse)
chr_sizes <- get_bucket_df(
bucket = '...', prefix = '...', max = Inf
) %>%
mutate(Size = as.numeric(Size)) %>%
filter(Size != 0) %>%
mutate(
# Extract chromosome from the file name
chr = str_extract(Key, 'chr.{1,4}.csv') %>%
str_remove_all('chr|.csv')
) %>%
group_by(chr) %>%
summarise(total_size = sum(Size)/1e+9) # Divide to get value in GB
# A tibble: 27 x 2
chr total_size
<chr> <dbl>
1 0 163.
2 1 967.
3 10 541.
4 11 611.
5 12 542.
6 13 364.
7 14 375.
8 15 372.
9 16 434.
10 17 443.
# ⊠with 17 more rows
Next, I wrote a function that takes the total size, shuffles the order of the chromosomes, divides them into num_jobs
groups, and reports how different the sizes of all the processing jobs are.
num_jobs <- 7
# How big would each job be if perfectly split?
job_size <- sum(chr_sizes$total_size)/num_jobs
shuffle_job <- function(i){
chr_sizes %>%
sample_frac() %>%
mutate(
cum_size = cumsum(total_size),
job_num = ceiling(cum_size/job_size)
) %>%
group_by(job_num) %>%
summarise(
job_chrs = paste(chr, collapse = ','),
total_job_size = sum(total_size)
) %>%
mutate(sd = sd(total_job_size)) %>%
nest(-sd)
}
shuffle_job(1)
# A tibble: 1 x 2
sd data
<dbl> <list>
1 153. <tibble [7 Ã 3]>
Then I ran a thousand shuffles using purrr and picked the best one:
1:1000 %>%
map_df(shuffle_job) %>%
filter(sd == min(sd)) %>%
pull(data) %>%
pluck(1)
As a result, I ended up with a set of tasks that were very similar in size. After that, all that remained was to wrap my earlier Bash script in a big for
loop. Writing this optimization took about 10 minutes. And that's much less than I would have spent manually creating the tasks if they had been unbalanced. So I think this preliminary optimization was the right call.
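The assignment step inside shuffle_job() is just a cumulative sum plus a ceiling division, which can be sketched in awk on hypothetical chromosome sizes (the GB figures loosely echo the tibble above; num_jobs = 2):

```shell
# Assign each chromosome to a job by cumulative size, as in shuffle_job():
# job_num = ceiling(cum_size / job_size), with job_size = total / num_jobs.
sizes='chr1 967
chr2 541
chr3 364
chr4 375'
total=$(printf '%s\n' "$sizes" | awk '{ s += $2 } END { print s }')

assignment=$(printf '%s\n' "$sizes" |
awk -v total="$total" -v jobs=2 '
{
  cum += $2
  job = int((cum * jobs + total - 1) / total)   # integer ceiling of cum/(total/jobs)
  print $1, job
}')
echo "$assignment"
```

With these sizes, chr1 fills job 1 on its own and the remaining chromosomes land in job 2 — the same cum_size/job_num bookkeeping the R code does, minus the shuffling and the standard-deviation scoring.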
for DESIRED_CHR in "16" "9" "7" "21" "MT"
do
# Code for processing a single chromosome
done
At the end, I added the shutdown command:
sudo shutdown -h now
...and it all worked! Using the AWS CLI, I raised the instances, passing them the Bash scripts of their jobs via the user_data
option. The scripts ran and shut the instances down automatically, so I wasn't paying for extra processing power.
aws ec2 run-instances ...
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<<job_name>>}]"
--user-data file://<<job_script_loc>>
Packing it up!
What I learned: the API should be simple, for the sake of ease and flexibility of use.
I finally had the data in the right place and format. All that remained was to simplify the process of using the data as much as possible, to make life easier for my colleagues. I wanted to make a simple API for creating requests. If in the future I decide to switch from .rds
to Parquet files, that should be a problem for me, not for my colleagues. For this I decided to make an internal R package.
I built and documented a very simple package containing just a few data-access functions, organized around the function get_snp
. I also made a website for my colleagues, so they can easily see examples and documentation.
Smart caching
What I learned: if your data is well prepared, caching will be easy!
Since one of the main workflows applies the same package of analysis models to a package of SNPs, I decided to use the binning to my advantage. When data for a SNP is transferred, all the information from its group (bin) is attached to the returned object. That is, old queries can (in theory) speed up the processing of new queries.
# Part of get_snp()
...
# Test if our current snp data has the desired snp.
already_have_snp <- desired_snp %in% prev_snp_results$snps_in_bin
if(!already_have_snp){
# Grab info on the bin of the desired snp
snp_results <- get_snp_bin(desired_snp)
# Download the snp's bin data
snp_results$bin_data <- aws.s3::s3readRDS(object = snp_results$data_loc)
} else {
# The previous snp data contained the right bin so just use it
snp_results <- prev_snp_results
}
...
While building the package, I ran a lot of benchmarks comparing the speed of different approaches. I recommend not ignoring surprising results. For example, dplyr::filter
turned out to be much faster than capturing rows with index-based filtering, and retrieving a single column from a filtered data frame was much faster than using index syntax.
Note that the prev_snp_results
object contains the key snps_in_bin
. This is an array of all the unique SNPs in the group (bin), which lets you quickly check whether you already have the data from a previous query. It also makes it easy to loop over all the SNPs in a group (bin) with code like this:
# Get bin-mates
snps_in_bin <- my_snp_results$snps_in_bin
for(current_snp in snps_in_bin){
my_snp_results <- get_snp(current_snp, my_snp_results)
# Do something with results
}
Results
Now we can (and have seriously begun to) run models and scenarios that were previously inaccessible to us. The best part is that my labmates don't have to think about any of the complications. They just have a function that works.
Although the package spares them the details, I tried to make the data format simple enough that they could figure it out if I suddenly disappeared tomorrow...
The speed has increased noticeably. We usually scan functionally significant genome fragments. Before, we couldn't do this (it turned out to be too expensive), but now, thanks to the group (bin) structure and caching, a request for a single SNP takes on average less than 0.1 seconds, and data usage is so low that S3 costs are peanuts.
My lab recently needed to process 25+ TB of raw genotyping data. When I started, running it with Spark took 8 minutes and a single-SNP query cost $20. With AWK +
#rstats, processing takes less than a tenth of a second and costs about $0.00001. A personal #bigdata win. pic.twitter.com/ANOXVGrmkk — Nick Strayer (@NicholasStrayer)
May 30, 2019
Conclusions
This article is not at all a guide. My solution turned out to be individual, and almost certainly not optimal. Rather, it's a travelogue. I want other people to understand that decisions like these don't appear fully formed in your head; they're the result of trial and error. Also, if you're looking to hire a data scientist, keep in mind that using these tools effectively requires experience, and experience costs money. I'm happy that I had the money to pay for it, but many others who could do the same job better than me will never get the chance, because they don't have the money to even try.
Big data tools are multipurpose. If you have the time, you can almost certainly write a faster solution using smart data cleaning, storage, and extraction techniques. Ultimately, it comes down to a cost-benefit analysis.
What I learned:
- There is no cheap way to parse 25 TB at a time.
- Be careful with the size of your Parquet files and their organization.
- Partitions in Spark must be balanced.
- In general, never try to make 2.5 million partitions.
- Sorting is still hard, and so is setting up Spark.
- Sometimes special data needs special solutions.
- Joins in Spark are fast, but partitioning is still expensive.
- Don't sleep while they're teaching you the basics; someone probably already solved your problem back in the 1980s.
- gnu parallel
is a magical thing; everyone should use it.
- Spark likes uncompressed data and does not like combining partitions.
- Spark has too much overhead for solving simple problems.
- AWK's associative arrays are very efficient.
- You can talk to stdin
and stdout
from an R script, and therefore use it in a pipeline.
- Thanks to its smart path implementation, S3 can handle a lot of files.
- The number-one reason for wasting time is optimizing your storage method prematurely.
- Don't try to optimize tasks by hand; let the computer do it.
- The API should be simple, for the sake of ease and flexibility of use.
- If your data is well prepared, caching will be easy!
Source: habr.com