How to read this article: I apologize for how long and chaotic the text is. To save you time, I open each chapter with a "What I learned" introduction that sums up the gist of the chapter in a sentence or two.
"Just show me the solution!" If you only want to see where I ended up, skip ahead to the chapter "Getting more inventive", but I think it's more interesting, and more useful, to read about the failures.
I was recently tasked with setting up a process for handling a large volume of raw DNA sequences (technically, a SNP chip). The requirement was to be able to quickly pull the data for a given genetic location (called a SNP) for subsequent modeling and other tasks. Using R and AWK, I managed to clean and organize the data in a natural way, dramatically speeding up query processing. This was not easy for me and took many iterations. This article will help you avoid some of my mistakes and show you what I ended up with.
First, a few introductory explanations.
The data
Our university's genetic data processing center gave us the data as 25 TB of TSVs. I received them split into 5 Gzip-compressed packages, each containing roughly 240 four-gigabyte files. Each row held the data for one SNP from one person. In total, data on ~2.5 million SNPs and ~60 thousand people was transferred. Besides the SNP information, the files contained numerous columns of numbers reflecting various characteristics, such as read intensity, frequency of different alleles, etc. In total there were about 30 columns with unique values.
The goal
As with any data management project, the most important thing was to determine how the data would be used. In this case, we will mostly be selecting models and workflows based on SNPs. That is, we will only need the data on one SNP at a time. I had to learn to retrieve all the records associated with one of the 2.5 million SNPs as easily, quickly, and cheaply as possible.
How not to do it
To quote an apt cliché:
I have not failed a thousand times, I have just discovered a thousand ways not to parse lots of data into a query-friendly format.
First attempt
What I learned: there is no cheap way to parse 25 TB at a time.
After taking the "Advanced Methods for Big Data" course at Vanderbilt University, I was sure the trick was in the bag. Setting up a Hive server to run over all the data and report the result would probably take an hour or two. Since our data is stored in AWS S3, I used Athena, a service that lets you apply Hive SQL queries to data in S3.
After pointing Athena at my data and its format, I ran a few tests with queries like this:
select * from intensityData limit 10;
And quickly got well-formed results. Done.
Until we tried to use the data for real work...
I was asked to pull all the records for one SNP so we could test a model on it. I ran the query:
select * from intensityData
where snp = 'rs123456';
... and began to wait. After eight minutes and more than 4 TB of data scanned, I got the result. Athena charges by the volume of data scanned, $5 per terabyte. So this single request cost $20 and eight minutes of waiting. To run the model over all the data, we would have had to wait 38 years and pay $50 million. Obviously, that wouldn't work for us.
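The back-of-the-envelope math behind those numbers can be reproduced in one line (this sketch just re-derives the figures quoted above):

```shell
# Back-of-the-envelope Athena cost estimate, using the numbers above:
# $5 per TB scanned, ~4 TB scanned per single-SNP query,
# ~2.5 million SNPs, ~8 minutes per query.
awk 'BEGIN {
  cost_per_query = 5 * 4                          # dollars per query
  total_cost_m   = cost_per_query * 2500000 / 1e6 # millions of dollars
  years          = 8 * 2500000 / (60 * 24 * 365)  # sequential wall time
  printf "per query: $%d, all SNPs: $%dM, wall time: ~%d years\n",
         cost_per_query, total_cost_m, years
}'
# → per query: $20, all SNPs: $50M, wall time: ~38 years
```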
It was time for Parquet...
What I learned: watch the size of your Parquet files and how they are organized.
I first tried to fix the situation by converting all the TSVs into Parquet files, and I set up a simple Glue job to do the conversion.
Interestingly, Parquet's default (and recommended) compression, snappy, is not splittable. As a result, each executor was stuck with the job of decompressing and downloading the complete 3.5 GB dataset.
Understanding the problem
What I learned: sorting is hard, especially when the data is distributed.
It seemed to me that I now understood the root of the problem. All I needed was to sort the data by the SNP column, not by people. Then several SNPs would sit in each data chunk, and Parquet's smart "open only if the value is in range" feature would show itself in all its glory. Unfortunately, sorting billions of rows scattered across a cluster proved to be a difficult task.
Me taking an algorithms class in college: "Psh, who cares about the computational complexity of all these sorting algorithms?"
Me trying to sort a single column across 20 TB: "Why is this taking so long?" #DataScience struggles. — Nick Strayer (@NicholasStrayer)
March 11, 2019
AWS was not about to issue a refund over an "absent-minded student" excuse. After I ran the sort on Amazon Glue, it ran for 2 days and then crashed.
What about partitioning?
What I learned: partitions in Spark must be balanced.
I then hit on the idea of partitioning the data by chromosome. There are 23 of them (plus several more if you count mitochondrial DNA and unmapped regions).
This would let me split the data into smaller pieces. Adding a single line to the Spark export job in the Glue script, partition_by = "chr", should distribute the data into those buckets.
The genome is made up of many fragments called chromosomes.
Unfortunately, it didn't work. Chromosomes have different sizes, which means different amounts of data. This meant the tasks Spark sent to the workers were unbalanced and completed slowly, since some nodes finished early and then sat idle. The jobs did complete, though. But when requesting one SNP, the imbalance caused problems again. The cost of processing SNPs on large chromosomes (that is, exactly where we want to get the data) only dropped by a factor of about 10. A lot, but not enough.
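To see why chromosome-level partitions are skewed by construction, here is a toy sketch. The megabase lengths below are rounded reference values (my assumption for illustration; the actual per-chromosome data volumes in our files were skewed in the same way):

```shell
# Approximate human chromosome lengths in megabases (rounded).
printf '%s\n' 'chr1 249' 'chr2 243' 'chr17 83' 'chr21 47' 'chr22 51' |
awk '{
  if ($2 > max) max = $2              # track the biggest partition
  if (min == 0 || $2 < min) min = $2  # and the smallest
}
END { printf "largest/smallest partition ratio: ~%.1f\n", max/min }'
# → largest/smallest partition ratio: ~5.3
```

With a handful of workers, the node holding chr1 is still busy long after the node holding chr21 has gone idle.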
What if we split it into even smaller pieces?
What I learned: never, ever try to make 2.5 million partitions.
I decided to go all in and partition by each individual SNP. This guaranteed partitions of equal size. IT WAS A BAD IDEA. I used Glue and added the innocent line partition_by = 'snp'. The job started and began running. A day later I checked and saw that still nothing had been written to S3, so I killed the job. It looked like Glue had been writing intermediate files to a hidden S3 location, and a lot of them, maybe a couple of million. As a result, my mistake cost more than a thousand dollars and did not please my advisor.
Partitioning + sorting
What I learned: sorting is still hard, and so is tuning Spark.
My last attempt at partitioning involved partitioning by chromosome and then sorting each partition. In theory, this would speed up each query, because the desired SNP data would have to sit within a few Parquet chunks inside a given range. Alas, sorting even partitioned data turned out to be a hard task. As a result, I switched to EMR with a custom cluster, used eight powerful instances (C5.4xl) and Sparklyr to build a more flexible workflow...
# Sparklyr snippet to partition by chr and sort w/in partition
# Join the raw data with the snp bins
raw_data %>%
  group_by(chr) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr')
  )
...but the job still didn't finish. I tuned it every which way: increased the memory allocated to each query executor, used nodes with a large amount of memory, used broadcast variables, but every time these turned out to be half-measures, and gradually the executors began to fail, one after another, until everything ground to a halt.
Update: so it begins.
pic.twitter.com/agY4GU2ru5 — Nick Strayer (@NicholasStrayer)
May 15, 2019
Getting more inventive
What I learned: sometimes special data calls for special solutions.
Each SNP has a position value, a number counting bases along its chromosome. This is a nice and natural way to organize our data. At first I wanted to partition by regions of each chromosome. For example, positions 1-2000, 2001-4000, and so on. But the problem is that SNPs are not evenly distributed along the chromosomes, so the group sizes would vary greatly.
As a result, I arrived at breaking positions into ranked groups (bins). Using the data I had already downloaded, I ran a request for the list of unique SNPs, their positions, and their chromosomes. I then sorted the data within each chromosome and gathered the SNPs into bins of a given size, say 1000 SNPs each. This gave me a SNP-to-bin-per-chromosome mapping.
In the end I made bins of 75 SNPs; the reason is explained below.
snp_to_bin <- unique_snps %>%
  group_by(chr) %>%
  arrange(position) %>%
  mutate(
    rank = 1:n(),
    bin = floor(rank/snps_per_bin)
  ) %>%
  ungroup()
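Since the rest of the pipeline ends up in awk anyway, the same rank-and-bin logic can be sketched there too. This is purely illustrative: hypothetical input rows of `snp chr position`, already sorted by chromosome and position, with 3 SNPs per bin instead of 75, mirroring the floor(rank/snps_per_bin) convention above:

```shell
printf '%s\n' 'rs1 1 100' 'rs2 1 250' 'rs3 1 300' 'rs4 1 410' 'rs5 2 90' |
awk -v snps_per_bin=3 '{
  # restart the rank counter at each new chromosome
  if ($2 != prev_chr) { rank = 0; prev_chr = $2 }
  rank++
  print $1, $2, int(rank / snps_per_bin)  # snp, chr, bin
}'
```

Here rs3 and rs4 land in bin 1 of chromosome 1, while rs5 starts over in bin 0 of chromosome 2.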
First try with Spark
What I learned: Spark joins are fast, but partitioning is still expensive.
I wanted to read this small data frame (2.5 million rows) into Spark, join it with the raw data, and then partition by the newly added bin column.
# Join the raw data with the snp bins
data_w_bin <- raw_data %>%
  left_join(sdf_broadcast(snp_to_bin), by = 'snp_name') %>%
  group_by(chr_bin) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr_bin')
  )
I used sdf_broadcast(), which tells Spark that it should send the data frame to all nodes. This is useful when the data is small and needed by every task. Otherwise, Spark tries to be clever and distributes the data as needed, which can cause slowdowns.
Once again, my idea didn't work: the tasks ran for a while, completed the join, and then, like the executors launched by partitioning, they began to fail.
Bringing in AWK
What I learned: don't sleep through the basics. Someone probably solved your problem back in the 1980s.
Up to this point, the reason for all my failures with Spark had been the jumble of data in the cluster. Perhaps pre-processing could improve the situation. I decided to try splitting the raw text data by chromosome, hoping to hand Spark "pre-partitioned" data.
I searched StackOverflow for how to split a file by column values and found a great answer: with AWK you can write out to files from within the script itself, instead of sending the results to stdout.
I wrote a Bash script to try it. It downloaded one of the packed TSVs, decompressed it with gzip, and piped it to awk.
gzip -dc path/to/chunk/file.gz |
awk -F '\t' \
'{print $1",..."$30 > "chunked/chr"$15".csv"}'
It worked!
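A minimal, self-contained version of the trick looks like this (toy two-column rows with a hypothetical chromosome field in $2; in the real files the columns were tab-separated and the chromosome sat in $15):

```shell
# Split rows into per-chromosome files based on a column value.
mkdir -p chunked_demo
printf 'rs1\t1\nrs2\t2\nrs3\t1\n' |
awk -F '\t' '{ print $0 > "chunked_demo/chr_"$2".csv" }'
# rows for chromosome 1 (rs1 and rs3) now sit together in chunked_demo/chr_1.csv
```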
Saturating the cores
What I learned: gnu parallel is a magical thing, everyone should use it.
The splitting was quite slow, and when I ran htop to check the utilization of a powerful (and expensive) EC2 instance, it turned out I was using only one core and about 200 MB of memory. To solve the problem without losing a lot of money, I had to figure out how to parallelize the work. Luckily, a chapter in an absolutely amazing book on data science at the command line introduced me to gnu parallel, a very flexible way of implementing multithreading in Unix.
When I started the splitting with the new process, everything went fine, but there was still a bottleneck: downloading S3 objects to disk was not very fast and not fully parallelized. To fix this, I did the following:
- I discovered that it was possible to implement the S3 download stage directly in the pipeline, completely eliminating intermediate storage on disk. This meant I could avoid writing raw data to disk and use smaller, and therefore cheaper, storage on AWS.
- The command
aws configure set default.s3.max_concurrent_requests 50
greatly increased the number of threads used by the AWS CLI (the default is 10).
- I switched to an EC2 instance optimized for network speed, with the letter n in the name. I found that the loss of processing power with n-instances is more than compensated by the increased download speed. For most tasks I used c5n.4xl.
- I swapped gzip for pigz, a gzip tool that can do cool things to parallelize the initially unparallelized job of decompressing files (this helped the least).
# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50
for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do
aws s3 cp s3://$batch_loc$chunk_file - |
pigz -dc |
parallel --block 100M --pipe \
"awk -F '\t' '{print \$1\",...\"\$30 > \"chunked/{#}_chr\"\$15\".csv\"}'"
# Combine all the parallel process chunks to single files
ls chunked/ |
cut -d '_' -f 2 |
sort -u |
parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'
# Clean up intermediate data
rm chunked/*
done
These steps combined to make everything work very fast. By increasing download speed and eliminating disk writes, I could now process a 5 terabyte package in just a few hours.
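For a sense of scale, assuming "a few hours" means roughly three (my assumption; the text does not give an exact figure), that works out to a sustained rate of:

```shell
# Hypothetical throughput estimate: 5 TB in ~3 hours.
awk 'BEGIN {
  mb = 5 * 1024 * 1024        # 5 TB expressed in MB
  printf "~%d MB/s sustained\n", mb / (3 * 3600)
}'
# → ~485 MB/s sustained
```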
Nothing beats seeing all the cores you pay AWS for actually in use. Thanks to gnu-parallel I can unpack and split a 19-gig csv about as fast as I can download it. I couldn't even get spark to do this.
#DataScience #Linux pic.twitter.com/Nqyba2zqEk — Nick Strayer (@NicholasStrayer)
May 17, 2019
That tweet should have said "TSV". Alas.
Using the freshly parsed data
What I learned: Spark likes uncompressed data and does not like combining partitions.
Now the data sat in S3 in an unpacked (read: splittable) and semi-ordered format, and I could return to Spark again. A surprise awaited me: I again failed to achieve what I wanted! It was very difficult to tell Spark exactly how the data was partitioned. And even when I managed it, there turned out to be too many partitions (95 thousand), and when I used coalesce to reduce their number to reasonable limits, it destroyed my partitioning. I'm sure this can be fixed, but after a couple of days of searching I could not find a solution. I eventually completed all the tasks in Spark, although it took a while and my split Parquet files were not very small (~200 KB). Still, the data was where it needed to be.
Too small and uneven, wonderful!
Testing local Spark queries
What I learned: Spark has too much overhead for solving simple problems.
With the data loaded in a clever format, I could test the speed. I set up an R script that started a local Spark server and then loaded a Spark data frame from the storage of a given Parquet bin. I tried loading all the data, but I couldn't get Sparklyr to recognize the partitioning.
sc <- spark_connect(master = "local")
desired_snp <- 'rs34771739'
# Start a timer
start_time <- Sys.time()
# Load the desired bin into Spark
intensity_data <- sc %>%
  spark_read_parquet(
    name = 'intensity_data',
    path = get_snp_location(desired_snp),
    memory = FALSE
  )
# Subset bin to snp and then collect to local
test_subset <- intensity_data %>%
filter(SNP_Name == desired_snp) %>%
collect()
print(Sys.time() - start_time)
Execution took 29.415 seconds. Much better, but still not great for mass testing of anything. Besides, I couldn't speed things up with caching, because when I tried to cache a data frame in memory, Spark always crashed, even when I allocated more than 50 GB of memory to a dataset that weighed less than 15.
Back to AWK
What I learned: associative arrays in AWK are very efficient.
I realized I could reach a higher speed still. I remembered reading in a wonderful AWK tutorial about a feature called associative arrays.
To use them, I employed the BEGIN block in the AWK script. This is a piece of code that runs before the first line of data is passed to the main body of the script.
join_data.awk
BEGIN {
FS=",";
batch_num=substr(chunk,7,1);
chunk_id=substr(chunk,15,2);
while(getline < "snp_to_bin.csv") {bin[$1] = $2}
}
{
print $0 > "chunked/chr_"chr"_bin_"bin[$1]"_"batch_num"_"chunk_id".csv"
}
The while(getline...) statement loads all the rows from the bin CSV file, setting the first column (the SNP name) as the key of the bin associative array and the second value (the bin) as its value. Then, in the { } block, which runs on every row of the main file, each row is written to an output file whose unique name depends on its bin: ..._bin_"bin[$1]"_....
The batch_num and chunk_id variables matched the data supplied by the pipeline, which avoided a race condition: each execution thread run by parallel wrote to its own unique file.
Since I had already scattered all the raw data into per-chromosome folders during my earlier AWK experiment, I could now write another Bash script to process one chromosome at a time and send the more deeply partitioned data to S3.
DESIRED_CHR='13'
# Download chromosome data from s3 and split into bins
aws s3 ls $DATA_LOC |
awk '{print $4}' |
grep 'chr'$DESIRED_CHR'.csv' |
parallel "echo 'reading {}'; aws s3 cp "$DATA_LOC"{} - | awk -v chr=\"$DESIRED_CHR\" -v chunk=\"{}\" -f split_on_chr_bin.awk"
# Combine all the parallel process chunks to single files and upload to rds using R
ls chunked/ |
cut -d '_' -f 4 |
sort -u |
parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"
rm chunked/*
The script has two parallel sections.
In the first section, data is read from all the files containing information on the desired chromosome, then this data is distributed across threads, which scatter the files into the appropriate bins. To avoid race conditions when multiple threads write to the same file, AWK passes file names so the data is written to different places, e.g. chr_10_bin_52_batch_2_aa.csv. As a result, a lot of small files are created on disk (for this I used terabyte EBS volumes).
The pipeline in the second parallel section walks through the bins, combines each bin's individual files into a common CSV with cat, and then sends them off for export.
Piping into R?
What I learned: you can access stdin and stdout from an R script, and therefore use it in a pipeline.
You may have noticed this line in the Bash script: ...cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R.... It pipes all the concatenated bin files into the R script below. {} is special parallel syntax that inserts the data sent to it directly into the command itself. The {#} option gives a unique job ID, and {%} represents the job slot number (repeated, but never simultaneously). The full list of options can be found in the documentation.
#!/usr/bin/env Rscript
library(readr)
library(aws.s3)
# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]
data_cols <- list(SNP_Name = 'c', ...)
s3saveRDS(
read_csv(
file("stdin"),
col_names = names(data_cols),
col_types = data_cols
),
object = data_destination
)
When the file("stdin") argument is passed to readr::read_csv, the data piped into the R script is loaded into a data frame, which is then written, in .rds form, directly to S3 using aws.s3.
RDS is something like a junior version of Parquet, without the frills of columnar storage.
After finishing the Bash script, I ended up with a bundle of .rds files in S3, which let me use efficient compression and built-in types.
Even with notoriously slow R involved, everything worked very fast. Unsurprisingly, the parts of R responsible for reading and writing data are well optimized. After a test on one medium-sized chromosome, the job finished on a c5n.4xl instance in about two hours.
S3 limits
What I learned: thanks to its smart path implementation, S3 can handle a lot of files.
I was worried whether S3 could cope with the multitude of files being transferred to it. I could make the file names meaningful, but how would S3 look them up?
Folders in S3 are just for show; in fact, the system takes no interest in the / symbol.
It appears that S3 represents the path to a given file as a simple key in something like a hash table or a document-based database. A bucket can be thought of as a table, and files as records in that table.
Since speed and efficiency are essential to making money at Amazon, it's no surprise that this key-as-a-file-path system is heavily optimized. I tried to find a balance: avoiding a huge number of get requests while keeping each request fast. It turned out that making about 20 thousand bin files worked best. I think further optimization could yield more speed (for example, making a dedicated bucket just for the data, thereby shrinking the lookup table). But there was no time or money for further experiments.
What about cross-compatibility?
What I learned: the number one cause of wasted time is optimizing your storage method prematurely.
At this point, it is very important to ask yourself: "Why use a proprietary file format?" The reason lies in loading speed (gzipped CSV files took 7 times longer to load) and compatibility with our workflows. I may reconsider if R can ever easily load Parquet (or Arrow) files without the Spark overhead. Everyone in our lab uses R, and if I ever need to convert the data to another format, I still have the original text data, so I can just run the pipeline again.
Dividing up the work
What I learned: don't try to optimize jobs by hand, let the computer do it.
I had debugged the workflow on one chromosome; now I needed to process all the rest of the data.
I wanted to spin up several EC2 instances for the conversion, but at the same time I was afraid of getting a wildly unbalanced load across the different processing jobs (just as Spark suffered from unbalanced partitions). Besides, I had no interest in spinning up one instance per chromosome, since AWS accounts have a default limit of 10 instances.
So I decided to write an R script to optimize the processing jobs.
First, I asked S3 to calculate how much storage space each chromosome occupies.
library(aws.s3)
library(tidyverse)
chr_sizes <- get_bucket_df(
bucket = '...', prefix = '...', max = Inf
) %>%
mutate(Size = as.numeric(Size)) %>%
filter(Size != 0) %>%
mutate(
# Extract chromosome from the file name
chr = str_extract(Key, 'chr.{1,4}.csv') %>%
str_remove_all('chr|.csv')
) %>%
group_by(chr) %>%
summarise(total_size = sum(Size)/1e+9) # Divide to get value in GB
# A tibble: 27 x 2
chr total_size
<chr> <dbl>
1 0 163.
2 1 967.
3 10 541.
4 11 611.
5 12 542.
6 13 364.
7 14 375.
8 15 372.
9 16 434.
10 17 443.
# … with 17 more rows
Then I wrote a function that takes the total size, shuffles the order of the chromosomes, splits them into num_jobs groups, and reports how much the sizes of all the processing jobs differ.
num_jobs <- 7
# How big would each job be if perfectly split?
job_size <- sum(chr_sizes$total_size)/num_jobs
shuffle_job <- function(i){
chr_sizes %>%
sample_frac() %>%
mutate(
cum_size = cumsum(total_size),
job_num = ceiling(cum_size/job_size)
) %>%
group_by(job_num) %>%
summarise(
job_chrs = paste(chr, collapse = ','),
total_job_size = sum(total_size)
) %>%
mutate(sd = sd(total_job_size)) %>%
nest(-sd)
}
shuffle_job(1)
# A tibble: 1 x 2
sd data
<dbl> <list>
1 153. <tibble [7 × 3]>
Then I ran a thousand shuffles with purrr and picked the best one.
1:1000 %>%
map_df(shuffle_job) %>%
filter(sd == min(sd)) %>%
pull(data) %>%
pluck(1)
This left me with a set of jobs very similar in size. All that remained was to wrap my earlier Bash script in a big for loop. Writing this optimization took about 10 minutes, far less than I would have spent manually creating jobs had they been unbalanced. So I think I was right to do this preliminary optimization.
for DESIRED_CHR in "16" "9" "7" "21" "MT"
do
# Code for processing a single chromosome
done
At the end I added the shutdown command:
sudo shutdown -h now
... and everything worked! Using the AWS CLI, I spun up the instances, passing them their job Bash scripts via the user_data option. They ran and shut down automatically, so I wasn't paying for extra processing power.
aws ec2 run-instances ...
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<<job_name>>}]"
--user-data file://<<job_script_loc>>
Let's make a package!
What I learned: the API should be simple, for ease and flexibility of use.
I finally had the data in the right place and form. All that remained was to make using the data as frictionless as possible for my colleagues. I wanted to offer a simple API for building queries. If in the future I decide to switch from .rds to Parquet files, that should be my problem, not my colleagues'. So I decided to make an internal R package.
I built and documented a very simple package containing just a few data-access functions gathered around a get_snp function. I also made a documentation website for my colleagues, so they can easily see examples and docs.
Smart caching
What I learned: if your data is well prepared, caching will be easy!
Since one of the main workflows applied the same analysis model to a package of SNPs, I decided to use the binning to my advantage. When fetching data for a SNP, all the information from its bin is attached to the returned object. This means old queries can (in theory) speed up the processing of new ones.
# Part of get_snp()
...
# Test if our current snp data has the desired snp.
already_have_snp <- desired_snp %in% prev_snp_results$snps_in_bin
if(!already_have_snp){
# Grab info on the bin of the desired snp
snp_results <- get_snp_bin(desired_snp)
# Download the snp's bin data
snp_results$bin_data <- aws.s3::s3readRDS(object = snp_results$data_loc)
} else {
# The previous snp data contained the right bin so just use it
snp_results <- prev_snp_results
}
...
While building the package, I ran a lot of benchmarks comparing the speed of different approaches. I recommend not skipping this, because the results are sometimes unexpected. For example, dplyr::filter was much faster than grabbing rows with index-based filtering, and retrieving a single column from the filtered data frame was much faster than using indexing syntax.
Note that the prev_snp_results object contains the key snps_in_bin. This is an array of all the unique SNPs in the bin, letting you quickly check whether you already have the data from a previous query. It also makes it easy to loop over all the SNPs in a bin with this code:
# Get bin-mates
snps_in_bin <- my_snp_results$snps_in_bin
for(current_snp in snps_in_bin){
my_snp_results <- get_snp(current_snp, my_snp_results)
# Do something with results
}
Results
We can now (and have seriously begun to) run models and scenarios that were previously inaccessible to us. Best of all, my lab colleagues don't have to think about any of the complications. They just have a function that works.
And although the package spares them the details, I tried to make the data format simple enough that they could figure it out if I suddenly vanished tomorrow...
The speed has increased noticeably. We usually scan functionally significant fragments of the genome. Previously we couldn't do this (it was too expensive), but now, thanks to the bin structure and caching, a request for one SNP takes on average less than 0.1 seconds, and the data usage is so low that the S3 costs are peanuts.
Recently got my lab switched over from wrangling 25+ TB of raw genotyping data. When I started, querying one SNP with spark took 8 minutes and cost $20. After the AWK + #rstats processing, it now takes under a tenth of a second and costs $0.00001. A personal #bigdata win. pic.twitter.com/ANOXVGrmkk — Nick Strayer (@NicholasStrayer)
May 30, 2019
Conclusion
This article is not a guide at all. The solution turned out to be individual, and almost certainly not optimal. Rather, it is a travelogue. I want others to understand that such decisions do not appear fully formed in one's head; they are the result of trial and error. Also, if you're looking to hire a data scientist, keep in mind that using these tools effectively requires experience, and experience costs money. I'm happy that I had the means to pay for it, but many others, who could do the same job better than me, will never even get the chance for lack of money to even try.
Big data tools are versatile. If you have the time, you can almost certainly write a faster solution using smart data cleaning, storage, and extraction techniques. Ultimately it comes down to a cost-benefit analysis.
What I learned:
- there is no cheap way to parse 25 TB at a time;
- watch the size of your Parquet files and how they are organized;
- partitions in Spark must be balanced;
- never, ever try to make 2.5 million partitions;
- sorting is still hard, and so is tuning Spark;
- sometimes special data calls for special solutions;
- Spark joins are fast, but partitioning is still expensive;
- don't sleep through the basics; someone probably solved your problem back in the 1980s;
- gnu parallel is a magical thing, everyone should use it;
- Spark likes uncompressed data and does not like combining partitions;
- Spark has too much overhead for solving simple problems;
- associative arrays in AWK are very efficient;
- you can access stdin and stdout from an R script, and therefore use it in a pipeline;
- thanks to its smart path implementation, S3 can handle a lot of files;
- the main reason for wasting time is optimizing your storage method prematurely;
- don't try to optimize jobs by hand, let the computer do it;
- the API should be simple, for ease and flexibility of use;
- if your data is well prepared, caching will be easy!
Source: www.habr.com