How to read this post: I apologize for the long and rambling text. To save you time, I open each section with a "What I learned" intro that sums up the gist of the section in one or two sentences.
"Just show me the solution!" If you only want to see where I ended up, skip ahead to the section "Getting more creative", but I think reading about the failures is both more interesting and more useful.
I was recently tasked with setting up a pipeline for processing a large volume of raw DNA sequencing data (technically a SNP chip). The requirement was to be able to quickly retrieve the data for a given genetic location (called a SNP) for subsequent modeling and other tasks. Using R and AWK, I was able to clean and organize the data in a natural way that massively sped up querying. This did not come easily to me and took many iterations. This post should help you avoid some of my mistakes and show you what I ended up with.
First, some background.
Data
Our university's genetics processing center provided us with the data as 25 TB of TSVs. I received it split into 5 batches, compressed with Gzip, each containing about 240 four-gigabyte files. Each row held the data for one SNP from one person. In total, data on ~2.5 million SNPs and ~60 thousand people was delivered. Besides the SNP information, the files had numerous numeric columns reflecting various characteristics, such as read intensity, the frequency of different alleles, and so on. In all there were about 30 columns with unique values.
Goal
As with any data management project, the most important thing was to determine how the data would be used. In this case we will mostly be fitting models and running workflows SNP by SNP. That is, we will only need data on one SNP at a time. I had to learn how to retrieve all the records associated with one of the 2.5 million SNPs as easily, quickly, and cheaply as possible.
How not to do this
To quote a fitting cliché:
I didn't fail a thousand times, I just discovered a thousand ways not to parse a pile of data into a query-friendly format.
First attempt
What I learned: There is no cheap way to parse 25 TB at a time.
Having taken the course "Advanced Methods for Big Data Processing" at Vanderbilt University, I was sure the trick was in the bag. It would take maybe an hour or two to set up a Hive server to run over all the data and report the result. Since our data is stored in AWS S3, I used the Athena service, which runs SQL queries directly against data in S3.
After I pointed Athena at my data and its format, I ran a few tests with queries like this:
select * from intensityData limit 10;
And I quickly got well-structured results. Done.
Until we tried to use the data in our work...
I was asked to pull all the information for a SNP to test a model on. I ran the query:
select * from intensityData
where snp = 'rs123456';
... and started waiting. After eight minutes and 4 TB of data scanned, I got the result. Athena charges by the volume of data scanned, $5 per terabyte. So this single query cost $20 and eight minutes of waiting. To run the model over all the data, we would have to wait 38 years and pay $50 million. Clearly, this did not suit us.
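The 38 years and $50 million follow from one query per SNP. A minimal sketch of that arithmetic in shell, using only the figures quoted above (the variable names are made up for this illustration):

```shell
#!/usr/bin/env bash
# Back-of-envelope cost of running one Athena query per SNP.
# All inputs are the figures quoted in the text; the names are illustrative.
n_snps=2500000          # ~2.5 million SNPs
tb_scanned=4            # TB scanned by a single query
usd_per_tb=5            # price per TB scanned
minutes_per_query=8     # wall-clock time of a single query

cost_per_query=$(( tb_scanned * usd_per_tb ))
total_cost=$(( cost_per_query * n_snps ))
# Queries run back to back: minutes -> hours -> days -> years
total_years=$(awk -v n="$n_snps" -v m="$minutes_per_query" \
  'BEGIN { printf "%.0f", n * m / 60 / 24 / 365 }')

echo "per query: \$${cost_per_query}"
echo "all SNPs:  \$${total_cost}, about ${total_years} years"
```

The point of spelling it out is that the per-query cost looks harmless until it is multiplied by the number of SNPs.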
Parquet was needed...
What I learned: Be careful with the size of your Parquet files and how they are organized.
I first tried to fix the situation by converting all the TSVs into Parquet files. I set up a simple job to do the conversion.
Interestingly, Parquet's default (and recommended) compression type, snappy, is not splittable. So each executor was stuck with the task of unpacking and loading a full 3.5 GB dataset.
Understanding the problem
What I learned: Sorting is hard, especially when the data is distributed.
It seemed to me that I now understood the heart of the problem. I only needed to sort the data by the SNP column rather than by person. Then a handful of SNPs would be stored in each chunk of data, and Parquet's "smart" feature of "only open the chunk if the value is in range" would show itself in all its glory. Unfortunately, sorting billions of rows scattered across a cluster turned out to be a hard task.
Me taking an algorithms class in college: "Ugh, no one cares about the computational complexity of all these sorting algorithms"
Me trying to sort on a column in a 20TB #spark table: "Why is this taking so long?" #DataScience struggles.
Nick Strayer (@NicholasStrayer), March 11, 2019
AWS is definitely not going to issue a refund for the reason "I'm an absent-minded student". After I ran the sort on Amazon Glue, it ran for 2 days and then crashed.
What about partitioning?
What I learned: Partitions in Spark need to be balanced.
Then I came up with the idea of partitioning the data by chromosome. There are 23 of them (and several more if you count mitochondrial DNA and unmapped regions).
This would split the data into smaller pieces. If you add just one line, partition_by = "chr", to the Spark export function in the Glue script, the data should be sorted into buckets.
The genome is made up of numerous pieces called chromosomes.
Unfortunately, it didn't work. Chromosomes have different sizes, which means different amounts of data. As a result, the tasks Spark sent to the workers were not balanced and completed slowly, because some nodes finished early and sat idle. Still, the tasks did complete. But when querying a single SNP, the imbalance caused problems again. The cost of processing SNPs on the larger chromosomes (that is, exactly where we want to get data) dropped only by about a factor of 10. A lot, but not enough.
What if we partition into even smaller pieces?
What I learned: Never, ever try to make 2.5 million partitions.
I decided to go all out and partition on each SNP. This guaranteed that the partitions were equal in size. IT WAS A BAD IDEA. I used Glue and added the innocent line partition_by = 'snp'. The job started and began running. A day later I checked and saw that nothing had been written to S3 yet, so I killed the job. It looks like Glue was writing intermediate files to a hidden location in S3, a lot of files, perhaps a couple of million. As a result, my mistake cost more than a thousand dollars and did not please my advisor.
Partitioning + sorting
What I learned: Sorting is still hard, as is tuning Spark.
My final attempt at partitioning involved partitioning by chromosome and then sorting each partition. In theory this would speed up each query, because the desired SNP data would have to sit within a few Parquet chunks inside a given range. Alas, sorting even the partitioned data turned out to be hard. As a result, I switched to EMR for a custom cluster and used eight powerful instances (C5.4xl) and Sparklyr to build a more flexible workflow...
# Sparklyr snippet to partition by chr and sort w/in partition
library(sparklyr)
library(dplyr)

raw_data %>%
  group_by(chr) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr')
  )
...however, the job still didn't finish. I tuned it in various ways: I increased the memory allocation for each query executor, used nodes with more memory, and used broadcast variables, but each time these turned out to be half-measures, and gradually the executors began to fail until everything stopped.
Update: so it begins.
pic.twitter.com/agY4GU2ru5 Nick Strayer (@NicholasStrayer)
May 15, 2019
Getting more creative
What I learned: Sometimes special data needs special solutions.
Every SNP has a position value. It is a number corresponding to how many bases along its chromosome it lies. This is a nice, natural way to organize our data. At first I wanted to partition by regions of each chromosome, for example positions 1 - 2000, 2001 - 4000, and so on. But the problem is that SNPs are not evenly distributed along the chromosomes, so the group sizes would vary greatly.
As a result, I settled on binning by rank of position. Using the data already loaded, I ran a query to get a list of the unique SNPs, their positions, and their chromosomes. Then I sorted the data within each chromosome and collected the SNPs into bins of a given size, say 1000 SNPs each. This gave me a mapping from each SNP to a bin within its chromosome.
In the end I made bins of 75 SNPs; the reason is explained below.
snp_to_bin <- unique_snps %>%
  group_by(chr) %>%
  arrange(position) %>%
  mutate(
    rank = 1:n(),
    bin = floor(rank/snps_per_bin)
  ) %>%
  ungroup()
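The same rank-based binning can be sketched with sort and AWK. This is only an illustration: the three-column input layout (snp_name,chr,position), the function name assign_bins, and the chr_bin label format are assumptions, not the author's actual files.

```shell
#!/usr/bin/env bash
# Assign rank-based bins within each chromosome.
# Assumed (hypothetical) input layout: snp_name,chr,position
assign_bins() {
  local snps_per_bin="$1"
  # Order by chromosome, then numerically by position within it
  sort -t, -k2,2 -k3,3n |
    awk -F, -v OFS=, -v per_bin="$snps_per_bin" '
      ($2 "") != (prev_chr "") { rank = 0; prev_chr = $2 }  # new chromosome: restart rank
      {
        rank++
        bin = int(rank / per_bin)      # same formula as floor(rank/snps_per_bin)
        print $1, $2 "_" bin           # snp_name, chr_bin label
      }'
}

printf '%s\n' 'rs5,2,40' 'rs1,1,100' 'rs3,1,900' 'rs2,1,250' 'rs4,2,10' |
  assign_bins 2
```

Because the rank starts at 1, the first bin on each chromosome holds one SNP fewer than the others, which matches the floor(rank/snps_per_bin) formula in the R code.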
First try with Spark
What I learned: Joins in Spark are fast, but partitioning is still expensive.
I wanted to read this small (2.5 million rows) data frame into Spark, join it with the raw data, and then partition by the newly added bin column.
# Join the raw data with the snp bins
data_w_bin <- raw_data %>%
  left_join(sdf_broadcast(snp_to_bin), by = 'snp_name') %>%
  group_by(chr_bin) %>%
  arrange(Position) %>%
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr_bin')
  )
I used sdf_broadcast() so that Spark knows it should send the data frame to every node. This is useful when the data is small and needed by all tasks. Otherwise Spark tries to be clever and distributes the data as needed, which can cause slowdowns.
And again, my idea didn't work: the jobs ran for a while, finished the join, and then, as the partitioning kicked in, the executors started failing.
Adding AWK
What I learned: Don't sleep when you're being taught the basics. Someone surely already solved your problem back in the 1980s.
Up to this point, the cause of all my failures with Spark was the jumble of data in the cluster. Perhaps the situation could be improved with some preprocessing. I decided to try splitting the raw text data by the chromosome column, hoping to hand Spark "pre-partitioned" data.
I searched StackOverflow for how to split a file by the values of a column and found an answer: in AWK the writing can happen inside the script itself instead of the results going to stdout.
I wrote a Bash script to try it. It downloaded one of the gzipped TSVs, then unpacked it with gzip and piped it to awk.
gzip -dc path/to/chunk/file.gz |
awk -F '\t' \
'{print $1",..."$30 > ("chunked/"$chr"_chr"$15".csv")}'
It worked!
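To make the trick concrete, here is a toy version that can be run without any real data. It is a sketch under assumptions: the three-column layout with the chromosome in column 2, the function name split_by_column, and the output file names are all invented for the example.

```shell
#!/usr/bin/env bash
# Split a tab-separated stream into one file per value of a column.
# Toy layout (an assumption): snp_name <TAB> chromosome <TAB> intensity
split_by_column() {
  local col="$1" out_dir="$2"
  mkdir -p "$out_dir"
  awk -F '\t' -v col="$col" -v dir="$out_dir" '
    {
      out = dir "/chr" $col ".tsv"   # file name built from the column value
      print > out                    # the write happens inside AWK, not via stdout
    }'
}

work_dir=$(mktemp -d)
printf 'rs1\t1\t0.91\nrs2\t2\t0.15\nrs3\t1\t0.40\n' |
  split_by_column 2 "$work_dir"
ls "$work_dir"
```

AWK keeps each output file open until the script ends, so this pattern suits a column with a modest number of distinct values, such as chromosomes.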
Filling the cores
What I learned: gnu parallel is a magical thing; everyone should use it.
The splitting was rather slow, and when I started htop to check on the usage of the powerful (and expensive) EC2 instance, it turned out I was using just one core and about 200 MB of memory. To solve the problem without losing a lot of money, I had to figure out how to parallelize the work. Fortunately, in an absolutely amazing book I came across gnu parallel, a very flexible way to run work in parallel on Unix.
When I ran the splitting with the new process, everything was fine, but there was still a bottleneck: downloading S3 objects to disk was not very fast and not fully parallelized. To fix this, I did the following:
- I found out that the S3 download step can be run directly in the pipeline, completely eliminating intermediate storage on disk. This means I can avoid writing raw data to disk and use even smaller, and therefore cheaper, storage on AWS.
- The command aws configure set default.s3.max_concurrent_requests 50 greatly increased the number of threads the AWS CLI uses (by default there are 10).
- I switched to an EC2 instance optimized for network speed, with the letter n in its name. I found that the loss of compute power when using n-instances is more than made up for by the increase in download speed. For most tasks I used c5n.4xl.
- I swapped gzip for pigz, a gzip tool that can do cool things to parallelize the inherently unparallelizable task of decompressing files (this helped the least).
# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50

for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do

  aws s3 cp s3://$batch_loc$chunk_file - |
  pigz -dc |
  parallel --block 100M --pipe \
  "awk -F '\t' '{print \$1\",...\"\$30 > (\"chunked/{#}_chr\"\$15\".csv\")}'"

  # Combine all the parallel process chunks to single files
  ls chunked/ |
  cut -d '_' -f 2 |
  sort -u |
  parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'

  # Clean up intermediate data
  rm chunked/*
done
These steps combine to make everything run very fast. By increasing the download speed and eliminating disk writes, I could now process a 5-terabyte batch in just a few hours.
There's nothing sweeter than seeing all the cores you're paying for on AWS being used. Thanks to gnu-parallel I can unzip and split a 19gig csv just as fast as I can download it. I couldn't even get spark to run this.
#DataScience #Linux pic.twitter.com/Nqyba2zqEk Nick Strayer (@NicholasStrayer)
May 17, 2019
This tweet should have said 'TSV'. Alas.
Using the newly parsed data
What I learned: Spark likes uncompressed data and does not like combining partitions.
Now the data was in S3 in an uncompressed (read: splittable) and organized format, and I could go back to Spark again. A surprise awaited me: I again failed to get what I wanted! It was very hard to tell Spark exactly how the data was partitioned. And even when I did, it turned out there were too many partitions (95 thousand), and when I used coalesce to bring their number down to something reasonable, it destroyed my partitioning. I'm sure this can be fixed, but after a few days of searching I couldn't find a solution. In the end I did finish all the tasks in Spark, although it took a while and the split Parquet files were not very small (~200 KB). Still, the data was where it needed to be.
Too small and uneven, wonderful!
Testing local Spark queries
What I learned: Spark has too much overhead when solving simple problems.
With the data loaded in a smart format, I could test the speed. I set up an R script to spin up a local Spark server and then loaded a Spark data frame from the Parquet storage location of the specified bin. I tried to load all the data but couldn't get Sparklyr to recognize the partitioning.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
desired_snp <- 'rs34771739'
# Start a timer
start_time <- Sys.time()
# Load the desired bin into Spark
# (get_snp_location() is the author's helper; its definition is not shown)
intensity_data <- sc %>%
  spark_read_parquet(
    name = 'intensity_data',
    path = get_snp_location(desired_snp),
    memory = FALSE)
# Subset bin to snp and then collect to local
test_subset <- intensity_data %>%
  filter(SNP_Name == desired_snp) %>%
  collect()
print(Sys.time() - start_time)
Execution took 29.415 seconds. Much better, but not great for mass testing of anything. On top of that, I couldn't speed things up with caching, because whenever I tried to cache the data frame in memory, Spark crashed, even when I gave it more than 50 GB of memory for a dataset that weighed less than 15.
Back to AWK
What I learned: Associative arrays in AWK are very efficient.
I realized I could get more speed. I remembered that AWK has a wonderful feature, associative arrays, that could match each row to its bin as the data streamed through.
To do this, I used a BEGIN block in the AWK script. This is a piece of code that runs before the first line of data is passed to the main body of the script.
join_data.awk
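BEGIN {
  FS=",";
  # chunk is passed in with -v; pull the batch number and chunk id out of it
  batch_num=substr(chunk,7,1);
  chunk_id=substr(chunk,15,2);
  # Load the snp -> bin lookup; stop at end of file or on a read error
  while((getline < "snp_to_bin.csv") > 0) {bin[$1] = $2}
}
{
  print $0 > ("chunked/chr_"chr"_bin_"bin[$1]"_"batch_num"_"chunk_id".csv")
}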
The while(getline...) command loads all the rows from the bin CSV, setting the first column (the SNP name) as the key of the associative array bin and the second column (the bin) as the value. Then, in the { } block, which runs on every line of the main file, each line is sent to an output file that gets a unique name depending on its bin: ..._bin_"bin[$1]"_... .
The variables batch_num and chunk_id matched the data supplied by the pipeline, which avoided a race condition: each execution thread started by parallel wrote to its own unique file.
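The lookup-then-route pattern is easy to try on toy data. The sketch below is an illustration only: the two tiny CSVs, their columns, and the bin_N.csv output names are assumptions made up for the example, not the real files.

```shell
#!/usr/bin/env bash
# Load a small lookup table into an AWK associative array in BEGIN, then use
# it to route each row of a larger stream. Files and columns are illustrative.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
mkdir -p chunked

printf 'rs1,0\nrs2,0\nrs3,1\n' > snp_to_bin.csv        # snp_name,bin

printf 'rs1,0.91\nrs3,0.40\nrs2,0.15\nrs1,0.88\n' |    # snp_name,intensity
  awk -F, '
    BEGIN {
      # (getline line < file) returns 1 per line read, 0 at EOF, -1 on error
      while ((getline line < "snp_to_bin.csv") > 0) {
        split(line, parts, ",")
        bin[parts[1]] = parts[2]       # key: SNP name, value: its bin
      }
      close("snp_to_bin.csv")
    }
    { print > ("chunked/bin_" bin[$1] ".csv") }'       # one output file per bin

ls chunked
```

Reading the lookup into a separate variable (line) leaves $0 untouched, and the array lookup replaces what would otherwise be a full join.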
Since I had already scattered all the raw data into folders by chromosome in my previous AWK experiment, I could now write another Bash script to process one chromosome at a time and send the more finely partitioned data to S3.
DESIRED_CHR='13'
# Download chromosome data from s3 and split into bins
aws s3 ls $DATA_LOC |
awk '{print $4}' |
grep 'chr'$DESIRED_CHR'.csv' |
parallel "echo 'reading {}'; aws s3 cp "$DATA_LOC"{} - | awk -v chr=""$DESIRED_CHR"" -v chunk="{}" -f split_on_chr_bin.awk"
# Combine all the parallel process chunks to single files and upload to rds using R
ls chunked/ |
cut -d '_' -f 4 |
sort -u |
parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"
rm chunked/*
The script has two parallel sections.
In the first section, data is read from all the files containing information on the desired chromosome, and this data is then distributed across threads, which sort the rows into the appropriate bins. To avoid race conditions when several threads write to the same file, AWK is passed the file names so that it writes data to different places, e.g. chr_10_bin_52_batch_2_aa.csv. As a result, a lot of small files are created on disk (for this I used terabyte EBS volumes).
The second parallel section runs through the bins and combines their individual files into a single CSV with cat, then sends them off for export.
Piping into R?
What I learned: You can access stdin and stdout from an R script, and therefore use it in a pipeline.
You may have noticed this line in the Bash script: ...cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R.... It pipes all the concatenated bin files into the R script below. {} is a special parallel technique that inserts whatever data it is sending to the given stream right into the command itself. The option {#} provides a unique job ID, and {%} represents the job slot number (repeated, but never at the same time). A full list of the options can be found in the documentation.
#!/usr/bin/env Rscript
library(readr)
library(aws.s3)
# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]
data_cols <- list(SNP_Name = 'c', ...)
s3saveRDS(
read_csv(
file("stdin"),
col_names = names(data_cols),
col_types = data_cols
),
object = data_destination
)
When file("stdin") is passed to readr::read_csv, the data piped into the R script is loaded into a data frame, which is then written as an .rds file directly to S3 using aws.s3.
RDS is something like a junior version of Parquet, without the frills of columnar storage.
After the Bash script finished, I had a bundle of .rds files sitting in S3, which let me use efficient compression and built-in types.
Despite using notoriously slow R, everything ran very quickly. Unsurprisingly, the parts of R that read and write data are highly optimized. After testing on a single medium-sized chromosome, the job finished on a C5n.4xl instance in about two hours.
S3 limitations
What I learned: Thanks to its smart path implementation, S3 can handle a lot of files.
I was worried about whether S3 could handle the large number of files being transferred to it. I could make the file names meaningful, but how would S3 look them up?
Folders in S3 are just for show; in fact the system is not interested in the / symbol.
It appears that S3 represents the path to a particular file as a simple key in a kind of hash table or document-based database. A bucket can be thought of as a table, and files can be considered records in that table.
Since speed and efficiency matter to how Amazon makes money, it's no surprise that this key-as-a-file-path system is freaking optimized. I tried to find a balance: few enough files that I didn't have to make lots of get requests, but small enough that each request ran quickly. It turned out that making about 20 thousand bin files works best. I think that if we kept optimizing we could get a further speedup (for example, by making a dedicated bucket just for the data, thereby reducing the size of the lookup table). But there was no time or money for further experiments.
What about cross-compatibility?
What I learned: The number one cause of wasted time is optimizing your storage method prematurely.
At this point it is very important to ask yourself: "Why use a proprietary file format?" The reason lies in loading speed (gzipped CSV files took 7 times longer to load) and compatibility with our workflows. I may reconsider if R can easily load Parquet (or Arrow) files without the Spark overhead. Everyone in our lab uses R, and if I need to convert the data to another format, I still have the original text data, so I can just run the pipeline again.
Dividing up the work
What I learned: Don't try to optimize jobs by hand; let the computer do it.
I had debugged the workflow on one chromosome; now I needed to process all the other data.
I wanted to spin up several EC2 instances for the conversion, but at the same time I was afraid of getting a badly unbalanced load across the different processing jobs (just as Spark suffered from unbalanced partitions). In addition, I was not keen on spinning up one instance per chromosome, because AWS accounts have a default limit of 10 instances.
So I decided to write a script in R to optimize the processing jobs.
First, I asked S3 to calculate how much storage space each chromosome takes up.
library(aws.s3)
library(tidyverse)
chr_sizes <- get_bucket_df(
bucket = '...', prefix = '...', max = Inf
) %>%
mutate(Size = as.numeric(Size)) %>%
filter(Size != 0) %>%
mutate(
# Extract chromosome from the file name
    chr = str_extract(Key, 'chr.{1,4}\\.csv') %>%
      str_remove_all('chr|\\.csv')
) %>%
group_by(chr) %>%
summarise(total_size = sum(Size)/1e+9) # Divide to get value in GB
# A tibble: 27 x 2
chr total_size
<chr> <dbl>
1 0 163.
2 1 967.
3 10 541.
4 11 611.
5 12 542.
6 13 364.
7 14 375.
8 15 372.
9 16 434.
10 17 443.
# … with 17 more rows
Then I wrote a function that takes the total sizes, shuffles the order of the chromosomes, splits them into num_jobs groups, and reports how much the sizes of all the processing jobs vary.
num_jobs <- 7
# How big would each job be if perfectly split?
job_size <- sum(chr_sizes$total_size)/num_jobs
shuffle_job <- function(i){
chr_sizes %>%
sample_frac() %>%
mutate(
cum_size = cumsum(total_size),
job_num = ceiling(cum_size/job_size)
) %>%
group_by(job_num) %>%
summarise(
job_chrs = paste(chr, collapse = ','),
total_job_size = sum(total_size)
) %>%
mutate(sd = sd(total_job_size)) %>%
nest(-sd)
}
shuffle_job(1)
# A tibble: 1 x 2
sd data
<dbl> <list>
1 153. <tibble [7 × 3]>
Then I ran a thousand shuffles using purrr and picked the best one.
1:1000 %>%
map_df(shuffle_job) %>%
filter(sd == min(sd)) %>%
pull(data) %>%
pluck(1)
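The shuffle-and-pick-the-best idea does not depend on R. Below is a rough shell and AWK sketch of it, not the author's code: the function names shuffle, split_jobs and job_spread are invented here, the sizes are the ten rows of the chr_sizes table printed above, and the job count of 3 and the 200 shuffles are arbitrary choices for the demo.

```shell
#!/usr/bin/env bash
# Shuffle-and-split load balancing. Input lines: "<chr> <size_gb>".

shuffle() {        # reproducible shuffle: prefix a random key, sort on it, drop it
  awk -v seed="$1" 'BEGIN { srand(seed) } { printf "%.9f\t%s\n", rand(), $0 }' |
    sort -n | cut -f2-
}

split_jobs() {     # prints "<job_num> <total_gb> <chrs>" for each job
  awk -v num_jobs="$1" '
    { chr[NR] = $1; size[NR] = $2; total += $2 }
    END {
      job_size = total / num_jobs
      for (i = 1; i <= NR; i++) {
        cum += size[i]
        job = int(cum / job_size)
        if (job * job_size < cum) job++          # ceiling(cum / job_size)
        if (job > num_jobs) job = num_jobs       # guard against rounding on the last row
        if (!(job in chrs)) { order[++n] = job; chrs[job] = chr[i] }
        else chrs[job] = chrs[job] "," chr[i]
        job_total[job] += size[i]
      }
      for (j = 1; j <= n; j++) print order[j], job_total[order[j]], chrs[order[j]]
    }'
}

job_spread() {     # sample standard deviation of the job totals (column 2)
  awk '{ s += $2; ss += $2 * $2; n++ }
       END { if (n < 2) { print "0.000"; exit }
             v = (ss - s * s / n) / (n - 1); if (v < 0) v = 0
             printf "%.3f\n", sqrt(v) }'
}

sizes='0 163
1 967
10 541
11 611
12 542
13 364
14 375
15 372
16 434
17 443'

best_sd=""; best_plan=""
for i in $(seq 1 200); do
  plan=$(printf '%s\n' "$sizes" | shuffle "$i" | split_jobs 3)
  sd=$(printf '%s\n' "$plan" | job_spread)
  # Keep this plan if it is the first one or has a smaller spread than the best so far
  if [ -z "$best_sd" ] || awk -v a="$sd" -v b="$best_sd" 'BEGIN { exit !(a < b) }'; then
    best_sd=$sd; best_plan=$plan
  fi
done
printf '%s\n' "$best_plan"
```

Random search like this gives no optimality guarantee, but with a few dozen items and a few hundred tries it tends to land on a split that is balanced enough.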
So I ended up with a set of jobs that were very similar in size. After that, all that remained was to wrap my previous Bash script in a big for loop. This optimization took about 10 minutes to write, which is far less than I would have spent on creating jobs by hand if they had turned out unbalanced. So I think I got this bit of early optimization right.
for DESIRED_CHR in "16" "9" "7" "21" "MT"
do
# Code for processing a single chromosome
done
At the end I added a shutdown command:
sudo shutdown -h now
... and everything worked! Using the AWS CLI, I spun up the instances and used the user_data option to hand them the Bash scripts of their jobs to run. They ran and shut down automatically, so I wasn't paying for extra compute.
aws ec2 run-instances ... \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<<job_name>>}]" \
--user-data file://<<job_script_loc>>
Let's package it up!
What I learned: The API should be simple for the sake of ease and flexibility of use.
Finally I had the data in the right place and form. All that remained was to simplify the process of using the data as much as possible to make things easier for my colleagues. I wanted to make a simple API for making requests. If in the future I decide to switch from .rds to Parquet files, that should be a problem for me, not for my colleagues. For this I decided to make an internal R package.
I built and documented a very simple package containing just a few data access functions organized around the function get_snp. I also built a website for my colleagues.
Smart caching
What I learned: If your data is well organized, caching will be easy!
Since one of the main workflows applied the same analysis model to a batch of SNPs, I decided to use the binning to my advantage. When data for a SNP is delivered, all the information from its bin is attached to the returned object. That is, old queries can (in theory) speed up the processing of new queries.
# Part of get_snp()
...
# Test if our current snp data has the desired snp.
already_have_snp <- desired_snp %in% prev_snp_results$snps_in_bin
if(!already_have_snp){
# Grab info on the bin of the desired snp
snp_results <- get_snp_bin(desired_snp)
# Download the snp's bin data
snp_results$bin_data <- aws.s3::s3readRDS(object = snp_results$data_loc)
} else {
# The previous snp data contained the right bin so just use it
snp_results <- prev_snp_results
}
...
While building the package, I ran a lot of benchmarks to compare the speed of different methods. I recommend not skipping this, because sometimes the results are unexpected. For example, dplyr::filter was much faster than grabbing rows using index-based filtering, and retrieving a single column from a filtered data frame was much faster than using indexing syntax.
Note that the prev_snp_results object contains the key snps_in_bin. This is an array of all the unique SNPs in the bin, which lets you quickly check whether you already have the data from a previous query. It also makes it easy to loop through all the SNPs in a bin with this code:
# Get bin-mates
snps_in_bin <- my_snp_results$snps_in_bin
for(current_snp in snps_in_bin){
my_snp_results <- get_snp(current_snp, my_snp_results)
# Do something with results
}
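The caching logic itself is small enough to sketch outside R. In the shell sketch below everything is hypothetical: fetch_bin stands in for the S3 download and only counts calls, and the cache is keyed on the bin id rather than on a snps_in_bin list, which amounts to the same check.

```shell
#!/usr/bin/env bash
# Bin-level caching: fetch a bin only when the wanted SNP is not in the bin
# already held. fetch_bin is a stand-in for the S3 download; names are hypothetical.
work_dir=$(mktemp -d)
printf 'rs1,0\nrs2,0\nrs3,1\n' > "$work_dir/snp_to_bin.csv"   # snp_name,bin
downloads=0
cached_bin=""

fetch_bin() {            # pretend download: count the call and remember the bin
  downloads=$(( downloads + 1 ))
  cached_bin="$1"
}

get_snp() {
  local snp="$1" wanted_bin
  # Look up which bin the SNP lives in
  wanted_bin=$(awk -F, -v snp="$snp" '$1 == snp { print $2; exit }' \
    "$work_dir/snp_to_bin.csv")
  if [ "$wanted_bin" != "$cached_bin" ]; then
    fetch_bin "$wanted_bin"          # cache miss: go and get the bin
  fi                                 # cache hit: reuse what we already hold
}

get_snp rs1; get_snp rs2; get_snp rs3
echo "downloads: $downloads"
```

Three lookups cost two downloads here because rs1 and rs2 share a bin; looping over bin-mates in order is what makes the cache pay off.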
Results
Now we can (and have started in earnest to) run models and scenarios that were previously out of reach for us. The best part is that my lab colleagues don't have to think about any of the complications. They just have a function that works.
And although the package spares them the details, I tried to make the data format simple enough that they could figure it out if I suddenly disappeared tomorrow...
The speed has increased noticeably. We typically scan functionally significant chunks of the genome. Previously we couldn't do this (it was too expensive), but now, thanks to the bin structure and caching, a request for one SNP takes on average less than 0.1 seconds, and data usage is so low that the S3 costs are peanuts.
I recently took on wrangling 25+ TB of raw genotyping data for my lab. When I started, using spark took 8 min & cost $20 to query a SNP. After using AWK + R to process, it now takes less than a 10th of a second and costs $0.00001. My personal #BigData win. pic.twitter.com/ANOXVGrmkk Nick Strayer (@NicholasStrayer)
May 30, 2019
Conclusion
This post is not a guide at all. The solution turned out to be bespoke, and almost certainly not optimal. Rather, it is a travelogue. I want others to understand that solutions like this don't appear fully formed in one's head; they are the result of trial and error. Also, if you are looking to hire a data scientist, keep in mind that using these tools effectively requires experience, and experience costs money. I'm glad I had the means to pay for it, but many others who could do the same job better than me will never get the chance, for lack of money to even try.
Big data tools are general-purpose. If you have the time, you can almost certainly write a faster solution using smart data cleaning, storage, and extraction techniques. Ultimately it comes down to a cost-benefit analysis.
What I learned:
- There is no cheap way to parse 25 TB at a time;
- Be careful with the size of your Parquet files and how they are organized;
- Partitions in Spark need to be balanced;
- Never, ever try to make 2.5 million partitions;
- Sorting is still hard, as is tuning Spark;
- Sometimes special data needs special solutions;
- Joins in Spark are fast, but partitioning is still expensive;
- Don't sleep when you're being taught the basics; someone probably already solved your problem back in the 1980s;
- gnu parallel is a magical thing; everyone should use it;
- Spark likes uncompressed data and does not like combining partitions;
- Spark has too much overhead when solving simple problems;
- AWK's associative arrays are very efficient;
- You can access stdin and stdout from an R script, and therefore use it in a pipeline;
- Thanks to its smart path implementation, S3 can handle a lot of files;
- The number one cause of wasted time is optimizing your storage method prematurely;
- Don't try to optimize jobs by hand; let the computer do it;
- The API should be simple for the sake of ease and flexibility of use;
- If your data is well organized, caching will be easy!
Source: www.habr.com