Parsing 25TB with AWK and R

How to read this article: I apologize for the long and chaotic text. To save you time, I begin each chapter with a "What I learned" introduction that sums up the essence of the chapter in one or two sentences.

"Just show me the solution!" If you only want to see where I ended up, skip ahead to the chapter "Getting more inventive," but I think the failures are interesting and worth reading about.

I was recently tasked with setting up a pipeline for processing a large volume of raw DNA sequencing data (technically, a SNP chip). The goal was to be able to quickly retrieve the data for a given genetic location (called a SNP) for subsequent modeling and other tasks. Using R and AWK, I was able to clean and organize the data in a natural way, dramatically speeding up query processing. This was not easy for me and took many iterations. This article will help you avoid some of my mistakes and show you what I ended up with.

First, some introductory explanations.

The data

Our university's genetics processing center delivered the data as 25 TB of TSVs. I received them split into 5 packages, compressed with Gzip, each containing about 240 four-gigabyte files. Each row held the data for one SNP from one person. In total, data on ~2.5 million SNPs and ~60 thousand people was transferred. In addition to the SNP information, the files had numerous columns of numbers reflecting various characteristics, such as read intensity, frequency of different alleles, and so on. In total there were about 30 columns with distinct values.

The goal

As with any data management project, the most important thing was to determine how the data would be used. In this case, we will mostly be fitting models and running workflows on a per-SNP basis. That is, we will only need the data for one SNP at a time. I had to learn how to retrieve all the records associated with one of the 2.5 million SNPs as easily, quickly, and cheaply as possible.

How not to do it

To quote a fitting cliché:

I didn't fail a thousand times; I just discovered a thousand ways not to parse a dataset into a query-friendly format.

First attempt

What I learned: there is no cheap way to parse 25 TB at a time.

After taking a course called "Advanced Methods for Big Data Processing" at Vanderbilt University, I was sure the trick was in the bag. It would take maybe an hour or two to set up a Hive server to run over all the data and report the result. Since our data is stored in AWS S3, I used the Athena service, which lets you run Hive SQL queries against S3 data. You don't need to set up or spin up a Hive cluster, and you pay only for the data scanned.

After I pointed Athena at my data and described its format, I ran a few tests with queries like:

select * from intensityData limit 10;

And it quickly returned well-formed results. Great.

Until we tried to use the data in our actual work...

I was asked to pull all the information for one SNP to test a model on. I ran the query:


select * from intensityData 
where snp = 'rs123456';

... and started waiting. After eight minutes and more than 4 TB of data scanned, I got the result. Athena charges by the volume of data scanned, at $5 per terabyte. So this single query cost $20 and eight minutes of waiting. To run a model over all the data, we would have had to wait 38 years and pay $50 million. Obviously, this wasn't going to work for us.
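As a rough sanity check on those numbers (the ~$20 per scan and eight minutes per query come from above; everything else is derived arithmetic, not from the original article):

```shell
# Back-of-envelope check: one full-scan query per SNP.
queries=2500000            # ~2.5 million SNPs
cost_per_query=20          # dollars: ~4 TB scanned * $5/TB
minutes_per_query=8

total_cost=$((queries * cost_per_query))
total_years=$((queries * minutes_per_query / 60 / 24 / 365))

echo "total cost: \$$total_cost"        # $50000000
echo "serial runtime: ~$total_years years"  # ~38 years
```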

Parquet was needed...

What I learned: watch the size and organization of your Parquet files.

I first tried to fix the situation by converting all the TSVs to Parquet files. Parquet is well suited to large datasets because the information in it is stored in columnar form: each column lives in its own memory/disk segment, unlike text files, where each row contains elements of every column. So if you need to look something up, you read only the required column. In addition, each file stores the range of values in a column, so if the value you're looking for falls outside the column's range, Spark won't waste time scanning the whole file.
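The range-skipping idea can be sketched outside of Parquet with a hand-built index. In this toy example (made-up chunk files and a sidecar min/max index, illustrative only, not the real pipeline), a query first consults the index and opens only the chunks whose position range could contain the target:

```shell
tmp=$(mktemp -d)

# Two fake "chunks" of name,position rows, plus an index of min/max positions.
printf 'rs1,100\nrs2,900\n'   > "$tmp/chunk_a.csv"
printf 'rs3,5000\nrs4,9000\n' > "$tmp/chunk_b.csv"
printf 'chunk_a.csv 100 900\nchunk_b.csv 5000 9000\n' > "$tmp/index.txt"

target=5000
# Keep only chunks whose [min,max] range can contain the target position.
hits=$(awk -v t="$target" 't >= $2 && t <= $3 {print $1}' "$tmp/index.txt")
result=""
for f in $hits; do
  result=$(grep ",${target}\$" "$tmp/$f")
done
echo "scanned: $hits"    # scanned: chunk_b.csv  (chunk_a is never opened)
echo "found:   $result"  # found:   rs3,5000
```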

I ran a simple AWS Glue job to convert our TSVs to Parquet and dropped the new files into Athena. It took about 5 hours. But when I ran the query, it took roughly the same amount of time and slightly less money to complete. The problem was that Spark, trying to optimize the job, simply unpacked each TSV chunk and placed it in its own Parquet chunk. And because each chunk was large enough to hold the complete records of many people, every file contained all the SNPs, so Spark had to open all of the files to extract the information it needed.

Curiously, Parquet's default (and recommended) compression type, snappy, is not splittable. So each executor was stuck with the task of decompressing and downloading the entire 3.5 GB dataset.


Understanding the problem

What I learned: sorting is hard, especially when the data is distributed.

It seemed to me that I now understood the essence of the problem. All I needed was to sort the data by the SNP column, not by person. Then a handful of SNPs would live in any given data chunk, and Parquet's smart "open only if the value is in range" feature would show itself in all its glory. Unfortunately, sorting billions of rows scattered across a cluster turned out to be a difficult task.

And AWS certainly doesn't refund money on the grounds of "I'm an absent-minded student." After I launched the sort on Amazon Glue, it ran for 2 days and then crashed.

What about partitioning?

What I learned: partitions in Spark must be balanced.

Then I came up with the idea of partitioning the data by chromosome. There are 23 of them (plus several more if you count mitochondrial DNA and unmapped regions).
This would let me split the data into smaller chunks. By adding a single line to the Spark export function in the Glue script, partition_by = "chr", the data should be sorted into buckets.

The genome is made up of numerous fragments called chromosomes.

Unfortunately, it didn't work out. Chromosomes have different sizes, which means different amounts of data. This means that the tasks Spark sent to the workers were not balanced and completed slowly, because some nodes finished early and then sat idle. The jobs did complete, though. But when requesting a single SNP, the imbalance again caused problems. The cost of processing SNPs on the larger chromosomes (that is, exactly where we want to get data from) only dropped by about a factor of 10. A lot, but not enough.

What if we partition into even smaller chunks?

What I learned: never, ever try to make 2.5 million partitions.

I decided to go all in and partition by each SNP. This guaranteed partitions of equal size. IT WAS A BAD IDEA. I used Glue and added the innocent line partition_by = 'snp'. The job started and began running. A day later I checked and saw that nothing had yet been written to S3, so I killed the job. It turned out Glue had been writing intermediate files to a hidden S3 location, and a lot of them, maybe a couple million. As a result, my mistake cost more than a thousand dollars and did not please my advisor.

Partitioning + sorting

What I learned: sorting is still hard, and so is tuning Spark.

My last attempt at partitioning was to partition by chromosome and then sort each partition. In theory, this would speed up each query, because the desired SNP data would have to lie within a few Parquet chunks inside a given range. Alas, sorting even the partitioned data proved to be a difficult task. As a result, I switched to EMR with a custom cluster, used eight powerful instances (C5.4xl) and sparklyr to build a more flexible workflow...

# Sparklyr snippet to partition by chr and sort w/in partition
# Join the raw data with the snp bins
raw_data %>%
  group_by(chr) %>%
  arrange(Position) %>% 
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr')
  )

...but the job still never finished. I tried everything: increased the memory allocated to each query executor, used nodes with large amounts of memory, used broadcast variables, but each time these turned out to be half-measures, and gradually the executors began to fail one by one until everything ground to a halt.

Getting more inventive

What I learned: sometimes special data calls for special solutions.

Every SNP has a position value: the number of bases along its chromosome at which it lies. This is a nice, natural way to organize our data. At first I wanted to partition by regions of each chromosome. For example, positions 1 - 2000, 2001 - 4000, and so on. But the problem is that SNPs are not evenly distributed along the chromosomes, so the group sizes would vary wildly.
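A toy awk comparison (the positions are hypothetical, not the real data) shows the imbalance: fixed 2000-base windows give wildly different counts, while rank-based bins of equal size are balanced by construction:

```shell
# Ten hypothetical SNP positions on one chromosome, heavily clustered.
positions="10 20 30 40 50 60 70 80 5000 9000"

# Fixed 2000-base windows: count SNPs per window (floor(pos/2000)).
fixed=$(echo "$positions" | tr ' ' '\n' | awk '{print int($1/2000)}' |
        sort -n | uniq -c | awk '{print $1}' | paste -s -d ' ' -)
echo "SNPs per fixed window: $fixed"   # 8 1 1  -> badly unbalanced

# Rank-based bins of 5 SNPs each: sort by position, bin by rank.
rank=$(echo "$positions" | tr ' ' '\n' | sort -n |
       awk '{print int((NR-1)/5)}' | uniq -c | awk '{print $1}' | paste -s -d ' ' -)
echo "SNPs per rank bin:     $rank"    # 5 5  -> balanced by construction
```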


As a result, I arrived at binning positions by rank. Using the data I had already downloaded, I ran a query for the list of unique SNPs, their positions, and their chromosomes. I then sorted the data within each chromosome and collected the SNPs into groups (bins) of a given size. Say, 1000 SNPs each. This gave me a SNP-to-bin-per-chromosome mapping.

In the end I made bins of 75 SNPs; the reason will be explained below.

snp_to_bin <- unique_snps %>% 
  group_by(chr) %>% 
  arrange(position) %>% 
  mutate(
    rank = 1:n(),
    bin = floor(rank/snps_per_bin)
  ) %>% 
  ungroup()

First try with Spark

What I learned: Spark joins are fast, but partitioning is still expensive.

I wanted to read this small (2.5 million row) data frame into Spark, join it with the raw data, and then partition by the newly added bin column.


# Join the raw data with the snp bins
data_w_bin <- raw_data %>%
  left_join(sdf_broadcast(snp_to_bin), by ='snp_name') %>%
  group_by(chr_bin) %>%
  arrange(Position) %>% 
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr_bin')
  )

I used sdf_broadcast() so that Spark knows it should send the data frame to all the nodes. This is useful when the data is small and is needed by every task. Otherwise, Spark tries to be smart and distributes the data as needed, which can cause slowdowns.

And again, my idea didn't work: the tasks ran for a while, completed the join, and then, like the executors launched by the partitioning, began to fail.

Adding AWK

What I learned: don't sleep through the basics. Someone probably already solved your problem back in the 1980s.

Up to this point, the cause of all my failures with Spark was the jumble of data in the cluster. Maybe the situation could be improved with preprocessing. I decided to try splitting the raw text data by chromosome column, hoping to hand Spark "pre-partitioned" data.

I searched StackOverflow for how to split by column values and found a great answer. With AWK you can split a text file by column values by doing the writing inside the script itself, rather than sending the results to stdout.

I wrote a Bash script to try it out. It downloaded one of the packed TSVs, then decompressed it with gzip and piped it to awk.

gzip -dc path/to/chunk/file.gz |
awk -F '\t' \
'{print $1",..."$30 > "chunked/chr_"$15".csv"}'

It worked!

Filling the cores

What I learned: gnu parallel is a magical thing; everyone should use it.

The splitting was rather slow, and when I ran htop to check on the powerful (and expensive) EC2 instance, it turned out I was using only one core and about 200 MB of memory. To solve the problem without losing a lot of money, I had to figure out how to parallelize the work. Luckily, in Jeroen Janssens' absolutely amazing book Data Science at the Command Line, I found a chapter on parallelization. From it I learned about gnu parallel, a very flexible way to implement multithreading in Unix.
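For readers without gnu parallel installed, the same fan-out idea can be sketched with xargs -P, which likewise spreads a list of inputs over several worker processes (the per-file "work" below is a trivial placeholder, not the real pipeline):

```shell
tmp=$(mktemp -d)

# Fan four "chunks" out over up to 4 worker processes; each worker
# writes its own uniquely named output file, so there are no races.
printf '%s\n' a b c d |
  xargs -P 4 -I {} sh -c 'echo "processed {}" > '"$tmp"'/out_{}.txt'

ls "$tmp" | sort | paste -s -d ' ' -   # out_a.txt out_b.txt out_c.txt out_d.txt
cat "$tmp"/out_a.txt                   # processed a
```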

When I launched the splitting with the new pipeline, everything was fine, but there was still a bottleneck: downloading S3 objects to disk was not very fast and not fully parallelized. To fix this, I did the following:

  1. I found out that the S3 download stage can be implemented directly in the pipeline, completely eliminating intermediate disk storage. This meant I could avoid writing the raw data to disk and use smaller, and therefore cheaper, storage on AWS.
  2. The command aws configure set default.s3.max_concurrent_requests 50 greatly increased the number of threads the AWS CLI uses (by default there are 10).
  3. I switched to an EC2 instance optimized for network speed, with the letter n in the name. I found that the loss of compute power on n-instances is more than offset by the increase in download speed. For most tasks I used c5n.4xl.
  4. I swapped gzip for pigz, a gzip tool that can do cool things to parallelize the inherently non-parallelizable task of decompressing files (this helped the least).

# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50

for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do

        aws s3 cp s3://$batch_loc$chunk_file - |
        pigz -dc |
        parallel --block 100M --pipe \
        "awk -F '\t' '{print \$1\",...\"\$30 > \"chunked/{#}_chr\"\$15\".csv\"}'"

       # Combine all the parallel process chunks to single files
        ls chunked/ |
        cut -d '_' -f 2 |
        sort -u |
        parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'
        
         # Clean up intermediate data
       rm chunked/*
done

All these steps combined made everything very fast. By increasing the download speed and eliminating disk writes, I could now process a 5 terabyte package in just a few hours.

This tweet should have said "TSV". Alas.

Using the freshly parsed data

What I learned: Spark likes uncompressed data and does not like combining partitions.

Now the data was in S3 in an unpacked (read: splittable) and semi-sorted format, and I could return to Spark. A surprise awaited me: I again failed to achieve what I wanted! It was very hard to tell Spark exactly how the data was partitioned. And even when I did, it turned out there were too many partitions (95 thousand), and when I used coalesce to reduce their number to reasonable limits, it destroyed my partitioning. I'm sure this can be fixed, but after a couple of days of searching I couldn't find a solution. In the end I finished all the tasks in Spark, although it took a while and my split Parquet files were not exactly small (~200 KB). However, the data was where it needed to be.

Too small and uneven, wonderful!

Testing local Spark queries

What I learned: Spark has too much overhead for simple tasks.

With the data in a clever format, I could test the speed. I set up an R script to launch a local Spark server and then load a Spark data frame from the specified bin's Parquet storage. I tried loading all the data, but could not get sparklyr to recognize the partitioning.

sc <- spark_connect(master = "local")

desired_snp <- 'rs34771739'

# Start a timer
start_time <- Sys.time()

# Load the desired bin into Spark
intensity_data <- sc %>% 
  spark_read_parquet(
    name = 'intensity_data', 
    path = get_snp_location(desired_snp),
    memory = FALSE )

# Subset bin to snp and then collect to local
test_subset <- intensity_data %>% 
  filter(SNP_Name == desired_snp) %>% 
  collect()

print(Sys.time() - start_time)

The run took 29.415 seconds. Much better, but still not good enough for mass testing of anything. In addition, I couldn't speed things up with caching, because whenever I tried to cache the data frame in memory, Spark crashed, even when I allocated more than 50 GB of memory to a dataset that weighed less than 15.

Back to AWK

What I learned: associative arrays in AWK are very efficient.

I realized I could achieve higher speed. I remembered that in Bruce Barnett's wonderful AWK tutorial I had read about a cool feature called "associative arrays." Essentially, these are key-value pairs, which for some reason go by a different name in AWK, and so I had never given them much thought. Roman Cheplyaka noted that the term "associative arrays" is much older than the term "key-value pair." Even if you look up key-value in the Google Ngram viewer, you won't see that term there, but you will find associative arrays! Moreover, "key-value pair" is most often associated with databases, so comparing it to a hashmap makes much more sense. I realized I could use these associative arrays to join my SNPs to the bin table and to the raw data without using Spark.

To do this, I used the BEGIN block in the AWK script. This is a piece of code executed before the first line of data is passed to the main body of the script.

join_data.awk
BEGIN {
  FS=",";
  batch_num=substr(chunk,7,1);
  chunk_id=substr(chunk,15,2);
  while(getline < "snp_to_bin.csv") {bin[$1] = $2}
}
{
  print $0 > "chunked/chr_"chr"_bin_"bin[$1]"_"batch_num"_"chunk_id".csv"
}

The while(getline...) block loads all the rows from the bin CSV, setting the first column (the SNP name) as the key of the associative array bin and the second value (the bin) as the value. Then, in the { } block, which runs on every line of the main file, each line is sent to an output file, which gets a unique name depending on its bin: ..._bin_"bin[$1]"_....

The variables batch_num and chunk_id matched the data supplied by the pipe, avoiding a race condition, so that each parallel execution thread wrote to its own unique file.
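Stripped of the batch bookkeeping, the associative-array join can be shown end to end on made-up data (the file names and two-column layout here are illustrative, not the real schema):

```shell
tmp=$(mktemp -d); cd "$tmp"; mkdir chunked

# Lookup table: SNP name -> bin, to be loaded into an associative array.
printf 'rs1,0\nrs2,0\nrs3,1\n' > snp_to_bin.csv
# "Raw" data keyed by SNP name in column 1.
printf 'rs1,7.1\nrs3,2.2\nrs2,0.9\n' > raw.csv

awk '
BEGIN {
  FS = ","
  # getline < file sets $0 and splits fields, just like the main loop.
  while ((getline < "snp_to_bin.csv") > 0) { bin[$1] = $2 }
}
{ print $0 > ("chunked/bin_" bin[$1] ".csv") }
' raw.csv

cat chunked/bin_0.csv   # rs1,7.1 and rs2,0.9
cat chunked/bin_1.csv   # rs3,2.2
```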

Since all the raw data was already scattered into folders by chromosome, left over from my earlier experiment with AWK, I could now write another Bash script to process one chromosome at a time and send the more deeply partitioned data up to S3.

DESIRED_CHR='13'

# Download chromosome data from s3 and split into bins
aws s3 ls $DATA_LOC |
awk '{print $4}' |
grep 'chr'$DESIRED_CHR'.csv' |
parallel "echo 'reading {}'; aws s3 cp "$DATA_LOC"{} - | awk -v chr=\"$DESIRED_CHR\" -v chunk=\"{}\" -f split_on_chr_bin.awk"

# Combine all the parallel process chunks to single files and upload to rds using R
ls chunked/ |
cut -d '_' -f 4 |
sort -u |
parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"
rm chunked/*

The script has two parallel sections.

In the first section, data is read from every file containing information on the desired chromosome, and then that data is distributed across threads, which scatter the rows into the appropriate bin files. To avoid race conditions when multiple threads write to the same file, AWK passes the file names so that data is written to different places, e.g. chr_10_bin_52_batch_2_aa.csv. As a result, a huge number of small files are created on disk (for this I used terabyte EBS volumes).

The pipeline in the second parallel section walks through the bins and combines each one's individual files into a common CSV with cat, and then sends them off for export.

Piping into R?

What I learned: you can access stdin and stdout from an R script, and therefore use it in a pipeline.

You may have noticed this line in the Bash script: ...cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R.... It pipes all the concatenated bin files into the R script below. {} is a special parallel placeholder that inserts whatever data it sends to the given stream directly into the command itself. The {#} option provides a unique thread ID, and {%} represents the job slot number (repeated, but never simultaneously). A list of all the options can be found in the documentation.

#!/usr/bin/env Rscript
library(readr)
library(aws.s3)

# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]

data_cols <- list(SNP_Name = 'c', ...)

s3saveRDS(
  read_csv(
        file("stdin"), 
        col_names = names(data_cols),
        col_types = data_cols 
    ),
  object = data_destination
)

When the variable file("stdin") is passed to readr::read_csv, the data piped into the R script is loaded into a data frame, which is then written directly to S3 as an .rds file using aws.s3.

RDS is something like a junior version of Parquet, without the frills of columnar storage.

After the Bash script finished, I ended up with a pile of .rds files sitting in S3, which let me use efficient compression and built-in types.

Despite the use of famously slow R, everything worked very quickly. Unsurprisingly, the parts of R responsible for reading and writing data are well optimized. After a test on one medium-sized chromosome, the job completed on a c5n.4xl instance in about two hours.

S3 limits

What I learned: thanks to its smart path implementation, S3 can handle a lot of files.

I was worried whether S3 could handle the multitude of files being transferred to it. I could make the file names meaningful, but how would S3 look them up?

Folders in S3 are just for show; the system isn't actually interested in the / character. From the S3 FAQ page.

It appears that S3 represents the path to a file as a simple key in a kind of hash table or document-based database. A bucket can be thought of as a table, and files as records in that table.

Since speed and efficiency are important for making money at Amazon, it's no surprise that this key-as-a-file-path system is freakishly optimized. I tried to find a balance: avoiding a lot of get requests, while still keeping each request fast. It turned out that making about 20 thousand bin files works best. I think that with further optimization we could squeeze out more speed (for example, by making a special bucket just for the data, thereby shrinking the lookup table). But there was no time or money for more experiments.

What about cross-compatibility?

What I learned: the number one cause of wasted time is optimizing your storage method prematurely.

At this point, it is very important to ask yourself: "Why use a proprietary file format?" The reason lies in loading speed (gzipped CSV files took 7 times longer to load) and compatibility with our workflows. I may reconsider whether R can easily load Parquet (or Arrow) files without the Spark burden. Everyone in our lab uses R, and if I ever need to convert the data to another format, I still have the original text data, so I can simply run the pipeline again.

Work distribution

What I learned: don't try to optimize jobs by hand; let the computer do it.

I had debugged the workflow on one chromosome; now I needed to process all the rest.
I wanted to spin up several EC2 instances for the conversion, but at the same time I was afraid of ending up with a badly unbalanced load across the processing jobs (just as Spark had suffered from unbalanced partitions). Besides, I didn't want to raise one instance per chromosome, since AWS accounts have a default limit of 10 instances.

So I decided to write a script in R to balance the processing jobs.

First, I asked S3 to calculate how much storage each chromosome occupied.

library(aws.s3)
library(tidyverse)

chr_sizes <- get_bucket_df(
  bucket = '...', prefix = '...', max = Inf
) %>% 
  mutate(Size = as.numeric(Size)) %>% 
  filter(Size != 0) %>% 
  mutate(
    # Extract chromosome from the file name 
    chr = str_extract(Key, 'chr.{1,4}.csv') %>%
             str_remove_all('chr|.csv')
  ) %>% 
  group_by(chr) %>% 
  summarise(total_size = sum(Size)/1e+9) # Divide to get value in GB



# A tibble: 27 x 2
   chr   total_size
   <chr>      <dbl>
 1 0           163.
 2 1           967.
 3 10          541.
 4 11          611.
 5 12          542.
 6 13          364.
 7 14          375.
 8 15          372.
 9 16          434.
10 17          443.
# … with 17 more rows

Then I wrote a function that takes the total size, shuffles the order of the chromosomes, divides them into num_jobs groups, and reports how different the sizes of the resulting processing jobs are.

num_jobs <- 7
# How big would each job be if perfectly split?
job_size <- sum(chr_sizes$total_size)/num_jobs

shuffle_job <- function(i){
  chr_sizes %>%
    sample_frac() %>% 
    mutate(
      cum_size = cumsum(total_size),
      job_num = ceiling(cum_size/job_size)
    ) %>% 
    group_by(job_num) %>% 
    summarise(
      job_chrs = paste(chr, collapse = ','),
      total_job_size = sum(total_size)
    ) %>% 
    mutate(sd = sd(total_job_size)) %>% 
    nest(-sd)
}

shuffle_job(1)



# A tibble: 1 x 2
     sd data            
  <dbl> <list>          
1  153. <tibble [7 × 3]>

Then I ran a thousand shuffles using purrr and picked the best one.

1:1000 %>% 
  map_df(shuffle_job) %>% 
  filter(sd == min(sd)) %>% 
  pull(data) %>% 
  pluck(1)

This gave me a set of jobs that were very similar in size. All that remained was to wrap my earlier Bash script in a big for loop. This optimization took about 10 minutes to write: far less than I would have spent manually creating the jobs had they been unbalanced. So I think I was right to do this preliminary optimization.

for DESIRED_CHR in "16" "9" "7" "21" "MT"
do
# Code for processing a single chromosome
done

Finally, I added a shutdown command:

sudo shutdown -h now

... and everything worked! Using the AWS CLI, I raised the instances, passing them the Bash scripts of their jobs via the user_data option. They ran and shut down automatically, so I wasn't paying for idle processing power.

aws ec2 run-instances ... \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<<job_name>>}]" \
--user-data file://<<job_script_loc>>

Let's package it up!

What I learned: the API should be simple, for the sake of ease and flexibility of use.

I finally had the data in the right place and form. All that remained was to make using the data as simple as possible for my colleagues. I wanted to build a simple API for making requests. If in the future I decide to switch from .rds to Parquet files, this should be my problem, not my colleagues'. For this, I decided to make an internal R package.

I built and documented a very simple package containing just a handful of data-access functions organized around the function get_snp. I also made a pkgdown website for my colleagues, so they can easily browse examples and documentation.


Smart caching

What I learned: if your data is well organized, caching will be easy!

Since one of our main workflows applies the same analysis model to a batch of SNPs, I decided to use the binning to my advantage. When data for a SNP is retrieved, all the information from its bin is attached to the returned object. That is, old queries can (in theory) speed up the processing of new queries.

# Part of get_snp()
...
  # Test if our current snp data has the desired snp.
  already_have_snp <- desired_snp %in% prev_snp_results$snps_in_bin

  if(!already_have_snp){
    # Grab info on the bin of the desired snp
    snp_results <- get_snp_bin(desired_snp)

    # Download the snp's bin data
    snp_results$bin_data <- aws.s3::s3readRDS(object = snp_results$data_loc)
  } else {
    # The previous snp data contained the right bin so just use it
    snp_results <- prev_snp_results
  }
...

While building the package, I ran a lot of benchmarks comparing the speed of different approaches. I recommend not neglecting this, because sometimes the results are unexpected. For example, dplyr::filter was much faster than grabbing rows with index-based filtering, and retrieving a single column from a filtered data frame was much faster than using indexing syntax.

Note that the prev_snp_results object contains the key snps_in_bin. This is a list of all the unique SNPs in the bin, letting you quickly check whether you already have data from a previous query. It also makes it easy to loop over all the SNPs in a bin with this code:

# Get bin-mates
snps_in_bin <- my_snp_results$snps_in_bin

for(current_snp in snps_in_bin){
  my_snp_results <- get_snp(current_snp, my_snp_results)
  # Do something with results 
}

Results

We can now (and have already begun in earnest to) run models and scenarios that were previously inaccessible to us. The best part is that my lab colleagues don't have to think about any of the complexity. They simply have a function that works.

And although the package spares them the details, I tried to make the data format simple enough that they could figure it out if I suddenly disappeared tomorrow...

The speed has increased noticeably. We often scan functionally significant fragments of the genome. Previously, we couldn't do this (it was too expensive), but now, thanks to the bin structure and caching, a request for one SNP takes on average less than 0.1 seconds, and the data usage is so low that the S3 costs are peanuts.

Conclusion

This article is not a guide at all. The solution turned out to be individual, and almost certainly not optimal. Rather, it is a travelogue. I want others to understand that such decisions do not appear fully formed in one's head; they are the result of trial and error. Also, if you are looking to hire a data scientist, keep in mind that using these tools effectively requires experience, and experience costs money. I'm glad I had the means to pay for it, but many others who could do the same job better than me will never get the opportunity, for lack of the money even to try.

Big data tools are general-purpose. If you have the time, you can almost certainly write a faster solution using smart data cleaning, storage, and extraction techniques. Ultimately, it comes down to a cost-benefit analysis.

What I learned:

  • there is no cheap way to parse 25 TB at a time;
  • watch the size and organization of your Parquet files;
  • partitions in Spark must be balanced;
  • never, ever try to make 2.5 million partitions;
  • sorting is still hard, and so is tuning Spark;
  • sometimes special data calls for special solutions;
  • Spark joins are fast, but partitioning is still expensive;
  • don't sleep through the basics; someone probably already solved your problem back in the 1980s;
  • gnu parallel is a magical thing; everyone should use it;
  • Spark likes uncompressed data and does not like combining partitions;
  • Spark has too much overhead for simple tasks;
  • associative arrays in AWK are very efficient;
  • you can access stdin and stdout from an R script, and therefore use it in a pipeline;
  • thanks to its smart path implementation, S3 can handle a lot of files;
  • the number one cause of wasted time is optimizing your storage method prematurely;
  • don't try to optimize jobs by hand; let the computer do it;
  • the API should be simple, for the sake of ease and flexibility of use;
  • if your data is well organized, caching will be easy!

Source: www.habr.com
