Using AWK and R to parse 25TB of data

How to read this article: I apologize for the long and rambling text. To save you time, I open each chapter with a "What I learned" introduction that sums up the chapter's gist in a sentence or two.

"Just show me the solution!" If you only want to see where I ended up, skip to the chapter "Getting more inventive", but I think the failures are the more interesting and useful part to read about.

I was recently tasked with setting up a pipeline for processing a large volume of raw DNA sequences (technically, an SNP chip). The goal was to be able to quickly pull data on a given genetic location (called a SNP) for subsequent modeling and other tasks. Using R and AWK, I was able to clean and organize the data in a natural way, dramatically speeding up query processing. It did not come easily and took many iterations. This article will help you avoid some of my mistakes and shows what I ended up with.

First, some introductory explanations.

The data

Our university's genetic information processing center provided the data as a 25 TB TSV. I received it split into 5 Gzip-compressed batches, each containing roughly 240 four-gigabyte files. Each row held the data for one SNP from one person. In total, data on ~2.5 million SNPs and ~60 thousand people was transferred. Beyond the SNP information, the files had numerous columns of numbers reflecting various characteristics, such as read intensity, the frequency of different alleles, etc. Altogether there were about 30 columns of unique values.

The goal

As with any data-management project, the most important thing was to determine how the data would be used. In this case, we will mostly be fitting models and workflows based on individual SNPs. That is, we will only need data on one SNP at a time. I had to learn how to retrieve all the records associated with any one of the 2.5 million SNPs as easily, quickly, and cheaply as possible.

How not to do this

To quote an apt cliché:

I didn't fail a thousand times, I just discovered a thousand ways not to parse lots of data into a query-friendly format.

First attempt

What I learned: there is no cheap way to parse 25 TB at a time.

Having taken the course "Advanced Methods for Big Data Processing" at Vanderbilt University, I was sure I had the trick in the bag. It would take maybe an hour or two to set up a Hive server to run over all the data and report the result. Since our data is stored in AWS S3, I used the Athena service, which lets you run Hive SQL queries against S3 data. You don't need to set up or spin up a Hive cluster, and you pay only for the data you query.

After I pointed Athena at my data and its format, I ran a few tests with queries like this:

select * from intensityData limit 10;

And quickly got back well-formed results. Excellent.

Until we tried to use the data in our work...

I was asked to pull all the information for one SNP so we could test a model on it. I ran the query:


select * from intensityData 
where snp = 'rs123456';

... and started waiting. Eight minutes and more than 4 TB of scanned data later, I had my result. Athena charges by the volume of data scanned, at $5 per terabyte. So this single query cost $20 and eight minutes of waiting. To run a model over all the data, we would have had to wait 38 years and pay $50 million. Obviously, that wasn't going to work for us.
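The arithmetic behind that estimate is worth spelling out. Here is a minimal shell sketch; the per-query figures ($5/TB, ~4 TB scanned, ~8 minutes, ~2.5 million SNPs) come from the text above, and the rest is plain multiplication:

```shell
# Back-of-the-envelope Athena cost: $5 per TB scanned, ~4 TB scanned per query,
# ~2.5 million SNPs to query, ~8 minutes of waiting each.
cost_per_query=$((5 * 4))                     # dollars per single-SNP query
total_cost=$((cost_per_query * 2500000))      # dollars to cover every SNP
total_minutes=$((8 * 2500000))
years=$((total_minutes / 60 / 24 / 365))      # integer years of waiting
echo "per query: \$$cost_per_query | total: \$$total_cost | wait: ~$years years"
```

Which lands on the $20-per-query, $50-million, 38-year figures quoted above.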

We had to use Parquet...

What I learned: watch the size of your Parquet files and their organization.

I first tried to fix the situation by converting all the TSVs into Parquet files. Parquet suits work with large datasets because its information is stored in columnar form: each column lives in its own memory/disk segment, unlike text files, where each row contains pieces of every column. If you need to find something, you just read the required column. Additionally, each file stores the range of values in each column, so if the value you're looking for isn't in a column's range, Spark won't waste time scanning the whole file.

I ran a simple AWS Glue job to convert our TSVs to Parquet and dropped the new files into Athena. It took about 5 hours. But when I ran the query, it took roughly the same amount of time and slightly less money to complete. The thing is that Spark, trying to optimize the job, simply unpacked each TSV chunk and put it into its own Parquet chunk. And because each chunk was big enough to hold the complete records of many people, every file contained all the SNPs, so Spark had to open all the files to extract the information it needed.

Curiously, Parquet's default (and recommended) compression type, snappy, is not splittable. So each executor was stuck with the task of unpacking and downloading the complete 3.5 GB dataset.


Let's understand the problem

What I learned: sorting is hard, especially when data is distributed.

It seemed to me that I now understood the essence of the problem. I only needed to sort the data by the SNP column, not by people. Then several SNPs would sit together in a given chunk of data, and Parquet's smart "open only if the value is in range" feature would show itself in all its glory. Unfortunately, sorting billions of rows scattered across a cluster turned out to be a difficult task.

AWS really doesn't want to issue a refund for the reason "I am an absent-minded student". After I ran the sort on Amazon Glue, it ran for 2 days and then crashed.

What about partitioning?

What I learned: partitions in Spark must be balanced.

I then hit on the idea of partitioning the data by chromosome. There are 23 of them (plus several more if you count mitochondrial DNA and unmapped regions).
That would let me split the data into smaller chunks. Adding a single line, partition_by = "chr", to the Spark export job in the Glue script should bucket the data accordingly.

The genome is made up of numerous fragments called chromosomes.

Unfortunately, it didn't work. Chromosomes have different sizes, and hence different amounts of data. That means the tasks Spark sent to workers were unbalanced and completed slowly, because some nodes finished early and then sat idle. The jobs did complete, though. But when requesting a single SNP, the imbalance caused problems again. The cost of processing SNPs on the larger chromosomes (that is, exactly where we want to get data from) only dropped by about a factor of 10. A lot, but not enough.

What if we split it into even smaller partitions?

What I learned: never, ever try to make 2.5 million partitions.

I decided to go all out and partition by each individual SNP. That guaranteed partitions of equal size. IT WAS A BAD IDEA. I used Glue again and added the innocent line partition_by = 'snp'. The job started and began running. A day later I checked and saw that nothing had been written to S3 yet, so I killed the job. It turned out Glue had been writing intermediate files to a hidden location in S3 — a lot of files, perhaps a couple of million. As a result, my mistake cost more than a thousand dollars and did not please my advisor.

Partitioning + sorting

What I learned: sorting is still hard, and so is tuning Spark.

My last partitioning attempt was to partition by chromosome and then sort within each partition. In theory, this would speed up every query, because the desired SNP data would have to sit within a few Parquet chunks inside a given range. Unfortunately, sorting even partitioned data turned out to be hard work. As a result, I switched to EMR with a custom cluster, used eight powerful instances (C5.4xl) and sparklyr to build a more flexible workflow...

# Sparklyr snippet to partition by chr and sort w/in partition
# Join the raw data with the snp bins
raw_data %>%
  group_by(chr) %>%
  arrange(Position) %>% 
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr')
  )

...but the job still never finished. I tuned it every way I could: increased the memory allocated to each query executor, used nodes with large amounts of memory, used broadcast variables, but every time these turned out to be half-measures, and little by little the executors started failing until everything ground to a halt.

Getting more inventive

What I learned: sometimes special data calls for special solutions.

Every SNP has a position value: a number corresponding to how many bases along its chromosome it lies. This is a nice, natural way to organize our data. At first I wanted to partition by regions of each chromosome. For example, positions 1–2000, 2001–4000, etc. But the problem is that SNPs are not evenly distributed along the chromosomes, so the group sizes would vary wildly.


As a result, I arrived at binning positions into groups (bins). Using the data already loaded, I ran a query for the list of unique SNPs, their positions, and their chromosomes. Then I sorted within each chromosome and gathered the SNPs into groups (bins) of a given size — say, 1000 SNPs each. That gave me the SNP-to-group-per-chromosome mapping.

In the end I used groups (bins) of 75 SNPs; the reason will be explained below.
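As a toy illustration of that binning rule — rank within chromosome, then bin = floor(rank / snps_per_bin) — the same idea is easy to reproduce with AWK. The three-column snp,chr,position layout here is invented for the demo:

```shell
# Toy version of the binning rule: within each chromosome, rank SNPs by
# position and assign bin = floor(rank / snps_per_bin).
# Hypothetical input layout: snp,chr,position (already sorted by chr, position).
snps_per_bin=3   # the article settled on 75; 3 keeps the demo readable
printf 'rs1,1,100\nrs2,1,200\nrs3,1,300\nrs4,1,400\nrs5,2,50\n' |
awk -F',' -v n="$snps_per_bin" '
  { rank[$2]++                              # running rank within chromosome $2
    print $1 "," $2 "," int(rank[$2]/n) }   # third field is the bin id
' > snp_to_bin_demo.csv
cat snp_to_bin_demo.csv
```

Rows rs1 and rs2 land in bin 0, rs3 and rs4 in bin 1, and rs5 starts over in bin 0 of chromosome 2.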

snp_to_bin <- unique_snps %>% 
  group_by(chr) %>% 
  arrange(position) %>% 
  mutate(
    rank = 1:n(),
    bin = floor(rank/snps_per_bin)
  ) %>% 
  ungroup()

First attempt with Spark

What I learned: joins in Spark are fast, but partitioning is still expensive.

I wanted to read this small data frame (2.5 million rows) into Spark, join it with the raw data, and then partition by the newly added bin column.


# Join the raw data with the snp bins
data_w_bin <- raw_data %>%
  left_join(sdf_broadcast(snp_to_bin), by = 'snp_name') %>%
  group_by(chr_bin) %>%
  arrange(Position) %>% 
  spark_write_parquet(
    path = DUMP_LOC,
    mode = 'overwrite',
    partition_by = c('chr_bin')
  )

I used sdf_broadcast(), which tells Spark that it should send the data frame to all nodes. This is useful when the data is small in size and needed by every task. Otherwise, Spark tries to be clever and distributes the data as needed, which can cause slowdowns.

And again, my idea didn't work: the tasks ran for a while, completed the join, and then, like the executors launched by the partitioning, began to fail.

Adding AWK

What I learned: don't sleep through the fundamentals. Someone probably solved your problem back in the 1980s.

Up to this point, the cause of all my Spark failures had been the jumble of data in the cluster. Perhaps the situation could be improved with pre-processing. I decided to try splitting the raw text data into chromosome columns, hoping to hand Spark "pre-partitioned" data.

I searched StackOverflow for how to split by column values and found a great answer. With AWK you can split a text file by column values by writing to files from within the script, instead of sending the results to stdout.

I wrote a Bash script to try it out. It downloaded one of the packed TSVs, decompressed it with gzip, and piped it to awk.

gzip -dc path/to/chunk/file.gz |
awk -F '\t' \
'{print $1",..."$30">"chunked/"$chr"_chr"$15".csv"}'

It worked!
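Here is a self-contained miniature of the trick, for anyone who wants to see it run. The input rows and the chromosome-in-column-2 layout are invented for the demo; the real files had about 30 tab-separated columns:

```shell
# Route each row into a file named after its chromosome column.
# AWK keeps the output files open, so this is a single pass over the input.
mkdir -p chunk_demo
printf 'rs1,chr1,0.9\nrs2,chr2,0.4\nrs3,chr1,0.7\n' |
awk -F',' '{ print $0 > ("chunk_demo/" $2 ".csv") }'
wc -l chunk_demo/chr1.csv chunk_demo/chr2.csv
```

Note the parentheses around the redirection target: they keep the filename concatenation unambiguous across awk implementations.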

Saturating the cores

What I learned: gnu parallel is a magical thing, everyone should use it.

The splitting was rather slow, and when I ran htop to check on the powerful (and expensive) EC2 instance, it turned out I was using a single core and about 200 MB of memory. To solve the problem without losing a lot of money, we had to figure out how to parallelize the work. Luckily, in Jeroen Janssens's absolutely amazing book Data Science at the Command Line, I found a chapter on parallelization. From it I learned about gnu parallel, a very flexible way to run multithreaded work in Unix.

When I started the splitting with the new process, everything was fine, but there was still a bottleneck: downloading the S3 objects to disk was not very fast and not fully parallelized. To fix this, I did the following:

  1. I found out that the S3 download stage can be implemented directly in the pipeline, eliminating intermediate storage on disk entirely. That meant I could avoid writing the raw data to disk and use smaller, and therefore cheaper, storage on AWS.
  2. The command aws configure set default.s3.max_concurrent_requests 50 greatly increased the number of threads used by the AWS CLI (the default is 10).
  3. I switched to an EC2 instance optimized for network speed, with the letter n in its name. I found that the loss of compute power when using n-instances is more than made up for by the higher download speed. For most tasks I used c5n.4xl.
  4. I swapped gzip for pigz, a gzip tool that can do clever things to parallelize the inherently unparallelizable work of decompressing files (this helped the least).

# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50

for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do

        aws s3 cp s3://$batch_loc$chunk_file - |
        pigz -dc |
        parallel --block 100M --pipe  \
        "awk -F '\t' '{print $1",..."$30">"chunked/{#}_chr"$15".csv"}'"

       # Combine all the parallel process chunks to single files
        ls chunked/ |
        cut -d '_' -f 2 |
        sort -u |
        parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'
        
         # Clean up intermediate data
       rm chunked/*
done

These steps combined with one another to make everything work very fast. By increasing the download speed and eliminating disk writes, I could now process a 5 terabyte batch in just a few hours.

(An embedded tweet went here — it should have said 'TSV'. Alas.)

Using the newly parsed data

What I learned: Spark likes uncompressed data and does not like combining partitions.

Now the data was in S3 in an unpacked (read: splittable) and semi-ordered format, and I could return to Spark again. A surprise awaited me: I again failed to achieve what I wanted! It was very hard to tell Spark exactly how the data was partitioned. And even when I did, it turned out there were too many partitions (95 thousand), and when I used coalesce to reduce their number to reasonable limits, it destroyed my partitioning. I'm sure this can be fixed, but after a couple of days of searching I couldn't find a solution. I eventually finished all the tasks in Spark, although it took a while and my split Parquet files were not very small (~200 KB). Still, the data was where it was needed.

Tiny and uneven — adorable!

Testing queries on a local Spark

What I learned: Spark has too much overhead for simple tasks.

With the data loaded in a sensible format, I could test the speed. I set up an R script to run a local Spark server, then loaded a Spark data frame from a given Parquet group (bin) location. I tried loading all the data but couldn't get sparklyr to recognize the partitioning.

sc <- spark_connect(master = "local")

desired_snp <- 'rs34771739'

# Start a timer
start_time <- Sys.time()

# Load the desired bin into Spark
intensity_data <- sc %>% 
  spark_read_parquet(
    name = 'intensity_data', 
    path = get_snp_location(desired_snp),
    memory = FALSE )

# Subset bin to snp and then collect to local
test_subset <- intensity_data %>% 
  filter(SNP_Name == desired_snp) %>% 
  collect()

print(Sys.time() - start_time)

The execution took 29.415 seconds. Much better, but still not great for mass testing of anything. On top of that, I couldn't speed things up with caching, because when I tried to cache the data frame in memory, Spark kept crashing, even when I allocated more than 50 GB of memory to a dataset weighing under 15 GB.

Back to AWK

What I learned: associative arrays in AWK are very efficient.

I realized I could do much better. I remembered that in Bruce Barnett's wonderful AWK tutorial I had read about a great feature called "associative arrays". Essentially, these are key-value pairs, which for some reason went by a different name in AWK, so I had somehow never given them much thought. Roman Cheplyaka noted that the term "associative arrays" is much older than "key-value pair". Even if you look up key-value in Google Ngram, you won't see that term there, but you will find associative arrays! Moreover, "key-value pair" is most often associated with databases, so it makes much more sense to compare it with a hashmap. I realized I could use these associative arrays to link my SNPs to the bin table and the raw data without using Spark.

To do this, I used the BEGIN block in the AWK script. This is a piece of code that runs before the first line of data is passed to the main body of the script.

join_data.awk
BEGIN {
  FS=",";
  batch_num=substr(chunk,7,1);
  chunk_id=substr(chunk,15,2);
  while(getline < "snp_to_bin.csv") {bin[$1] = $2}
}
{
  print $0 > "chunked/chr_"chr"_bin_"bin[$1]"_"batch_num"_"chunk_id".csv"
}

The while(getline...) command loaded all the rows from the group CSV (bin), set the first column (the SNP name) as the key of the associative array bin and the second value (the group) as its value. Then, in the { } block, which runs on every line of the main file, each line is sent to an output file that gets a unique name based on its group (bin): ..._bin_"bin[$1]"_....

The variables batch_num and chunk_id matched the data supplied by the pipeline, avoiding a race condition: each parallel execution thread wrote to its own unique file.
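A runnable miniature of the same BEGIN-block join (file names, columns, and values are invented for the demo): the lookup table is loaded into the associative array bin once, then every raw row is tagged with its bin.

```shell
# Build a tiny snp -> bin lookup table, then join it against "raw" rows in AWK.
printf 'rs1,0\nrs2,0\nrs3,1\n' > bin_lookup_demo.csv
printf 'rs2,sampleA,0.91\nrs3,sampleB,0.44\n' |
awk -F',' '
  BEGIN { while ((getline < "bin_lookup_demo.csv") > 0) bin[$1] = $2 }
  { print $0 ",bin_" bin[$1] }          # main block: runs once per raw row
' > joined_demo.csv
cat joined_demo.csv
```

The lookup is O(1) per row, which is exactly why this scales to billions of lines where Spark's shuffle-based join kept falling over.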

Since I had already scattered all the raw data into chromosome folders during my earlier experiments with AWK, I could now write another Bash script to process one chromosome at a time and send the more deeply partitioned data up to S3.

DESIRED_CHR='13'

# Download chromosome data from s3 and split into bins
aws s3 ls $DATA_LOC |
awk '{print $4}' |
grep 'chr'$DESIRED_CHR'.csv' |
parallel "echo 'reading {}'; aws s3 cp "$DATA_LOC"{} - | awk -v chr=""$DESIRED_CHR"" -v chunk="{}" -f split_on_chr_bin.awk"

# Combine all the parallel process chunks to single files and upload to rds using R
ls chunked/ |
cut -d '_' -f 4 |
sort -u |
parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"
rm chunked/*

The script has two parallel sections.

In the first section, data is read from all the files containing information on the desired chromosome, and this data is distributed across threads, which scatter the files into their proper groups (bins). To avoid race conditions when multiple threads write to the same file, AWK takes file names from the pipeline and writes the data to distinct places, e.g. chr_10_bin_52_batch_2_aa.csv. As a result, lots of small files are created on disk (for this I used terabyte EBS volumes).

The pipeline in the second parallel section walks through the groups (bins), combines each one's files into a single CSV with cat, and then sends them off for export.

Streaming to R?

What I learned: you can talk to stdin and stdout from an R script, and therefore use it in a pipeline.

You may have noticed this line in the Bash script: ...cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R.... It pipes all the concatenated group (bin) files into the R script below. {} is a special parallel mechanism that inserts whatever data it sends to the specified stream directly into the command itself. The {#} option gives a unique thread ID, and {%} stands for the job slot number (repeated, but never simultaneously). A list of all the options can be found in the documentation.

#!/usr/bin/env Rscript
library(readr)
library(aws.s3)

# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]

data_cols <- list(SNP_Name = 'c', ...)

s3saveRDS(
  read_csv(
        file("stdin"), 
        col_names = names(data_cols),
        col_types = data_cols 
    ),
  object = data_destination
)

When the variable file("stdin") is passed to readr::read_csv, the data piped into the R script is loaded into a frame, which is then written directly to S3 as an .rds file using aws.s3.

RDS is something like a junior version of Parquet, without the niceties of columnar storage.

After the Bash script finished, I had a bundle of .rds files sitting in S3, which let me take advantage of efficient compression and built-in types.

Despite using the famously slow R, everything worked very fast. Unsurprisingly, the parts of R that read and write data are heavily optimized. After testing on one medium-sized chromosome, the job completed on a C5n.4xl instance in about two hours.

S3 limits

What I learned: thanks to a smart path implementation, S3 can handle a lot of files.

I was worried whether S3 could cope with the multitude of files being transferred to it. I could make the file names meaningful, but how would S3 look them up?

(Folders in S3 are just for show; the system actually takes no interest in the / symbol. From the S3 FAQ page.)

It appears that S3 represents the path to a given file as a simple key in something like a hash table or a document-based database. A bucket can be thought of as a table, and files as records in that table.

Since speed and efficiency are essential to making money at Amazon, it is no surprise that this key-as-a-file-path system is insanely optimized. I tried to find a balance: avoiding a large number of get requests while still having the requests execute quickly. It turned out that the sweet spot was about 20 thousand bin files. I suspect further optimization could squeeze out more speed (for example, a dedicated bucket just for the data, shrinking the lookup table), but there was neither time nor money for more experiments.

What about cross-compatibility?

What I learned: the number-one cause of wasted time is optimizing your storage method prematurely.

At this point, it is very important to ask yourself: "Why use a proprietary file format?" The reason lies in loading speed (gzipped CSV files took about 7 times longer to load) and in compatibility with our workflows. I might reconsider if R could easily load Parquet (or Arrow) files without the burden of Spark. Everyone in our lab uses R, and if I ever need to convert the data to another format, I still have the original text data, so I can just run the pipeline again.

Dividing up the work

What I learned: don't try to optimize jobs by hand; let the computer do it.

Having debugged the workflow on one chromosome, I now needed to process all the remaining data.
I wanted to spin up several EC2 instances for the conversion, but at the same time I feared getting a badly unbalanced load across the different processing jobs (just as Spark suffered from its unbalanced partitions). Besides, I had no appetite for launching one instance per chromosome, since AWS accounts have a default limit of 10 instances.

So I decided to write an R script to optimize the processing jobs.

First, I asked S3 to calculate how much storage space each chromosome occupies.

library(aws.s3)
library(tidyverse)

chr_sizes <- get_bucket_df(
  bucket = '...', prefix = '...', max = Inf
) %>% 
  mutate(Size = as.numeric(Size)) %>% 
  filter(Size != 0) %>% 
  mutate(
    # Extract chromosome from the file name 
    chr = str_extract(Key, 'chr.{1,4}.csv') %>%
             str_remove_all('chr|.csv')
  ) %>% 
  group_by(chr) %>% 
  summarise(total_size = sum(Size)/1e+9) # Divide to get value in GB



# A tibble: 27 x 2
   chr   total_size
   <chr>      <dbl>
 1 0           163.
 2 1           967.
 3 10          541.
 4 11          611.
 5 12          542.
 6 13          364.
 7 14          375.
 8 15          372.
 9 16          434.
10 17          443.
# … with 17 more rows

I then wrote a function that takes the total sizes, shuffles the order of the chromosomes, splits them into num_jobs groups, and reports how much the sizes of all the processing jobs differ.

num_jobs <- 7
# How big would each job be if perfectly split?
job_size <- sum(chr_sizes$total_size)/num_jobs

shuffle_job <- function(i){
  chr_sizes %>%
    sample_frac() %>% 
    mutate(
      cum_size = cumsum(total_size),
      job_num = ceiling(cum_size/job_size)
    ) %>% 
    group_by(job_num) %>% 
    summarise(
      job_chrs = paste(chr, collapse = ','),
      total_job_size = sum(total_size)
    ) %>% 
    mutate(sd = sd(total_job_size)) %>% 
    nest(-sd)
}

shuffle_job(1)



# A tibble: 1 x 2
     sd data            
  <dbl> <list>          
1  153. <tibble [7 × 3]>

Then I ran a thousand shuffles with purrr and picked the best one.

1:1000 %>% 
  map_df(shuffle_job) %>% 
  filter(sd == min(sd)) %>% 
  pull(data) %>% 
  pluck(1)

I thus ended up with a set of jobs very similar in size. All that remained was to wrap my earlier Bash script in a big for loop. Writing this optimization took about 10 minutes — far less than I would have spent creating the jobs by hand had they come out unbalanced. So I reckon this preliminary optimization was the right call.

for DESIRED_CHR in "16" "9" "7" "21" "MT"
do
# Code for processing a single chromosome
done

At the end, I added a shutdown command:

sudo shutdown -h now

... and it all worked! Using the AWS CLI, I launched instances, handing them their job Bash scripts for processing via the user_data option. They ran and shut down automatically, so I wasn't paying for idle compute.

aws ec2 run-instances ... \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<<job_name>>}]" \
--user-data file://<<job_script_loc>>

Let's package it up!

What I learned: an API should be simple, for ease and flexibility of use.

I finally had the data in the right place and in the right form. All that remained was to simplify using the data as much as possible, to make life easier for my colleagues. I wanted to make a simple API for creating queries. If in the future I decide to switch from .rds to Parquet files, that should be my problem, not my colleagues'. For this I decided to make an internal R package.

I built and documented a very simple package containing just a few data-access functions gathered around a get_snp function. I also made a pkgdown website for my colleagues, so they can easily see examples and documentation.


Smart caching

What I learned: if your data is well prepared, caching will be easy!

Since one of the main workflows applies the same analysis model to a pack of SNPs, I decided to use the binning to my advantage. When data for a SNP is fetched, the information for the whole group (bin) is attached to the returned object. That means old queries can (in theory) speed up the processing of new ones.

# Part of get_snp()
...
  # Test if our current snp data has the desired snp.
  already_have_snp <- desired_snp %in% prev_snp_results$snps_in_bin

  if(!already_have_snp){
    # Grab info on the bin of the desired snp
    snp_results <- get_snp_bin(desired_snp)

    # Download the snp's bin data
    snp_results$bin_data <- aws.s3::s3readRDS(object = snp_results$data_loc)
  } else {
    # The previous snp data contained the right bin so just use it
    snp_results <- prev_snp_results
  }
...

While building the package, I ran lots of benchmarks to compare the speed of different approaches. I recommend not neglecting this, because the results are sometimes unexpected. For example, dplyr::filter was much faster than grabbing rows with indexing-based filtering, and retrieving a single column from the filtered data frame was much faster than using indexing syntax.

Note that the object prev_snp_results contains the key snps_in_bin. This is a list of all the unique SNPs in the group (bin), which lets you quickly check whether you already have the data from a previous query. It also makes it easy to loop over all the SNPs in a group (bin) with this code:

# Get bin-mates
snps_in_bin <- my_snp_results$snps_in_bin

for(current_snp in snps_in_bin){
  my_snp_results <- get_snp(current_snp, my_snp_results)
  # Do something with results 
}

Results

We can now run (and have started running in earnest) models and scenarios that were previously inaccessible to us. The best part is that my lab colleagues don't have to think about any of the complications. They just have a function that works.

And although the package spares them the details, I tried to make the data format simple enough that they could figure it out if I suddenly vanished tomorrow...

The speed has increased noticeably. We typically scan functionally significant fragments of the genome. Previously, we couldn't do that (it turned out too expensive), but now, thanks to the group (bin) structure and caching, a request for a single SNP takes on average less than 0.1 seconds, and data usage is so low that S3 costs are peanuts.

Conclusion

This article is not a guide at all. The solution turned out to be individual, and almost certainly not optimal. Rather, it is a travelogue. I want others to understand that decisions like these don't appear fully formed in your head; they are the result of trial and error. Also, if you are hiring a data scientist, keep in mind that using these tools well takes experience, and experience costs money. I'm glad I had the means to pay for it, but many others who could do the same work better than me will never get the chance, for lack of money to even try.

Big data tools are versatile. If you have the time, you can almost certainly write a faster solution using smart data cleaning, storage, and extraction techniques. Ultimately, it comes down to a cost-benefit analysis.

What I learned:

  • there is no cheap way to parse 25 TB at a time;
  • watch the size of your Parquet files and their organization;
  • partitions in Spark must be balanced;
  • in general, never try to make 2.5 million partitions;
  • sorting is still hard, as is tuning Spark;
  • sometimes special data calls for special solutions;
  • Spark joins are fast, but partitioning is still expensive;
  • don't sleep when the fundamentals are being taught; someone probably solved your problem back in the 1980s;
  • gnu parallel is a magical thing, everyone should use it;
  • Spark likes uncompressed data and does not like combining partitions;
  • Spark has too much overhead for simple tasks;
  • associative arrays in AWK are very efficient;
  • you can talk to stdin and stdout from an R script, and therefore use it in a pipeline;
  • thanks to a smart path implementation, S3 can handle a lot of files;
  • the number-one cause of wasted time is optimizing your storage method prematurely;
  • don't try to optimize jobs by hand, let the computer do it;
  • an API should be simple, for ease and flexibility of use;
  • if your data is well prepared, caching will be easy!

Source: www.habr.com
