Code for the Whisper speech recognition and translation system released as open source

The OpenAI project, which develops public projects in the field of artificial intelligence, has published its work on the Whisper speech recognition system. For English speech, the system is claimed to provide automatic recognition with robustness and accuracy close to human level. The code of the reference implementation, based on the PyTorch framework, and a set of pre-trained, ready-to-use models have been released. The code is available under the MIT license.

The model was trained on 680,000 hours of speech data gathered from several collections covering different languages and subject areas. About one third of the speech data used in training is in languages other than English. The system correctly handles situations such as accented pronunciation, background noise, and technical jargon. In addition to transcribing speech to text, the system can translate speech from any supported language into English and detect the presence of speech in an audio stream.
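As a rough illustration of the transcription and translation modes described above, the following sketch uses the published `whisper` Python package. The model name "base" and the helper name `transcribe_or_translate` are assumptions made for the example and are not taken from the article; the package, PyTorch and ffmpeg are assumed to be installed.

```python
def transcribe_or_translate(audio_path, to_english=False, model_name="base"):
    """Return (detected_language, text) for one audio file.

    Hypothetical helper: it only wraps calls from the reference
    implementation and is not part of the package itself.
    """
    # Imported lazily so the module can be loaded without the package installed.
    import whisper

    # Downloads the pre-trained weights on first use.
    model = whisper.load_model(model_name)

    # "transcribe" keeps the spoken language; "translate" produces English text.
    task = "translate" if to_english else "transcribe"
    result = model.transcribe(audio_path, task=task)

    # The result dictionary carries the detected language code and the full text.
    return result["language"], result["text"].strip()
```

Calling `transcribe_or_translate("speech.mp3", to_english=True)` on a file of your own (the file name here is a placeholder) should return the detected language code together with an English rendering of the speech.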

The models come in two flavors: an English-only model and a multilingual model, which also supports Russian, Ukrainian and Belarusian. Each flavor is in turn offered in 5 variants that differ in size and in the number of parameters in the model. The larger the model, the higher the recognition accuracy and quality, but also the higher the GPU video memory requirements and the lower the performance. For example, the smallest variant has 39 million parameters and needs 1 GB of video memory, while the largest has 1550 million parameters and needs 10 GB of video memory. The smallest variant is 32 times faster than the largest.
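The size trade-off can be made concrete with a small lookup that picks the largest variant fitting a given amount of video memory. This is only a sketch: the figures for the smallest and largest variants come from the text above, while the three middle rows and all variant names are taken from the project README at release time and should be treated as approximate. The helper `largest_model_that_fits` is hypothetical.

```python
from typing import NamedTuple, Optional


class ModelSize(NamedTuple):
    name: str
    params_millions: int
    vram_gb: int          # approximate video memory needed
    relative_speed: int   # speed relative to the largest variant


# First and last rows follow the article; the middle rows are an assumption
# based on the project README and are approximate.
MODEL_SIZES = [
    ModelSize("tiny", 39, 1, 32),
    ModelSize("base", 74, 1, 16),
    ModelSize("small", 244, 2, 6),
    ModelSize("medium", 769, 5, 2),
    ModelSize("large", 1550, 10, 1),
]


def largest_model_that_fits(vram_gb: float) -> Optional[ModelSize]:
    """Return the variant with the most parameters that fits in vram_gb."""
    fitting = [m for m in MODEL_SIZES if m.vram_gb <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.params_millions)
```

With these numbers a 4 GB card would select the "small" variant, and anything with 10 GB or more would select "large".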

The system uses a Transformer neural network architecture consisting of an encoder and a decoder that interact with each other. Audio is split into 30-second chunks, which are converted into a log-Mel spectrogram and passed to the encoder. The encoder output is passed to the decoder, which predicts a text representation interleaved with special tokens. These tokens allow a single general model to solve tasks such as language identification, timestamping of spoken phrases, transcription of speech in different languages, and translation into English.
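The two ideas in this paragraph, fixed 30-second input windows and task selection through special tokens, can be sketched in plain Python. Several things here are assumptions rather than statements from the article: the 16 kHz sample rate, the zero-padding of the last window, the use of non-overlapping windows (a simplification of how the reference implementation advances through the audio), and the helper names. The token strings follow the reference implementation's vocabulary.

```python
SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
CHUNK_SECONDS = 30     # window length stated in the article
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples):
    """Split a sequence of samples into 30-second windows.

    The final window is zero-padded so every chunk has the same length,
    which is what a fixed-size encoder input requires.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks


def decoder_prompt(language, task="transcribe", timestamps=True):
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens
```

For instance, `decoder_prompt("uk", "translate", timestamps=False)` yields the prefix for translating Ukrainian speech into English without timestamps. Each chunk from `split_into_chunks` would still have to be converted to a log-Mel spectrogram before reaching the encoder, a step this sketch leaves out.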

Source: opennet.ru
