The RedPajama project is developing an open dataset for artificial intelligence systems

RedPajama, a collaborative project, has been introduced to create open machine learning models and the accompanying training data, which can be used to build intelligent assistants that compete with commercial products such as ChatGPT. The availability of open data and large language models is expected to remove the restrictions faced by independent teams doing machine learning research and to simplify the creation of specialized dialogue systems. Organizations and communities such as Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research and MILA Québec AI Institute have joined the work on the project.

The first step was the release of the 1.2 trillion token RedPajama-Data-1T dataset for training conversational models. The RedPajama set reproduces data from the public sources that Facebook used to create its LLaMA model (1.25 trillion tokens in total), but it is supplied under an open license that does not restrict the scope of use (the LLaMA data and models were provided only to researchers, on special request, for non-commercial use). The downloadable RedPajama-Data-1T set is 2.67 TB and includes information from pages indexed by Common Crawl, Wikipedia archives, source code from GitHub, public-domain books from the Gutenberg library, scientific articles from the ArXiv archive, and discussions from Stack Overflow and other Stack Exchange sites.

Ready-made models, trained on the prepared dataset and tuned with ready-made dialogue examples in instruction-following form from the Alpaca and OpenChatKit projects, are planned for release in the next few weeks. Similar language model initiatives include the partially open projects LLaMA, Alpaca, Vicuna and Koala, as well as the fully open source initiatives Pythia, OpenChatKit, Open Assistant and Dolly.

In addition, there are several new projects related to machine learning:

  • MiniGPT-4 - extends conventional chatbots with capabilities that take visual information into account, which lets you analyze images and handwritten text while interacting with the system (for example, you can ask what kind of object is shown in a picture, ask the bot to write a story based on what the picture shows, or ask it to create a website from a schematic sketch). The MiniGPT-4 implementation is written in Python and distributed under the BSD license.
  • Facebook has published a toolkit and a self-supervised machine vision model, DINOv2 (SSL, Self-Supervised Learning, does not use human-prepared labels and annotations), suited to general visual data processing tasks (image classification, extracting information about objects in images, understanding what is happening in video) and to pixel-level manipulations (depth estimation, segmentation). The model was trained on a collection of 142 million images. The implementation is written in Python and distributed under the Creative Commons Attribution-NonCommercial 4.0 license, which permits non-commercial use.
  • GPT4All is a toolkit for quickly launching standalone chatbots on your own hardware (they do not access external services and run on CPUs with AVX2 support). Connecting large language models based on GPT-J and LLaMa is supported. The code is written in Python and distributed under the MIT license.

Source: opennet.ru
