The RedPajama project is developing an open dataset for artificial intelligence systems

RedPajama has been introduced: a collaborative project aimed at creating open machine learning models and the accompanying training data, which can be used to build intelligent assistants that compete with commercial products such as ChatGPT. The availability of open source data and large language models is expected to free independent machine learning research teams from restrictions and to make it easier to build custom dialogue systems. Organizations and communities such as Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute have joined the work.

The first step was the publication of the RedPajama-Data-1T dataset for training conversational models, which contains 1.2 trillion tokens. The RedPajama set reproduces the publicly available data that Facebook used to create its LLaMA model (1.25 trillion tokens in total), but it is provided under an open license (the LLaMA data and models are available only to researchers, on special request, for non-commercial use). The downloadable RedPajama-Data-1T set is 2.67 TB in size and includes information from web pages indexed by Common Crawl, Wikipedia archives, source code from GitHub, public-domain books from the Gutenberg library, scientific articles from the ArXiv archive, and discussions from Stack Overflow and other Stack Exchange sites.

Ready-made models, trained on the prepared dataset and fine-tuned with existing examples of instruction-following dialogues from the Alpaca and OpenChatKit projects, are planned for release in the next few weeks. Similar language-model initiatives include the partially open projects LLaMA, Alpaca, Vicuna, and Koala, as well as the fully open efforts Pythia, OpenChatKit, Open Assistant, and Dolly.

In addition, several new machine learning projects are worth noting:

  • MiniGPT-4 extends traditional conversational chatbots with the ability to take visual information into account, which lets the system analyze images and handwritten text during a conversation (for example, you can ask what kind of object is shown in a picture, ask the bot to write a story based on what a photo shows, or ask it to create a website from a sketch). The MiniGPT-4 implementation is written in Python and distributed under the BSD license.
  • Facebook has published tools and a self-supervised computer vision model, DINOv2 (SSL, Self-Supervised Learning, does not use human-prepared labels or annotations during training). The model is suited to general-purpose visual data processing tasks (image classification, extracting information about objects in images, understanding what is happening in a video) and to pixel-level manipulations (depth estimation, segmentation). It was trained on a collection of 142 million images. The implementation is written in Python and distributed under the Creative Commons Attribution-NonCommercial 4.0 license, which permits non-commercial use.
  • GPT4All is a toolkit for quickly launching standalone chatbots on your own hardware (they do not access external services and run on a CPU with AVX2 support). It supports plugging in large language models based on GPT-J and LLaMA. The code is written in Python and distributed under the MIT license.

Source: opennet.ru
