The RedPajama project is developing an open dataset for artificial intelligence systems

RedPajama is a collaborative project to create open machine learning models and the accompanying training data needed to build intelligent assistants that compete with commercial products such as ChatGPT. The availability of open data and open large language models is expected to remove restrictions on independent teams doing machine learning research and to simplify the creation of specialized dialogue systems. Organizations and communities including Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research and the MILA Québec AI Institute have joined the project.

The first step was the publication of RedPajama-Data-1T, a dataset of more than 1.2 trillion tokens for training conversational models. RedPajama reproduces the public data sources used by Facebook to create its LLaMA model (totaling 1.25 trillion tokens), but is supplied under an open license that does not limit the scope of use (the LLaMA data and models were provided only to researchers, by special request, for non-commercial use). The downloadable set is 2.67 TB and includes web pages indexed by Common Crawl, Wikipedia archives, source code from GitHub, public-domain books from Project Gutenberg, scientific articles from the arXiv archive, and discussions from Stack Overflow and other Stack Exchange sites.
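A rough sense of how those sources contribute to the total can be sketched as follows; the per-slice token counts below are approximate rounded figures taken from the RedPajama project announcement (not from this article) and may differ slightly from the released data:

```python
# Approximate per-slice token counts for RedPajama-Data-1T, in billions of
# tokens. These are rounded figures from the project announcement; the
# exact released counts may differ slightly.
TOKEN_COUNTS_B = {
    "common_crawl": 878,   # web pages indexed by Common Crawl
    "c4": 175,             # the C4 cleaned Common Crawl subset
    "github": 59,          # source code from GitHub
    "arxiv": 28,           # scientific articles from arXiv
    "book": 26,            # public-domain books
    "wikipedia": 24,       # Wikipedia archives
    "stackexchange": 20,   # Stack Overflow and other Stack Exchange sites
}

# Summing the slices lands near the advertised 1.2 trillion tokens.
total_b = sum(TOKEN_COUNTS_B.values())
print(f"Total: ~{total_b} billion tokens")  # ~1210 billion, i.e. ~1.2T
```

As the tally shows, Common Crawl web data dominates the corpus, with the curated sources (code, books, papers, encyclopedic text) providing a smaller but higher-quality remainder.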

Ready-made models, trained on the prepared dataset and fine-tuned on instruction-execution dialogue examples from the Alpaca and OpenChatKit projects, are planned for release in the coming weeks. Similar language model initiatives include the partially open projects LLaMA, Alpaca, Vicuna, and Koala, as well as the fully open Pythia, OpenChatKit, Open Assistant, and Dolly.

Additionally, there are several new projects related to machine learning:

  • MiniGPT-4 extends traditional conversational chatbots with the ability to take visual information into account: the system can analyze images and recognize handwritten text during interaction (for example, you can ask what object is shown in a picture, ask the bot to write a story based on a photo, or have it generate a website from a schematic sketch). The MiniGPT-4 implementation is written in Python and distributed under the BSD license.
  • Facebook has published a toolkit and DINOv2, a self-supervised (SSL, Self-Supervised Learning; it does not use human-prepared labels or annotations) machine vision model suitable for generalized visual data processing (image classification, extracting information about objects in images, understanding what is happening in video) and pixel-level tasks (depth prediction, segmentation). The model was trained on a collection of 142 million images. The implementation is written in Python and distributed under a Creative Commons Attribution-NonCommercial 4.0 license, which permits non-commercial use.
  • GPT4All is a toolkit for quickly launching self-contained chatbots on one's own hardware (they do not access external services and run on CPUs with AVX2 support). Large language models based on GPT-J and LLaMA can be connected. The code is written in Python and distributed under the MIT license.

Source: opennet.ru