FlexGen is an engine for running ChatGPT-like AI bots on single-GPU systems

A group of researchers from Stanford University, the University of California at Berkeley, ETH Zurich, the Higher School of Economics, Carnegie Mellon University, Yandex and Meta has published the source code of FlexGen, an engine for running large language models on systems with limited resources. For example, the engine makes it possible to build functionality reminiscent of ChatGPT and Copilot by running the pre-trained OPT-175B model (175 billion parameters) on an ordinary computer with an NVIDIA RTX 3090 gaming graphics card with 24 GB of video memory. The code is written in Python, uses the PyTorch framework, and is distributed under the Apache 2.0 license.

The package includes an example chatbot script that lets you download one of the publicly available language models and start chatting right away (for example, by running the command "python apps/chatbot.py --model facebook/opt-30b --percent 0 100 100 0 100 0"). As a base, it is proposed to use a large language model published by Facebook, trained on the BookCorpus collection (10 thousand books), CC-Stories, The Pile (OpenSubtitles, Wikipedia, DM Mathematics, HackerNews, etc.), Pushshift.io (based on Reddit data) and CCNewsV2 (news archive). The training data comprises about 180 billion tokens (800 GB of data); training the model took 33 days on a cluster of 992 NVIDIA A100 80GB GPUs.
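Judging by the project's documentation, the six numbers passed to --percent form the placement policy: in order, the share of model weights, attention (KV) cache and activations to keep in GPU memory and in CPU RAM, with anything left over spilled to disk. Below is a minimal Python sketch of how such a policy could be represented; the class and field names are invented for this illustration and are not FlexGen's actual API.

    from dataclasses import dataclass

    # Hypothetical FlexGen-style placement policy (names invented for this sketch).
    @dataclass
    class PlacementPolicy:
        weights_gpu: int   # % of model weights kept in GPU memory
        weights_cpu: int   # % of model weights kept in CPU RAM
        cache_gpu: int     # % of the attention (KV) cache kept on the GPU
        cache_cpu: int     # % of the attention (KV) cache kept in CPU RAM
        act_gpu: int       # % of activations kept on the GPU
        act_cpu: int       # % of activations kept in CPU RAM

        @staticmethod
        def disk_share(gpu: int, cpu: int) -> int:
            # Whatever is not placed on the GPU or in CPU RAM has to be
            # streamed from disk on demand.
            return max(0, 100 - gpu - cpu)

    # The example "--percent 0 100 100 0 100 0" then reads as: all weights in
    # CPU RAM, the whole KV cache and all activations in GPU memory, nothing on disk.
    policy = PlacementPolicy(0, 100, 100, 0, 100, 0)
    print(PlacementPolicy.disk_share(policy.weights_gpu, policy.weights_cpu))  # -> 0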

When running OPT-175B on a system with a single NVIDIA T4 GPU (16 GB), the FlexGen engine demonstrated throughput up to 100 times higher than previously available solutions, making large language models more accessible and allowing them to run on systems without specialized accelerators. At the same time, FlexGen can scale to parallelize computation across multiple GPUs. To fit the model into limited memory, the parameters and the attention cache are additionally compressed, and data is offloaded between GPU memory, CPU RAM and disk.
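The compression mentioned above is described in the FlexGen paper as group-wise quantization of both the weights and the attention cache down to roughly 4 bits per value. The sketch below shows the general idea of group-wise quantization in plain PyTorch; it is only an illustration of the technique under that assumption, not FlexGen's actual implementation.

    import torch

    def quantize_groupwise(x: torch.Tensor, bits: int = 4, group: int = 64):
        # Toy group-wise quantization: every `group` consecutive values share
        # one scale and zero point and are stored as small integer codes.
        flat = x.reshape(-1, group)
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
        q = torch.round((flat - lo) / scale).to(torch.uint8)  # 4-bit codes
        return q, scale, lo

    def dequantize_groupwise(q, scale, lo, shape):
        # Reconstruct an approximation of the original tensor.
        return (q.float() * scale + lo).reshape(shape)

    w = torch.randn(1024, 1024)            # stand-in for a weight matrix
    q, scale, lo = quantize_groupwise(w)
    w_hat = dequantize_groupwise(q, scale, lo, w.shape)
    print((w - w_hat).abs().max())         # reconstruction error stays small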

Currently, FlexGen supports only OPT language models, but in the future the developers promise to add support for BLOOM (176 billion parameters, covering 46 natural languages and 13 programming languages), CodeGen (can generate code in 22 programming languages) and GLM. An example of a dialogue with a bot based on FlexGen and the OPT-30B model:

Human: What is the name of the tallest mountain in the world?

Assistant: Everest.

Human: I am planning a trip for our anniversary. What things can we do?

Assistant: Well, there are a number of things you can do for your anniversary. First, you can play cards. Second, you can go for a hike. Third, you can go to a museum.

Source: opennet.ru
