Open source release of Jina Embeddings, a model for vector representation of text meaning

Jina has open-sourced jina-embeddings-v2, a machine learning model for vector text representation, under the Apache 2.0 license. The model converts arbitrary text of up to 8192 tokens into a compact sequence of real numbers, a vector that encodes the semantics (meaning) of the source text and can be compared against vectors of other texts. Jina Embeddings was the first open machine learning model to match the performance of OpenAI's proprietary text vectorization model (text-embedding-ada-002), which can likewise process texts of up to 8192 tokens.

The distance between two generated vectors can be used to determine the semantic relationship of the source texts. In practice, the generated vectors can be used to analyze the similarity of texts, search for materials related to a topic (ranking results by semantic proximity), group texts by meaning, generate recommendations (offer a list of similar text strings), identify anomalies, detect plagiarism, and classify texts. Examples of application areas include analysis of legal documents, business analytics, processing scientific articles in medical research, literary criticism, parsing financial reports, and improving the quality of chatbot handling of complex queries.
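The comparison step described above typically uses cosine similarity between embedding vectors. Below is a minimal sketch with toy 4-dimensional vectors standing in for real model outputs (real jina-embeddings vectors have hundreds of dimensions; the vector values here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction (same meaning), values near 0 mean unrelated texts."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: two semantically close texts and one unrelated one
v_cat = [0.9, 0.1, 0.0, 0.2]      # "a cat sleeps on the sofa"
v_kitten = [0.85, 0.15, 0.05, 0.25]  # "a kitten naps on the couch"
v_finance = [0.0, 0.1, 0.95, 0.3]    # "quarterly revenue grew by 12%"

sim_close = cosine_similarity(v_cat, v_kitten)
sim_far = cosine_similarity(v_cat, v_finance)
```

Ranking search results by semantic proximity then amounts to sorting candidate texts by this score against the query vector.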

Two versions of the jina-embeddings model are available for download (base, 0.27 GB, and small, 0.07 GB), trained on 400 million pairs of English text sequences covering various fields of knowledge. During training, sequences of up to 512 tokens were used, and the trained model is extrapolated to sequences of up to 8192 tokens using the ALiBi (Attention with Linear Biases) method.

The base model has 137 million parameters and is designed for use on desktop and server systems with a GPU. The small model has 33 million parameters, provides lower accuracy, and is aimed at mobile devices and systems with limited memory. A large model with 435 million parameters is also planned for release in the near future, and a multilingual version is in development, currently focused on support for German and Spanish. Separately, a plugin has been prepared for using the jina-embeddings model through the LLM toolkit.

Source: opennet.ru
