Microsoft has opened the code for the vector search library used by Bing

Microsoft company ΠΎΠΏΡƒΠ±Π»ΠΈΠΊΠΎΠ²Π°Π»Π° machine learning library source code SPTAG (Space Partition Tree And Graph) with an implementation of the approximate nearest neighbor search. Library developed by in the research division of Microsoft Research and the development center of search technologies (Microsoft Search Technology Center). In practice, SPTAG is used in the Bing search engine to determine the most relevant results, taking into account the context of search queries. The code is written in C++ and spreads under the MIT license. Build for Linux and Windows is supported. There is a binding for the Python language.

Despite the fact that the ideas of using vector storage in search engines have been floating around for a long time, in practice their implementation is hindered by the high resource intensity of operations with vectors and limitations in scalability. The combination of deep machine learning methods with approximate nearest neighbor search algorithms has made it possible to bring the performance and scalability of vector systems to a level acceptable for large search engines. For example, in Bing, for a vector index of more than 150 billion vectors, the time to fetch the most relevant results is 8ms.

The library includes tools for building an index and organizing a search for vectors, as well as a set of tools for maintaining a distributed online search system covering very large collections of vectors. Offered the following modules: an index builder for indexing, a searcher for searching using an index distributed in a cluster of several nodes, a server for running handlers on nodes, an Aggregator for combining multiple servers into one, and a client for sending requests. It supports including new vectors in the index and deleting vectors on the fly.

The library assumes that the data processed and presented in the collection is presented in the form of related vectors that can be compared based on Euclidean (L2) or cosine distances. The search query returns vectors with the minimum distance between them and the original vector. SPTAG provides two methods for organizing a vector space: SPTAG-KDT (K-Dimensional Tree (kd-tree) and relative neighborhood graph) and SPTAG-BKT (k-means tree (k-means tree and relative neighborhood graph). The first method requires less resources when working with the index, and the second demonstrates higher accuracy of search results for very large collections of vectors.

At the same time, vector search is not limited to text and can be applied to multimedia information and images, as well as for automatic recommendation generation systems. For example, in one of the prototypes based on the PyTorch framework, a vector system for searching based on the similarity of objects in images was implemented, built using data from several reference collections with images of animals, cats and dogs, which were converted into sets of vectors. When an incoming image is received for search, it is converted using a machine learning model into a vector, based on which, using the SPTAG algorithm, the most similar vectors are selected from the index and the associated images are returned as a result.

Source: opennet.ru

Add a comment