GitHub opens up its work on applying machine learning to code search and analysis

GitHub has presented the CodeSearchNet project, which provides the machine learning models and datasets needed to parse, classify, and analyze code in various programming languages. Like ImageNet for images, CodeSearchNet includes a large collection of code snippets annotated with descriptions of what the code does. The components for training models and the examples of using CodeSearchNet are written in Python using the TensorFlow framework and are distributed under the MIT license.

CodeSearchNet was built using natural language processing techniques, which allow machine learning systems to take into account not only syntactic features but also the meaning of the actions the code performs. GitHub uses the system in experiments with semantic code search driven by natural language queries (for example, the query "sorting a list of strings" returns code that implements the corresponding algorithms).
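The idea of matching a natural language query against code can be illustrated with a deliberately simple sketch. It is not the CodeSearchNet model: it swaps the neural encoders for plain bag-of-words counts and cosine similarity, and the names `tokenize`, `embed`, `search`, and `SNIPPETS` are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase and snake_case."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", text.lower())


def embed(text):
    """Represent text as a bag-of-words count vector (toy stand-in for a neural encoder)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, snippets, top_k=1):
    """Return the top_k snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical mini-corpus: each entry is a function with its comment.
SNIPPETS = [
    "def sort_strings(items):\n    # Sort a list of strings alphabetically\n    return sorted(items)",
    "def read_file(path):\n    # Read a text file and return its contents\n    with open(path) as f:\n        return f.read()",
]

best = search("sorting a list of strings", SNIPPETS)[0]
print(best.splitlines()[0])
```

A word-overlap model like this only matches shared vocabulary; the point of the learned models described in the article is to also capture meaning when the query and the code use different words.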

The proposed dataset includes more than 2 million code-comment pairs extracted from the source code of existing open source libraries. The code in each pair is the full source text of an individual function or method, and the comment is the accompanying detailed documentation describing what the function does. Datasets are currently available for Python, JavaScript, Ruby, Go, Java, and PHP. Examples are provided of using the datasets to train several types of neural networks, including Neural Bag of Words, RNN, Self-Attention (BERT), and a 1D-CNN + Self-Attention hybrid.
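A code-comment pair of the kind described above could be modeled roughly as follows. This is a sketch under assumptions: the JSON Lines layout and the field names `language`, `code`, and `docstring` are used here for illustration, and the published files may be organized differently.

```python
import json
from dataclasses import dataclass


@dataclass
class CodeCommentPair:
    """One function or method together with the comment that documents it."""
    language: str
    code: str
    docstring: str


def parse_pairs(jsonl_text):
    """Parse JSON Lines text (one JSON object per line) into CodeCommentPair objects."""
    pairs = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append(
            CodeCommentPair(record["language"], record["code"], record["docstring"])
        )
    return pairs


# Hypothetical sample records; the field names are assumptions for this sketch.
SAMPLE = "\n".join([
    json.dumps({
        "language": "python",
        "code": "def add(a, b):\n    return a + b",
        "docstring": "Return the sum of a and b.",
    }),
    json.dumps({
        "language": "go",
        "code": "func Add(a, b int) int { return a + b }",
        "docstring": "Add returns the sum of a and b.",
    }),
])

pairs = parse_pairs(SAMPLE)
for pair in pairs:
    print(pair.language, "|", pair.docstring)
```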

To support the development of natural language search mechanisms, the CodeSearchNet Challenge set was also prepared. It includes 99 typical queries with about 4 thousand expert annotations describing the most likely matching code in the CodeSearchNet Corpus dataset, which covers about 6 million methods and functions (the set is about 20 GB in size). The CodeSearchNet Challenge can serve as a benchmark for evaluating the effectiveness of different methods of searching code with natural language queries. An example code search engine has also been prepared using Kubeflow tools.
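To show how expert annotations can turn into a benchmark score, here is a minimal sketch of NDCG (normalized discounted cumulative gain), a widely used ranking metric for graded relevance. Using NDCG here is an assumption, since the article does not name the metric; the 0 to 3 relevance scale and the function identifiers `f1`, `f2`, `f3` are hypothetical as well.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1), r starting at 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_ids, annotations):
    """Score a ranking against expert relevance annotations; unannotated results count as 0.

    The result is 1.0 when the most relevant items are ranked first.
    """
    gains = [annotations.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(annotations.values(), reverse=True)[:len(ranked_ids)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical annotations for one query: function id -> relevance (0 = irrelevant, 3 = exact match).
annotations = {"f1": 3, "f2": 1, "f3": 0}

print(ndcg(["f1", "f2", "f3"], annotations))  # best possible ordering
print(ndcg(["f3", "f2", "f1"], annotations))  # worst ordering scores lower
```

Averaging such a score over all benchmark queries gives a single number by which different search methods can be compared.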

Source: opennet.ru
