Intel open-sources ControlFlag, a machine learning system for finding bugs in code

Intel has published the results of its ControlFlag research project, which aims to build a machine learning system for improving code quality. Using a model trained on a large body of existing code, the toolkit detects errors and anomalies in source code written in high-level languages such as C/C++. It can identify a wide range of problems, from typos and incorrect type combinations to missing NULL checks on pointers and memory problems. The ControlFlag code is written in C++ and released as open source under the MIT license.

The system learns on its own by building a statistical model from the existing code of open projects published on GitHub and similar public repositories. During training, it identifies the typical patterns used to build constructs in the code and assembles a syntax tree of the relationships between these patterns, reflecting the flow of code execution in the program. The result is a reference decision tree that pools the development experience behind all the analyzed source code.
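The learning stage can be illustrated with a deliberately simplified sketch. It assumes that conditions have already been extracted as strings and it abstracts them with a regular expression instead of a real parse tree; the names `abstract_condition` and `learn_patterns` and the tiny corpus are hypothetical and are not taken from ControlFlag itself.

```python
import re
from collections import Counter

# Tokenizer for simple C-like conditions: identifiers, integers,
# two-character operators, then any other single symbol.
_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|\|\||&&|[^\s\w]")

def abstract_condition(cond: str) -> str:
    """Turn a concrete condition into an abstract pattern.

    Identifiers become VAR, integer literals become NUM,
    NULL and operators are kept as they are.
    """
    out = []
    for tok in _TOKEN.findall(cond):
        if tok == "NULL":
            out.append("NULL")
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("VAR")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)

def learn_patterns(conditions):
    """Count how often each abstract pattern occurs in the corpus."""
    return Counter(abstract_condition(c) for c in conditions)

# Hypothetical miniature "training corpus" of conditions.
corpus = ["x == 7", "count == 0", "ptr == NULL", "n == 42", "i = 1"]
model = learn_patterns(corpus)
```

In this toy corpus the pattern `VAR == NUM` is seen three times and `VAR = NUM` only once; a real model would be built from a far larger body of code.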

The code under test goes through the same pattern-identification process, and the resulting patterns are checked against the reference decision tree. A large discrepancy with neighboring branches indicates an anomaly in the pattern being checked. The system can not only identify an error in a pattern but also suggest a correction. For example, in the OpenSSL code it found the construct "(s1 == NULL) ^ (s2 == NULL)", which occurred only 8 times in the syntax tree, while the nearest branch, "(s1 == NULL) || (s2 == NULL)", occurred about 7 thousand times. The system also flagged the anomaly "(s1 == NULL) | (s2 == NULL)", which occurred 32 times in the tree.
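The comparison with neighboring branches can be sketched roughly as follows. The occurrence counts are the ones quoted above (with "about 7 thousand" rounded to 7000), the patterns are written with the C operators `^`, `|` and `||`, and everything else is an assumption made for illustration: the definition of a "neighbor" as a pattern differing in exactly one token, the `ratio` threshold, and the function names.

```python
def neighbors(pattern, counts):
    """Patterns with the same number of tokens that differ in exactly one token."""
    toks = pattern.split()
    result = []
    for other in counts:
        o = other.split()
        if len(o) == len(toks) and sum(a != b for a, b in zip(toks, o)) == 1:
            result.append(other)
    return result

def check(pattern, counts, ratio=0.05):
    """Return (is_anomaly, suggested_pattern).

    A pattern is flagged when it is much rarer than its most frequent
    neighbor; that neighbor is offered as the suggested correction.
    The 5% threshold is an arbitrary illustrative value.
    """
    own = counts.get(pattern, 0)
    near = neighbors(pattern, counts)
    if not near:
        return False, None
    best = max(near, key=lambda p: counts[p])
    if own < ratio * counts[best]:
        return True, best
    return False, None

# Occurrence counts as reported in the article (7000 is approximate).
counts = {
    "( VAR == NULL ) || ( VAR == NULL )": 7000,
    "( VAR == NULL ) | ( VAR == NULL )": 32,
    "( VAR == NULL ) ^ ( VAR == NULL )": 8,
}

xor_result = check("( VAR == NULL ) ^ ( VAR == NULL )", counts)
bitor_result = check("( VAR == NULL ) | ( VAR == NULL )", counts)
or_result = check("( VAR == NULL ) || ( VAR == NULL )", counts)
```

With these numbers both the `^` and the `|` variants are flagged and the `||` form is suggested as the likely intended one, while the `||` form itself is not flagged.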


When parsing the code snippet "if (x = 7) y = x;", the system determined that an "if" statement normally uses the "variable == number" construct to compare numeric values, so "variable = number" inside an "if" expression is very likely a typo. Traditional static analyzers would also catch this error, but unlike them ControlFlag does not rely on predefined rules, which can hardly anticipate every possible case; instead it relies on statistics about how all kinds of constructs are used across a large number of projects.
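As a rough illustration of this reasoning, the sketch below pulls the condition out of an `if` statement, abstracts it, and looks up how common the resulting pattern is. The frequency table `IF_PATTERN_COUNTS` is invented for the example, and the helper names are hypothetical; none of this reflects ControlFlag's actual implementation.

```python
import re

# Invented counts of abstracted `if` conditions in a training corpus.
IF_PATTERN_COUNTS = {"VAR == NUM": 9500, "VAR = NUM": 12}

def if_conditions(source: str):
    """Yield the text inside the parentheses of each `if (...)`, honoring nesting."""
    for m in re.finditer(r"\bif\s*\(", source):
        depth, start = 1, m.end()
        for i in range(start, len(source)):
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
                if depth == 0:
                    yield source[start:i].strip()
                    break

def abstract(cond: str) -> str:
    """Replace integer literals with NUM and identifiers with VAR."""
    cond = re.sub(r"\b\d+\b", "NUM", cond)
    return re.sub(r"\b(?!NUM\b)[A-Za-z_]\w*\b", "VAR", cond)

def share(pattern: str, counts=IF_PATTERN_COUNTS) -> float:
    """Fraction of all observed `if` conditions that match this pattern."""
    total = sum(counts.values())
    return counts.get(pattern, 0) / total if total else 0.0

conds = list(if_conditions("if (x = 7) y = x;"))
pattern = abstract(conds[0])
```

Here the extracted condition `x = 7` abstracts to `VAR = NUM`, which accounts for only a tiny share of the invented counts compared with `VAR == NUM`, so no hand-written rule about assignments in conditions is needed to single it out.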

In an experiment on the source code of the cURL utility, which is often cited as an example of high-quality, well-verified code, ControlFlag found an error that static analyzers had missed: the structure member "s->keepon" has a numeric type but was compared with the boolean value TRUE. In the OpenSSL code, in addition to the aforementioned problem with "(s1 == NULL) ^ (s2 == NULL)", anomalies were also found in the expressions "(-2 == rv)" (the minus was a typo) and "BIO_puts(bp, ":")

Source: opennet.ru
