Intel has published ControlFlag 1.0, the first major release of a tool that identifies errors and anomalies in source code using a machine learning system trained on a large body of existing code. Unlike traditional static analyzers, ControlFlag does not apply predefined rules, which can hardly anticipate every possible case, but relies on statistics about how various language constructs are used across a large number of existing projects. The ControlFlag code is written in C++ and is released as open source under the MIT license.
The system is trained by building a statistical model of an existing corpus of open-source code published on GitHub and in similar public repositories. During training, the system identifies typical patterns for code constructs and builds a syntax tree of the relationships between these patterns, reflecting the flow of code execution in the program. The result is a reference decision tree that combines the development experience of all the analyzed source code. The code under review goes through a similar pattern-identification process, and the patterns found are checked against the reference decision tree. Large discrepancies with neighboring branches indicate an anomaly in the pattern being checked.
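The frequency-based idea described above can be illustrated with a minimal C sketch. It is not ControlFlag's actual algorithm or data structure: the flat `corpus` table, the pattern strings, the counts, and the function names `pattern_frequency` and `is_anomalous` are all hypothetical, and a simple "rarer than a fraction of the most common pattern" rule stands in for the comparison with neighboring branches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One abstracted pattern and how often it was seen during training.
   The patterns and counts are made up for illustration. */
struct pattern_count {
    const char *pattern;
    unsigned long count;
};

static const struct pattern_count corpus[] = {
    { "if (VAR == NUM)", 9500 },
    { "if (VAR != NUM)", 4200 },
    { "if (VAR = NUM)",    12 },
};
static const size_t corpus_len = sizeof corpus / sizeof corpus[0];

/* Return how often `pattern` occurred in the corpus, or 0 if never seen. */
unsigned long pattern_frequency(const char *pattern)
{
    assert(pattern != NULL);
    for (size_t i = 0; i < corpus_len; i++)
        if (strcmp(corpus[i].pattern, pattern) == 0)
            return corpus[i].count;
    return 0;
}

/* Flag `pattern` when its count is below `ratio` times the count of the
   most common pattern in the corpus. */
int is_anomalous(const char *pattern, double ratio)
{
    unsigned long most_common = 0;
    for (size_t i = 0; i < corpus_len; i++)
        if (corpus[i].count > most_common)
            most_common = corpus[i].count;
    return (double)pattern_frequency(pattern) < ratio * (double)most_common;
}
```

With a hypothetical threshold of 0.01, `is_anomalous("if (VAR = NUM)", 0.01)` flags the rare assignment pattern, while the two comparison patterns pass.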
As an example of ControlFlag's capabilities, the developers analyzed the source code of the OpenSSL and cURL projects:
- In OpenSSL, the anomalous constructs “(s1 == NULL) ∧ (s2 == NULL)” and “(s1 == NULL) | (s2 == NULL)” were identified, which do not match the commonly used pattern “(s1 == NULL) || (s2 == NULL)”. Anomalies were also found in the expressions “(-2 == rv)” (the minus was a typo) and “BIO_puts(bp, ":") <= 0” (in the context of checking whether the function completed successfully, it should have been “== 0”).
- In cURL, an error that static analyzers had not detected was found in the use of the structure member “s->keepon”, which has a numeric type but was compared with the boolean value TRUE.
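To make the difference between these constructs concrete, here is a small C sketch. It assumes that the “∧” in the first expression stands for C's bitwise XOR operator `^`, and that `TRUE` is a macro defined as 1. The function names are hypothetical and the code is not taken from OpenSSL or cURL.

```c
#include <assert.h>
#include <stddef.h>

#ifndef TRUE
#define TRUE 1   /* assumption: TRUE is a macro with the value 1 */
#endif

/* Commonly used pattern: true when at least one pointer is NULL. */
int either_null(const char *s1, const char *s2)
{
    return (s1 == NULL) || (s2 == NULL);
}

/* Variant with '|': same result here, because both operands are 0 or 1,
   but both sides are always evaluated (no short-circuit). */
int either_null_bitwise(const char *s1, const char *s2)
{
    return (s1 == NULL) | (s2 == NULL);
}

/* Variant with '^' (assumed reading of "∧"): true only when exactly one
   pointer is NULL, so it disagrees when both are NULL. */
int exactly_one_null(const char *s1, const char *s2)
{
    return (s1 == NULL) ^ (s2 == NULL);
}

/* Comparing a numeric field with TRUE matches only the value 1. */
int keepon_equals_true(int keepon)
{
    return keepon == TRUE;
}

/* Testing for non-zero treats every set bit pattern as true. */
int keepon_is_set(int keepon)
{
    return keepon != 0;
}

/* Self-check showing where the variants agree and disagree. */
void check_variants(void)
{
    assert(either_null(NULL, NULL) == 1);
    assert(either_null_bitwise(NULL, NULL) == 1);
    assert(exactly_one_null(NULL, NULL) == 0);
    assert(exactly_one_null("a", NULL) == 1);
    assert(keepon_equals_true(2) == 0);
    assert(keepon_is_set(2) == 1);
}
```

The last two assertions show why comparing a numeric field with TRUE is suspicious: a value of 2 is non-zero, yet it does not equal 1.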
Among the features of ControlFlag 1.0 are full support for standard patterns of the C language and the ability to detect anomalies in conditional “if” expressions. For example, when analyzing the code fragment “if (x = 7) y = x;”, the system determines that an “if” statement usually uses the “variable == number” construct to compare numeric values, so the “variable = number” inside the “if” expression is very likely a typo. The package includes a script that downloads existing C repositories from GitHub and uses them to build the model. Ready-made models are also available, so you can start checking code right away.
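The following C sketch shows how the two forms of the “if” fragment behave at run time. The function names are hypothetical. The assignment is wrapped in a second pair of parentheses only so that compilers which warn about assignments used as conditions (for example GCC's -Wparentheses) still accept the example.

```c
#include <assert.h>

/* Typo variant: "x = 7" assigns 7 to x, and the condition is the assigned
   value, which is non-zero, so the branch is always taken. */
int copy_if_seven_typo(int x)
{
    int y = 0;
    if ((x = 7))   /* the fragment in the text is "if (x = 7)" */
        y = x;
    return y;
}

/* Intended variant: "x == 7" compares without modifying x. */
int copy_if_seven(int x)
{
    int y = 0;
    if (x == 7)
        y = x;
    return y;
}

/* Self-check: the typo variant returns 7 for any input. */
void check_if_variants(void)
{
    assert(copy_if_seven_typo(3) == 7);
    assert(copy_if_seven(3) == 0);
    assert(copy_if_seven(7) == 7);
}
```

Both functions compile, which is why a typo of this kind can go unnoticed without a warning or a statistical check.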
Source: opennet.ru