How to add checks to NoVerify without writing a single line of Go code

A killer feature has appeared in the NoVerify static analyzer: a declarative way to describe inspections that does not require Go programming and code compilation.







To intrigue you, I'll show you a description of a simple but useful inspection:







/** @warning duplicated sub-expressions inside boolean expression */ $x && $x;
      
      





This inspection finds all logical &&



expressions where the left and right operand are identical.







NoVerify is a static analyzer for PHP written in Go . You can read about it in the article โ€œ NoVerify: Linter for PHP from the VKontakte Team โ€. And in this review I will talk about new functionality and how we came to it.













Background



When even for a simple new check you need to write a few dozen lines of code on Go, you start to wonder: is it possible otherwise?







On Go, we have written the type inference, the entire pipeline of the linter, the metadata cache, and many other important elements without which NoVerify is impossible. These components are unique, but tasks like โ€œprohibiting the invocation of function X with a set of arguments Yโ€ do not. Just for such simple tasks the mechanism of dynamic rules has been added.







Dynamic rules allow you to separate complex internals from solving typical problems. The definition file can be stored and versioned separately - it can be edited by people who are not related to the development of NoVerify itself. Each rule implements a code inspection (which we will sometimes call a check).







Yes, if we have a language for describing these rules, you can always write a semantically incorrect template or ignore some type restrictions - and this leads to false positives. Nevertheless, the data race or dereferencing of the nil



pointer through the language of the rules is not entered.







Template Description Language



The description language is syntactically compatible with PHP. This simplifies its study, and also makes it possible to edit rules files using the same PhpStorm.







At the very beginning of the rules file, it is recommended to insert a directive soothing your favorite IDE:







 <?php /** *      , *        PHP-. * * @noinspection ALL */ // ...  โ€”   .
      
      





My first experiment with syntax and possible filters for templates was phpgrep . It may be useful on its own, but inside NoVerify it has become even more interesting, because now it has access to type information.







Some of my colleagues have already tried phpgrep in their work, and this was another argument in favor of choosing just such a syntax .







Phpgrep itself is a gogrep adaptation for PHP (you may also be interested in cgrep ). Using this program, you can search for code through syntax templates .







An alternative would be the structural search and replace (SSR) syntax from PhpStorm. The advantages are obvious - this is an existing format, but I found out about this feature after I implemented phpgrep. You can, of course, provide a technical explanation: there is syntax incompatible with PHP and our parser will not master it, but this convincing "real" reason was discovered after writing the bike.







In fact, there was another option



It could be required to display a template with PHP code almost one to one - or go the other way: invent a new language, for example with the syntax of S-expressions .







 PHP-like Lisp-like ----------------------------- $x = $y | (expr = $x $y) fn($x, 1) | (expr call fn $x 1)          : (or (expr == (type string (expr)) (expr)) (expr == (expr) (type string (expr))))
      
      





In the end, I thought that the readability of the templates is still important, and we can add filters through the phpdoc attributes.







clang-query is an example of a similar idea, but it uses more traditional syntax.










We create and run our own diagnostics!



Let's try to implement our new diagnostics for the analyzer.







To do this, you need installed NoVerify. Take the binary release if you do not have a Go-toolchain in the system (if you have one, you can compile everything from the source).







If you do not install NoVerify, you can continue to read further, but pretend to reproduce the steps listed and admire the result!



Formulation of the problem



PHP has many interesting functions, one of them is parse_str . Her signature:







 //   encoded_string,     //   URL,      //   (  ,    result). parse_str ( string $encoded_string [, array &$result ] ) : void
      
      





You will understand what is wrong here if you look at this example from the documentation:







 $str = "first=value&arr[]=foo+bar&arr[]=baz"; parse_str($str); echo $first; // value echo $arr[0]; // foo bar echo $arr[1]; // baz
      
      





Mmm, the parameters from the string were in the current scope. To avoid this, we will require in our new test to use the second parameter of the function, $result



, so that the result is written to this array.







Create your own diagnostics



Create the myrules.php



file:







 <?php /** @warning parse_str without second argument */ parse_str($_);
      
      





The rules file in general is a list of expressions at the top level, each of which is interpreted as a phpgrep template. A special phpdoc comment is expected for each such template. Only one attribute is required - an error category with a warning text.







There are now four levels in total: error



, warning



, info



and maybe



. The first two are critical: the linter will return a non-zero code after execution if at least one of the critical rules works. After the attribute itself there is a warning text that will be issued by the linter in case the template is triggered.







The template we wrote uses $_



- this is an unnamed template variable. We could name it, for example, $x



, but since we are not doing anything with this variable, we can give it an โ€œemptyโ€ name. The difference between template variables and PHP variables is that the former coincide with absolutely any expression, and not just with a โ€œliteralโ€ variable. This is convenient: we often need to look for unknown expressions, rather than specific variables.







Starting a new diagnostic



Create a small test file for debugging, test.php



:







 <?php function f($x) { parse_str($x); //      }
      
      





Next, run NoVerify with our rules on this file:







 $ noverify -rules myrules.php test.php
      
      





Our warning will look something like this:







 WARNING myrules.php:4: parse_str without second argument at test.php:4 parse_str($x); ^^^^^^^^^^^^^
      
      





The name of the default check is the name of the rules file and the line that defines this check. In our case, this is myrules.php:4



.







You can set your name using the @name <name>



attribute.







@Name example



 /** * @name parseStrResult * @warning parse_str without second argument */ parse_str($_);
      
      





 WARNING parseStrResult: parse_str without second argument at test.php:4 parse_str($x); ^^^^^^^^^^^^^
      
      





Named rules succumb to the laws of other diagnostics:







  • Can be disabled via -exclude-checks



  • -critical



    level can be redefined via -critical








Work with types



The previous example is good for hello world - but often we need to know the types of expressions in order to reduce the number of diagnostic operations







For example, for the in_array function , we ask for the argument $strict=true



when the first argument ( $needle



) is of string type.







For this we have result filters.







One such filter is @type <type> <var>



. It allows you to discard everything that does not fit the enumerated types.







 /** * @warning 3rd arg of in_array must be true when comparing strings * @type string $needle */ in_array($needle, $_);
      
      





Here we gave the name of the first argument to the in_array



call to bind a type filter to it. A warning will be issued only when the type of $needle



is string



.







Filter sets can be combined with the @or



operator:







 /** *     -. * * @warning strings must be compared using '===' operator * @type string $x * @or * @type string $y */ $x == $y;
      
      





In the example above, the pattern will only match those ==



expressions, where any of the operands is of type string



. We can assume that without @or



all filters are combined through @and



, but this is not necessary explicitly.







Limit the scope of diagnosis



For each test, you can specify @scope <name>



:









Suppose we want to report return



outside the function body. In PHP, this sometimes makes sense - for example, when a file is connected from a function ... But in this article we condemn this.







 /** * @warning don't use return outside of functions * @scope root */ return $_;
      
      





Let's see how this rule will behave:







 <?php function f() { return "OK"; } return "NOT OK"; // Gives a warning class C { public function m() { return "ALSO OK"; } }
      
      





Similarly, you can make a request to use *_once



instead of require



and include



:







 /** * @maybe prefer require_once over require * @scope root */ require $_; /** * @maybe prefer include_once over include * @scope root */ include $_;
      
      





When matching patterns, brackets are not taken into account quite consistently. The pattern (($x))



will not find โ€œall expressions in double bracketsโ€, but simply any expressions, ignoring the brackets. However, $x+$y*$z



and ($x+$y)*$z



behave as they should. This feature comes from the difficulties of working with tokens (



and )



, but there is a chance that the order will be restored in one of the next releases.


Grouping Templates



When duplication of phpdoc comments appears on templates, the ability to combine templates comes to the rescue.







A simple example to demonstrate:







It was It became (with grouping)
 / ** @maybe don't use exit or die * /
 die ($ _);

 / ** @maybe don't use exit or die * /
 exit ($ _);
 / ** @maybe don't use exit or die * /
 {
   die ($ _);
   exit ($ _);
 }


Now imagine how unpleasant it would be to describe a rule in the following example without this feature!







 /** * @warning don't compare arrays with numeric types * @type array $x * @type int|float $y * @or * @type int|float $x * @type array $y */ { $x > $y; $x < $y; $x >= $y; $x <= $y; $x == $y; }
      
      





The recording format specified in the article is just one of the proposed options. If you want to participate in the choice, then you have the opportunity: you need to put +1 to those offers that you like most than others. For more details, click here .







How dynamic rules are integrated









At the time of launch, NoVerify tries to find the rules file that is specified in the rules



argument.







Next, this file is parsed as a regular PHP script, and a set of rule objects with phpgrep templates bound to them is collected from the resulting AST .







Then the analyzer starts the work according to the usual scheme - the only difference is that for some checked sections of code it starts a set of bound rules. If the rule is triggered, a warning is displayed.







Success is considered to be the successful matching of the phpgrep template and the passage of at least one of the filter sets (they are separated by @or



).







At this stage, the rules mechanism does not significantly slow down the operation of the linter, even if there are a lot of dynamic rules.







Matching Algorithm



With the naive approach for each AST node, we need to apply all the dynamic rules. This is a very inefficient implementation because most of the work will be wasted: many templates have a specific prefix by which we can cluster the rules.







This is similar to the idea of parallel matching , but instead of honestly building the NFA, we only โ€œparallelizeโ€ the first step of the calculations.







Consider this with an example with three rules:







 /** @warning duplicated then/else parts of ternary */ $_ ? $x : $x; /** @warning don't call explode with delim="" */ explode("", ${"*"}); /** @maybe suspicious empty body of the if statement */ if ($_);
      
      





If we have N elements and M rules, with a naive approach we have N * M operations to perform. In theory, this complexity can be reduced to linear and get O(N)



- if you combine all the patterns into one and perform the match as it does, for example, the regexp package from Go.







However, in practice, I have so far focused on the partial implementation of this approach. It will allow the rules from the file above to be divided into three categories, and to those AST elements to which no rule corresponds, to assign a fourth, empty category. Due to this, no more than one rule is executed for each element.







If we have thousands of rules and we will feel a significant slowdown, the algorithm will be finalized. In the meantime, the simplicity of the solution and the resulting acceleration suit me.







The torment of choice, or A little about the @type



form



Task: to select good syntax for filters within phpdoc annotations.



The current syntax duplicates @var



and @var



, but we may need new operators, for example, "type is not equal." Imagine how it might look.







We have at least two important priorities:







  1. The readable and concise syntax of annotations.
  2. The highest possible support from the IDE without extra effort.


There is a php-annotations plugin for PhpStorm, which adds auto-completion, transition to annotation classes and other usefulness for working with phpdoc comments.







Priority (2) in practice means that you make decisions that do not contradict the expectations of the IDE and plugins. For example, you can make annotations in a format that the php-annotations plugin can recognize:







 /** * Type is a filter that checks that $value * satisfies the given type constraints. * * @Annotation */ class Filter { /** Variable name that is being filtered */ public $value; /** Check that value type is equal to $type */ public $type; /** Check that value text is equal to $text */ public $text; }
      
      





Then applying a filter to types would look something like this:







 @Type($needle, eq=string) @Type($x, not_eq=Foo)
      
      





Users could go to the definition of Filter



, they would be prompted with a list of possible parameters (type / text / etc).







Alternative recording methods, some of which were suggested by colleagues:







 @type string $needle @type !Foo $x @type $needle == string @type $x != Foo @type(==) string $needle @type(!=) Foo $x @type($needle) == string @type($x) != Foo @filter type($needle) == string @filter type($x) != Foo
      
      





Then we got a little distracted and forgot that it was all inside phpdoc, and this appeared:







 (eq string (typeof $needle)) (neq Foo (typeof $x))
      
      





Although the option with postfix recording for fun was also sounded. A language for describing type and value constraints might be called sixth:







 @eval string $needle typeof = @eval Foo $x typeof <>
      
      





The search for the best option is still not finished ...







Extensibility Comparison with Phan



As one of the advantages of Phan, in the article " Static analysis of PHP-code on the example of PHPStan, Phan and Psalm " extensibility is indicated.







Here is what was implemented in the sample plugin:







We wanted to evaluate how ready our code is for PHP 7.3 (in particular, to find out if it has case-insensitive-constants). We were almost sure that there were no such constants, but anything could happen in 12 years - it should be checked. And we wrote a plugin for Phan that would swear if the third parameter were used in define ().

This is how the plugin code looks (formatting is optimized for width):







 <?php use Phan\AST\ContextNode; use Phan\CodeBase; use Phan\Language\Context; use Phan\Language\Element\Func; use Phan\PluginV2; use Phan\PluginV2\AnalyzeFunctionCallCapability; use ast\Node; class DefineThirdParamTrue extends PluginV2 implements AnalyzeFunctionCallCapability { public function getAnalyzeFunctionCallClosures(CodeBase $code_base) { $def = function(CodeBase $cb, Context $ctx, Func $fn, $args) { if (count($args) < 3) { return; } $this->emitIssue( $cb, $ctx, 'PhanDefineCaseInsensitiv', 'define with 3 arguments', [] ); }; return ['define' => $def]; } } return new DefineThirdParamTrue();
      
      





And here is how this could be done in NoVerify:







 <?php /** @warning define with 3 arguments */ define($_, $_, $_);
      
      





We wanted to achieve roughly the same result - so that trivial things could be done as simply as possible.







Conclusion





Links, useful materials



Important links are collected here, some of which may already have been mentioned in the article, but for clarity and convenience, I collected them in one place.









If you need more examples of rules that can be implemented, you can take a peek at NoVerify tests .








All Articles