Well, if by "favorite language" you mean Russian, English and so on, then that is a topic for a different hub. But if it is a programming or markup language, then of course you should write the analyzer yourself! At first glance this looks very difficult, but fortunately there are ready-made multilingual tools to which it is relatively easy to add support for a new language. Today I will show how to add Modelica support to the PMD analyzer with fairly little effort.
By the way, do you know what can degrade the quality of a code base built from a sequence of seemingly perfect pull requests? Contributors copying pieces of existing project code into their patches instead of properly abstracting them. In a way, such a banality is even harder to catch than poor-quality code: the copied code is high-quality and already thoroughly debugged, so a local review is not enough here, you need to keep the entire code base in your head, and that is not easy for a human... So: if adding full Modelica support to PMD (without writing language-specific rules), up to the point where it "can run primitive checks", took me about a week, then support for the copy-paste detector alone can often be added in a day!
What is this Modelica, anyway?
Modelica is, as the name suggests, a language for writing models of physical systems. Actually, not only physical ones: you can describe chemical processes, the quantitative behavior of animal populations, and so on - anything described by a system of differential equations of the form `der(X) = f(X)`, where `X` is the vector of unknowns. Imperative pieces of code are also supported. Partial differential equations are not supported directly, but you can split the studied domain into cells (as we would probably do in some general-purpose language anyway) and then write the equations for each cell, reducing the problem to the previous one. The trick of Modelica is that solving this `der(X) = f(X)` is the compiler's job: you can simply change the solver in the settings, the equations do not have to be linear, and so on. In short, there are pros (I wrote the formula straight from the textbook, and it worked) and cons (with more abstraction we get less control). An introduction to Modelica is a topic for a separate article (several have already appeared on Habr), or even a whole series; today it interests me as a standard that is open and has several implementations, but is, alas, still rather rough around the edges.
In addition, Modelica, on the one hand, has static typing (which will help us write some meaningful analyses faster); on the other hand, when instantiating a model the compiler is not required to fully check the entire library (so a static analyzer is very useful for catching "dormant" bugs). Finally, unlike, say, C++, for which there is a cloud of static analyzers and compilers with beautiful and, most importantly, detailed error diagnostics (see C++ templates), Modelica compilers still periodically produce an Internal compiler error, which means there is room to help the user even with a fairly simple analyzer.
What is PMD?
Let me answer with a story. Once I wanted to make a small pull request to the development environment for OpenModelica. Looking at how model saving is handled in another part of the code, I noticed a small and not entirely clear piece of four lines that maintained some kind of invariant. Not understanding which editor internals it interacts with, but realizing that from the point of view of this piece of code my task was completely identical, I simply extracted it into a function so that I could reuse it without breaking it. The maintainer said: wonderful, now just replace this code with a function call in the remaining twenty places... I decided not to get involved just yet and simply made another copy, noting that everything should later be cleaned up in one go, without mixing it into the current patch. Googling around, I found the Copy-Paste Detector (CPD), part of the PMD static analyzer, which supports even more languages than the analyzer itself. Having pointed it at the OMEdit code base, I expected to see those two dozen four-line pieces. I did not see them (each of them simply did not exceed the token-count threshold), but I did see, for example, a repetition of nearly fifty lines of C++ code. As I already said, it is unlikely that the maintainer simply copied a gigantic piece from another file. But he could easily have missed this in a PR, because the code, by definition, already met all the project's standards! When I shared this observation with the maintainer, he agreed that it should be cleaned up as a separate task.
Accordingly, the Program Mistake Detector (PMD) is an easily extensible static analyzer. It may not compute the set of values a variable can take (although who knows...), but to add rules to it you do not even need to know Java or change its code at all! The point is that the first thing it does, unsurprisingly, is build an AST of the source files. And what does a source code parse tree resemble? An XML parse tree! So rules can be described simply as XPath queries: whatever matches gets a warning. There is even a graphical debugger for the rules! More complex rules can, of course, be written directly in Java as visitors over the AST.
Consequence: PMD can be used not only for the harsh universal rules that harsh Java programmers have committed into the analyzer's code, but also for your local coding style - you can even push your own local ruleset.xml into each repository!
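To give a feel for the Java route, here is a minimal sketch of a visitor-based rule for the Java language module, roughly in the PMD 6 style. The rule itself (its name and the return-count threshold) is made up purely for illustration, and the exact signatures should be treated as approximate; the point is only the general shape: extend the per-language abstract rule, override `visit` for the node types you care about, call `addViolation`.

```java
import net.sourceforge.pmd.lang.java.ast.ASTMethodDeclaration;
import net.sourceforge.pmd.lang.java.ast.ASTReturnStatement;
import net.sourceforge.pmd.lang.java.rule.AbstractJavaRule;

// Illustrative rule: flag methods containing "too many" return statements.
// Registering it (the class attribute in a ruleset.xml) is omitted here.
public class TooManyReturnsRule extends AbstractJavaRule {

    private static final int LIMIT = 3; // made-up threshold, purely for the example

    @Override
    public Object visit(ASTMethodDeclaration method, Object data) {
        // Count return statements anywhere inside this method
        int returns = method.findDescendantsOfType(ASTReturnStatement.class).size();
        if (returns > LIMIT) {
            addViolation(data, method); // report the violation at the method node
        }
        return super.visit(method, data); // continue into nested declarations
    }
}
```

A check this simple would normally fit into a one-line XPath rule in a ruleset and never need Java at all; the Java form earns its keep once the logic stops fitting into a single query.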
Level 1: finding copy-paste automatically
In principle, adding support for a new language to CPD is often very simple. I see no point in retelling the "how to" documentation: it is clear, well structured and step-by-step, and retelling it would just be a game of telephone. Instead, I will describe what awaits you (TL;DR: nothing scary):
- The analyzer (both PMD and CPD) is developed on GitHub in the pmd/pmd repository.
- The visual rule debugger has been moved to a separate pmd/pmd-designer repository. Note that its ready-made jar is automatically embedded into the PMD binary distribution that Maven builds for you in the main repository, so you do not need to clone pmd-designer specially for this.
- The project has developer documentation. The part I read is very detailed; true, slightly outdated, but that is fixed by a second pull request :)
I will warn you that I develop on Ubuntu. On Windows everything should also work just fine, up to a slightly different way of launching the tools.
So, to add a new language to CPD, you just need to ...
- ATTENTION: if you want full PMD support before the release of PMD 7, it is better to go straight to Level 2: proper support for the easy path through a ready-made Antlr grammar will, according to rumors, appear only in version 7 itself, so for now you would just be spending extra time (though not much of it...)
- Fork the pmd/pmd repository.
- Find a ready-made grammar for your language in antlr/grammars-v4; of course, if the language is an in-house one you will have to write it yourself, but for Modelica, for example, one was already there. Here, naturally, you need to observe the licensing formalities: I am not a lawyer, but at the very least you should credit the source you copied from.
- After that, create the `pmd-<your language name>` module, add it to the Maven build, and put the grammar file there. Then, after reading two pages of undemanding documentation, adapt the build script from the Go module, a couple of classes that load the module through reflection, and a few other small things (a minimal sketch of those classes follows this list)...
- Fix the reference output in one of the tests, because CPD now supports one more language! How do you find that test? Easily: it will be glad to break the build.
- PROFIT! It really is that simple, provided a ready-made grammar exists.
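For orientation, here is roughly what those couple of classes look like, patterned on the existing Go module as it was in PMD 6. I am quoting class and method names from memory, so treat every name and signature below as an assumption and check it against the pmd-go sources; `ModelicaLexer` stands for whatever lexer class Antlr generates from `modelica.g4`.

```java
import net.sourceforge.pmd.cpd.AbstractLanguage;
import net.sourceforge.pmd.cpd.AntlrTokenizer;
import net.sourceforge.pmd.cpd.SourceCode;
import net.sourceforge.pmd.lang.antlr.AntlrTokenManager;
import org.antlr.v4.runtime.CharStream;

// CPD "language" descriptor: display name, the terse name used with --language,
// the tokenizer, and the file extensions CPD should pick up.
public class ModelicaLanguage extends AbstractLanguage {
    public ModelicaLanguage() {
        super("Modelica", "modelica", new ModelicaTokenizer(), ".mo");
    }
}

// Tokenizer wrapping the Antlr-generated lexer; signatures paraphrased from the
// Go module and may differ slightly in your PMD version.
class ModelicaTokenizer extends AntlrTokenizer {
    @Override
    protected AntlrTokenManager getLexerForSource(SourceCode sourceCode) {
        CharStream charStream = AntlrTokenizer.getCharStreamFromSourceCode(sourceCode);
        return new AntlrTokenManager(new ModelicaLexer(charStream), sourceCode.getFileName());
    }
}
```

On top of this there is only the registration glue (so that the reflection-based loading mentioned in the list can find the new language) and the fixed reference output in that one test.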
Now, from the root of the pmd repository, you can run `./mvnw clean verify`, after which `pmd-dist/target` will contain, among other things, the binary distribution as a zip archive; unzip it and, from the unpacked directory, run `./bin/run.sh cpd --minimum-tokens 100 --files /path/to/source/dir --language <your language name>`. In principle, you can run `../mvnw clean verify` from inside your new module, which drastically speeds up the build, but then you will have to put the freshly built jar into the unpacked binary distribution yourself (built once, for example, right after registering the new module).
Level 2: finding errors and violations of the style guide
As I said, full Antlr support is promised in PMD 7. If, like me, you do not want to sit around waiting for that release, you will have to get a description of the language grammar in JJTree format from somewhere. Perhaps you could wire up support for an arbitrary parser yourself: the documentation says it is possible, but does not explain how exactly... I simply took `modelica.g4` from the same repository of Antlr grammars as a basis and manually reworked it into JJTree. Naturally, if your grammar turns out to be a derivative of an existing one, once again credit the source, check license compliance, and so on.
By the way, for someone well versed in all kinds of parser generators, what follows will hardly be a surprise. Before this, the only parsing tools I had seriously used were hand-written regular expressions and parser combinators in Scala. So an obvious thing saddened me at first: I would, of course, get an AST out of `modelica.g4`, but not a very clear or "usable" one: it has clouds of extra nodes, and if you look only at the nodes rather than the tokens, it is not always obvious, for example, where the `then` branch ends and `else` begins.
Again, I will not retell the JJTree documentation or a good tutorial, this time not because the original shines with detail and clarity, but because I have not figured it all out myself, and documentation retold incorrectly but with confidence is clearly worse than no retelling at all. Instead, here are a few clues I picked up along the way:
- First, the JavaCC parser description assumes Java code inserts that end up in the generated parser.
- Do not be confused by the fact that, when building the AST, syntax like `[ Expression() ]` means an optional part, while in the context of token definitions it means a character class, as in a regular expression. As far as I understood the PMD developers' explanation, these are similar-looking constructs that simply ended up with different meanings - legacy, sir...
- For the root node (in my case `StoredDefinition`) you must specify its actual type instead of `void` (i.e. `ASTStoredDefinition`).
- Using the `#void` syntax after a node name you can hide it from the resulting tree (that is, it will only affect what counts as valid source and how the other nodes end up nested).
- Using a construct of the form `void SimpleExpression() #SimpleExpression(>1)` you can say that a node should appear in the resulting AST only if it has more than one child. This is very convenient when describing expressions with many operators of different precedence: from the parser's point of view a lonely constant `1` would otherwise become something like `LogicExpression(AdditiveExpression(MultiplicativeExpression(Constant(1))))` - insert here all n levels of operator precedence - but the analyzer code will see just `Constant(1)`.
- A node has a standard field `image` (see `getImage`, `setImage`) that usually holds the "essence" of the node: for example, for a node corresponding to a local variable name, it is logical to copy the matched identifier token into `image` (by default all tokens are thrown away from the tree, so it is worth copying the information they carry - at least when it is something variable rather than just a keyword).
- LOOKAHEAD is a separate song entirely; a whole chapter of the documentation is devoted to it:
  - roughly speaking, in JavaCC, once you have entered a node you cannot backtrack and try to parse differently, but you can look ahead and decide in advance whether to enter it or not
  - in the simplest case, upon seeing a JavaCC warning, you just write `LOOKAHEAD = n` in the options header - and then get mysterious parsing errors, because in the general case this apparently cannot solve every problem (well, unless you set it to a few billion tokens and effectively preview everything, and even then it is not a given that it works that way...)
  - in front of an embedded node's name you can state explicitly how many tokens are enough to make the final decision at that point
  - if no such fixed number of tokens exists in the general case, you can say "go here if, starting from this point, we first manage to match such-and-such a prefix" - followed by the usual description of the subtree
  - be careful: in the general case JavaCC cannot check that your `LOOKAHEAD` directives are correct - it trusts you, so at least work out for yourself a mathematical argument for why that lookahead is enough...
Now that you have a JJTree-format description of the language grammar, these simple 14 steps will help you add language support. Most of them boil down to "create a class analogous to the one in the java or vm implementation, but adapted". I will note only the typical quirks; some of them will end up in the main documentation if my documentation pull request is accepted (a sketch of one such "analogous" class follows this list):
- By commenting out the removal of all generated files in the `alljavacc.xml` build script (located in your new module), you can move them into the source tree from `target/generated-sources`. But better not. Most likely only a small part of them will need changes, so it is better to arrange for deleting just a few: once you see the need to change a default implementation, copy it into the source tree, add it to the list of deleted files, rebuild, and from now on you maintain that specific file. Otherwise it will be hard to figure out what exactly was changed, and the maintenance can hardly be called pleasant.
- Now that you have an implementation of the "main" PMD mode, you can easily hang a CPD binding onto your JJTree parser as well, by analogy with Java or any other available implementation.
- Remember to implement the method that returns the node name for XPath queries. With the default implementation you get either infinite recursion (the node name is derived via `toString` and vice versa) or something equally unhelpful; among other things, this makes it impossible to inspect the tree in the PMD Designer, and without that, debugging the grammar is a sad affair.
- Part of the component registration is done by adding text files with the fully qualified class names of the entry points to `META-INF/services`.
- Whatever can be described about a rule declaratively (for example, a detailed description of the check and error examples) lives not in code but in `category/<language name>/<ruleset>.xml`; in any case, you will have to register your rules there...
- ...but the tests apparently rely heavily on some, possibly homegrown, auto-discovery mechanism, therefore:
  - if you are told "add a trivial test for each version of the language", better not argue with "I don't need that, it works as is" - this may well be that auto-discovery mechanism at work
  - if you see a test for a specific rule whose class body contains only the comment `// no additional unit tests`, those are not the actual tests: the tests simply live in the resources as an XML description of the input data and the expected analyzer reactions, a whole bunch at once - several correct and several incorrect examples.
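As an example of such an "analogous" class, here is approximately what the language module registration looks like for the full PMD mode. I am patterning it on how the Java module is written in PMD 6; the `BaseLanguageModule` constructor arguments are reproduced from memory and `ModelicaRuleChainVisitor` is a placeholder name, so double-check both against an existing module before copying anything.

```java
import net.sourceforge.pmd.lang.BaseLanguageModule;

// Registers the language with PMD proper (not just CPD). Discovered at runtime
// through the META-INF/services mechanism mentioned in the list above.
public class ModelicaLanguageModule extends BaseLanguageModule {

    public static final String NAME = "Modelica";
    public static final String TERSE_NAME = "modelica";

    public ModelicaLanguageModule() {
        // name, short name, terse name, rule chain visitor, file extension
        // (check whether your PMD version expects the extension with or without a dot)
        super(NAME, null, TERSE_NAME, ModelicaRuleChainVisitor.class, "mo");
        // A single "default" language version; its handler (ModelicaHandler, shown
        // further below) wires in the parser, the XPath support, the symbol facade, etc.
        addVersion("", new ModelicaHandler(), true);
    }
}
```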
A small but important quest: polishing up the PMD Designer
You could probably debug everything without the visualizer. But why? First, finishing it off is very simple. Second, it will greatly help your users who are not familiar with Java: easily and simply (insofar as those words apply to XPath at all), or at least without recompiling PMD, they will be able to describe simple patterns of what they do not like (in the simplest case, a style-guide rule like "the name of a model package always starts with a lowercase p").
Unlike other errors, which show up immediately, problems with the PMD Designer are rather insidious. Say you have already figured out that the "Java" label on the right side of the menu is not a button but a language selection drop-down (O_o), in which Modelica has already appeared, because a new module with registered entry points is on the classpath. So you choose your language, load a test file, and see the AST. It looks like a victory, but somehow everything is black and white, and it would be nice if the selected subtree were highlighted in the text - actually no, the highlight is there, it just updates erratically - and also, why did they not think of highlighting the matches found by an XPath query... Already estimating the amount of work, you start thinking about the next pull request, and then you happen to switch the language to Java and load some source file of PMD itself... Oh! It is colored! And the subtree highlight works! Er... and it turns out it highlights the found matches just fine and prints pieces of text in the box to the right of the query... Apparently, when an exception occurs in JavaFX code during interface rendering, it silently interrupts the rendering without printing anything to the console...
In short, you just need to add a small class that highlights syntax based on regular expressions. In my case it was `net.sourceforge.pmd.util.fxdesigner.util.codearea.syntaxhighlighting.ModelicaSyntaxHighlighter`, which then needs to be registered in the `AvailableSyntaxHighlighters` class. Note that both of these changes happen in the pmd-designer repository, and the artifact built from it has to be placed into your binary distribution.
In the end, it looks something like this (GIF taken from README in the PMD Designer repository):
Interim results
If you have completed all of these levels, then you now have:
- a copy-paste detector
- a rules engine
- a visualizer for debugging the AST and bringing it into a form convenient for analysis (as we have seen, not all grammars of the same language are equally useful!)
- the same visualizer for debugging XPath rules, which your users can write without recompiling PMD or knowing Java at all (XPath, of course, is not exactly BASIC either, but at least it is a standard rather than some homegrown query language)
I hope you also now appreciate that the grammar has become a stable API of your language support: do not change it (or rather, the source-to-AST mapping it describes) unless absolutely necessary, and if you do change it, announce it as a breaking change. Otherwise users will be upset: most likely not everyone will write tests for their rules, and it is very sad when rules used to check the code and then silently stopped - almost like a backup that turns out to have quietly broken a year ago...
The story does not end there: at least some useful rules have to be written.
But that is not all: PMD natively supports scopes and declarations. Every AST node has a scope associated with it: the body of a class, function, loop... the whole file, at worst! And every scope holds the list of definitions (declarations) it directly contains. As elsewhere, the suggested approach is to implement this by analogy with the other languages, Modelica in my case (though at the time of writing the logic in my pull request is, frankly, raw). Scopes and declarations are filled in by a separate visitor, conventionally called ScopeAndDeclarationFinder, which runs over the freshly built tree before any rules do; the rules themselves then see an already annotated, effectively read-only AST. In the language version handler this pass is hooked up roughly like this:
```java
// Imports come from pmd-core; SymbolFacade and ASTStoredDefinition are classes of the Modelica module itself.
import net.sourceforge.pmd.lang.AbstractLanguageVersionHandler;
import net.sourceforge.pmd.lang.VisitorStarter;
import net.sourceforge.pmd.lang.ast.Node;

public class ModelicaHandler extends AbstractLanguageVersionHandler {
    // ...
    @Override
    public VisitorStarter getSymbolFacade() {
        return new VisitorStarter() {
            @Override
            public void start(Node rootNode) {
                new SymbolFacade().initializeWith((ASTStoredDefinition) rootNode);
            }
        };
    }
}
```
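To make the previous paragraph a bit less abstract, here is the general shape of such a visitor. Every Modelica-specific name below (`ModelicaParserVisitorAdapter`, `ASTClassDefinition`, `ASTComponentDeclaration`, `ModelicaScope` and friends) is a placeholder standing in for whatever the module actually defines; only the overall pattern, a stack of scopes filled in during a single top-down pass, is borrowed from PMD's existing language modules.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a ScopeAndDeclarationFinder-style pass; all Modelica-specific
// type names are placeholders, the pattern itself mirrors pmd-java/pmd-plsql.
public class ScopeAndDeclarationFinder extends ModelicaParserVisitorAdapter {

    // The innermost scope sits on top of the stack while we walk the tree.
    private final Deque<ModelicaScope> scopes = new ArrayDeque<>();

    @Override
    public Object visit(ASTStoredDefinition root, Object data) {
        // The root of the tree gets the outermost, file-level scope.
        ModelicaFileScope fileScope = new ModelicaFileScope(root);
        root.setScope(fileScope);
        scopes.push(fileScope);
        super.visit(root, data);
        scopes.pop();
        return data;
    }

    @Override
    public Object visit(ASTClassDefinition node, Object data) {
        // A class-like node opens a new scope nested inside the enclosing one.
        ModelicaClassScope scope = new ModelicaClassScope(node, scopes.peek());
        node.setScope(scope);            // rules later read this via getScope()
        scopes.push(scope);
        super.visit(node, data);         // children see this scope on the stack
        scopes.pop();
        return data;
    }

    @Override
    public Object visit(ASTComponentDeclaration node, Object data) {
        // A declaration is recorded in the innermost enclosing scope.
        scopes.peek().addDeclaration(new ComponentDeclaration(node));
        return super.visit(node, data);
    }
}
```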
Conclusion
PMD turned out to be genuinely easy to extend. Of course, it will not replace something like the Clang Static Analyzer in terms of analysis depth, but teaching it a new language takes surprisingly little time, and even CPD alone (without writing a single language-specific rule) can already bring real benefit to a project.