What if there is no static analyzer for your favorite language?

Well, if by "favorite language" you mean Russian, English, and so on, that belongs in another hub. But if you mean a programming or markup language, then of course you write the analyzer yourself! At first glance this looks very hard, but fortunately there are ready-made multi-language tools to which support for a new language can be added relatively easily. Today I will show how to add Modelica support to the PMD analyzer in a fairly small amount of time.







By the way, do you know what can degrade the quality of a code base built from a sequence of ideal pull requests? Third-party programmers copying pieces of existing project code into their patches instead of abstracting them properly. You must admit that such a banality is in some ways even harder to catch than poor-quality code: the copied code is high-quality and already thoroughly debugged, so local review is not enough. You have to keep the entire code base in mind, and that is not easy for a human... So: while adding full Modelica support (without creating any specific rules), up to the state of "can run primitive checks", took me about a week, support for the copy-paste detector alone can often be added in a day!







What is this Modelica, anyway?



Modelica is, as the name suggests, a language for writing models of physical systems. In fact, not only physical ones: it can describe chemical processes, the population dynamics of animals, and so on, that is, anything described by systems of differential equations of the form der(X) = f(X), where X is the vector of unknowns. Imperative pieces of code are also supported. Partial differential equations are not supported explicitly, but you can split the domain under study into pieces (as we probably would have done in a general-purpose language anyway) and then write down the equations for each element, reducing the problem to the previous one. The trick of Modelica is that solving this der(X) = f(X) is the compiler's job: you can simply change the solver in the settings, the equation does not have to be linear, and so on. In short, there are pros (I wrote down the formula from the textbook and it worked) and cons (more abstraction means less control). An introduction to Modelica is a topic for a separate article (several have already appeared on Habré), or even a whole series; today it interests me as a standard that is open and has several implementations but is, alas, still rather raw.
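To make "solving der(X) = f(X) is the compiler's job" a little more concrete, here is a minimal Java sketch of what you would otherwise write by hand in a general-purpose language: an explicit Euler integrator. The class name, the step size and the example system (a harmonic oscillator) are my assumptions for illustration only; real Modelica tools use far more sophisticated solvers.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration: explicit Euler integration of der(X) = f(X).
class EulerSketch {
    // Advances the state by `steps` steps of size h, each step doing x <- x + h * f(x).
    static double[] integrate(UnaryOperator<double[]> f, double[] x0, double h, int steps) {
        double[] x = x0.clone();
        for (int i = 0; i < steps; i++) {
            double[] dx = f.apply(x); // derivative evaluated at the current state
            for (int j = 0; j < x.length; j++) {
                x[j] += h * dx[j];
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // Harmonic oscillator: der(pos) = vel, der(vel) = -pos.
        UnaryOperator<double[]> f = x -> new double[] { x[1], -x[0] };
        double[] end = integrate(f, new double[] { 1.0, 0.0 }, 1e-4, 10_000);
        // The exact solution at t = 1 is (cos 1, -sin 1); Euler only approximates it.
        System.out.printf("pos=%.3f vel=%.3f%n", end[0], end[1]);
    }
}
```

In Modelica the right-hand side is all you write; the choice of integration scheme and step size is left to the tool.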







In addition, Modelica has static typing on the one hand (which helps us write some meaningful analysis faster); on the other hand, when instantiating a model the compiler is not required to fully check the entire library (so a static analyzer is very useful for catching "sleeping" bugs). Finally, unlike C++, for which there is a whole cloud of static analyzers and compilers with beautiful and, most importantly, detailed error diagnostics (see C++ template errors), Modelica compilers still periodically produce an Internal compiler error, so there is room to help the user even with a fairly simple analyzer.







What is PMD?



I'll answer with a story. Once I wanted to make a small pull request to the OpenModelica development environment. Looking at how saving a model was handled in another part of the code, I noticed a small, rather opaque piece of four lines that maintained some kind of invariant. I did not understand which editor internals it interacted with, but I realized that from the point of view of that piece of code my task was completely identical, so I simply extracted it into a function in order to reuse it without breaking anything. The maintainer said: wonderful, now just replace this code with a function call in the remaining twenty places... I decided not to get into that just yet and simply made one more copy, noting that later I would need to tidy everything up at once, without mixing it into the current patch. Googling around, I found the Copy-Paste Detector (CPD), a part of the PMD static analyzer that supports even more languages than the analyzer itself. Having run it on the OMEdit code base, I expected to see those two dozen four-line pieces. I did not see them at all (each of them simply fell below the token-count threshold), but I did see, for example, a repetition of nearly fifty lines of C++ code. As I already said, it is unlikely that the maintainer himself simply copied a gigantic piece from another file. But he could easily have missed it in a PR, because the code, by definition, already met all the project's standards! When I shared the observation with the maintainer, he agreed that it should be cleaned up as a separate task.







Accordingly, the Program Mistake Detector (PMD) is an easily extensible static analyzer. It may not compute the set of values a variable can take (although who knows...), but to add rules to it you do not even need to know Java or change its code at all! The thing is that the first thing it does, unsurprisingly, is build ASTs of the source files. And what does a source code parse tree resemble? An XML parse tree! So rules can be described simply as XPath queries: whatever a query matches gets a warning. There is even a graphical debugger for rules! More complex rules can, of course, be written directly in Java as AST visitors.
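The idea "an AST looks like an XML tree, so a rule is an XPath query" can be tried without PMD at all. The sketch below uses only the JDK's javax.xml APIs. The XML document is a hand-written, hypothetical stand-in for an AST (the node and attribute names ClassDefinition and Name are my assumptions, not necessarily PMD's real node names), and the query reports every class whose name starts with a lowercase letter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathRuleSketch {
    // Hypothetical AST dumped as XML (not produced by PMD).
    static final String AST =
        "<StoredDefinition>"
        + "<ClassDefinition Name='Pendulum'/>"
        + "<ClassDefinition Name='spring'/>"
        + "</StoredDefinition>";

    // "Rule": match class definitions whose name starts with a lowercase ASCII letter.
    static final String RULE =
        "//ClassDefinition[contains('abcdefghijklmnopqrstuvwxyz', substring(@Name, 1, 1))]";

    // Returns the Name attribute of every node matched by the XPath rule.
    static List<String> violations(String xml, String rule) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                rule, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getAttributes().getNamedItem("Name").getNodeValue());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath expression or XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(violations(AST, RULE)); // reports the class named "spring"
    }
}
```

In PMD itself the tree is built by the language module and the query lives in a ruleset file, so none of this plumbing has to be written by the rule author.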







Consequence: PMD can be used not only for the stern, universal rules that stern Java programmers have committed into the analyzer's code, but also for a local coding style: you can even put your own local ruleset.xml into each repository!







Level 1: find copy-paste automatically



In principle, adding support for a new language to CPD is often very simple. I see no point in retelling the "how to" documentation: it is very clear, structured and step-by-step. Retelling it would just be a game of broken telephone. I'd better describe what awaits you (TL;DR: nothing scary):









I'll warn you that I am developing on Ubuntu. On Windows everything should work just as well, although the tools are launched in a slightly different way.







So, to add a new language to CPD, you just need to ...









Now, from the root of the pmd repository, you can run ./mvnw clean verify; in pmd-dist/target you will then get, among other things, a binary distribution as a zip archive. Unzip it and, from the unpacked directory, run ./bin/run.sh cpd --minimum-tokens 100 --files /path/to/source/dir --language <your language name>. In principle, you can run ../mvnw clean verify from within your new module, which speeds up the build drastically, but then you have to put the built jar into the right place in the unpacked binary distribution yourself (for example, a distribution built once after registering the new module).
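The --minimum-tokens option above sets how long a repeated run of tokens must be before it is reported, which is why short copied fragments can slip through. Here is a toy Java sketch of that idea. It is not CPD's actual algorithm, only a naive illustration: split the text into tokens and report a duplicate when some window of minimumTokens consecutive tokens occurs at two different positions. The whitespace tokenizer and all names are my assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateSketch {
    // Naive tokenizer (assumption): tokens are whitespace-separated words.
    static List<String> tokenize(String source) {
        return Arrays.asList(source.trim().split("\\s+"));
    }

    // True if some window of `minimumTokens` consecutive tokens occurs at two different positions.
    static boolean hasDuplicate(List<String> tokens, int minimumTokens) {
        Map<List<String>, Integer> firstSeenAt = new HashMap<>();
        for (int i = 0; i + minimumTokens <= tokens.size(); i++) {
            List<String> window = tokens.subList(i, i + minimumTokens);
            if (firstSeenAt.putIfAbsent(window, i) != null) {
                return true; // an equal window was already seen at an earlier position
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("x = a + b ; y = a + b ;");
        System.out.println(hasDuplicate(tokens, 5)); // the run "= a + b ;" repeats
        System.out.println(hasDuplicate(tokens, 6)); // no run of 6 tokens repeats
    }
}
```

With a threshold of 5 the repeated fragment is caught; raise it to 6 and the same copy-paste goes unnoticed, just like my four-line pieces did.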







Level 2: finding errors and violations of the style guide



As I said, full Antlr support is promised in PMD 7. If, like me, you do not want to sit and wait for the release, you will have to get a description of the language grammar in JJTree format from somewhere. Perhaps you could bolt on support for an arbitrary parser yourself: the documentation says it is possible, but does not say exactly how... I simply took modelica.g4 from the same repository of Antlr grammars as a basis and manually reworked it into JJTree. Naturally, if your grammar is a reworking of an existing one, then, again, credit the source, check license compatibility, and so on.







By the way, for someone well versed in all kinds of parser generators this will hardly come as a surprise. I, however, had not seriously used anything before except hand-written regular expressions and parser combinators in Scala. So an essentially obvious thing saddened me at first: I will, of course, get an AST from modelica.g4, but it will not look very clear or "usable". It will have clouds of extra nodes, and if you look only at the nodes and not at the tokens, it is not always clear where, for example, the then branch ends and the else branch begins.







Again, I will not retell the JJTree documentation and a good tutorial. This time, however, not because the original shines with detail and clarity, but because I have not fully figured it out myself, and documentation retold incorrectly but confidently is obviously worse than no retelling at all. I'd better leave a few hints I picked up along the way:









Now that you have a description of the language grammar in JJTree format, these 14 simple steps will help you add language support. Most of them have the form "create a class similar to the implementation for java or vm, but adapted". I will note only the characteristic features; some of them will appear in the main documentation if my documentation pull request is accepted:









A small but important quest: finishing up PMD Designer



Perhaps you can debug everything without a visualizer. But why? First, finishing it up is very simple. Second, it will greatly help your users who are not familiar with Java: they will be able to describe simple patterns of what they do not like easily and simply (if that can be said of XPath at all), or at least without recompiling PMD. In the simplest case this is a style guide rule such as "the name of a model package always starts with a lowercase p".







Unlike other errors, which are immediately visible, problems with PMD Designer are rather insidious. You have, it seems, already figured out that the word Java on the right side of the menu is not a button but a language selection drop-down (O_o), in which Modelica has already appeared, because a new module registering the entry points has shown up on the classpath. So you choose your language, load a test file, and see the AST. It looks like a victory, but the text is somehow black and white, and the selected subtree could have been highlighted in the text (although no, the highlight is there, it just updates crookedly), and how did they not think of highlighting the matches found by XPath... Already estimating the amount of work, you are thinking about the next pull request, but then you accidentally decide to switch the language to Java and load some source file of PMD itself... Oh! It is colored!.. And the subtree highlight works! Uh... and it turns out that it highlights the found matches just fine and writes out pieces of text in the box to the right of the query... It seems that when an exception occurs in the JavaFX code during interface rendering, it interrupts the rendering but is not printed to the console...







In general, you just need to add a tiny class for syntax highlighting based on regular expressions. In my case it was net.sourceforge.pmd.util.fxdesigner.util.codearea.syntaxhighlighting.ModelicaSyntaxHighlighter, which needs to be registered in the AvailableSyntaxHighlighters class. Note that both of these changes are made in the pmd-designer repository, whose build artifact needs to be put into your binary distribution.
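What "syntax highlighting based on regular expressions" boils down to can be sketched with java.util.regex alone. This is not the pmd-designer API (I am deliberately not reproducing its base classes here); the class name, the style names and the tiny keyword list are assumptions. One alternation with named groups is matched repeatedly, and the name of the group that matched gives the style of the span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexHighlightSketch {
    // Comments and strings are listed before keywords, so a keyword inside
    // a comment or a string literal is swallowed by the longer match.
    static final Pattern TOKENS = Pattern.compile(
        "(?<comment>//[^\\n]*)"
        + "|(?<string>\"(?:[^\"\\\\]|\\\\.)*\")"
        + "|(?<keyword>\\b(?:model|end|equation|parameter|extends)\\b)");

    // Returns "style:text" for every highlighted span, in source order.
    static List<String> highlight(String code) {
        List<String> spans = new ArrayList<>();
        Matcher m = TOKENS.matcher(code);
        while (m.find()) {
            String style = m.group("comment") != null ? "comment"
                         : m.group("string") != null ? "string"
                         : "keyword";
            spans.add(style + ":" + m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        // The "end" inside the comment is styled as part of the comment, not as a keyword.
        System.out.println(highlight("model M // end\nend M;"));
    }
}
```

A real highlighter would map each span to a CSS class in the code area instead of returning strings, but the matching loop stays essentially the same.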







In the end, it looks something like this (GIF taken from README in the PMD Designer repository):







PMD Designer at work







Interim summary



If you have completed all of these levels, then you now have:









I hope you also now understand that the grammar is a stable API of your language support implementation: do not change it (or rather, the source-to-AST mapping it describes) unless absolutely necessary, and if you do change it, announce it as a breaking change, otherwise users will be upset. Most likely not everyone will write tests for their rules, and it is very sad when rules used to check the code and then silently stopped, almost like a backup that turns out to have completely broken down a year ago...







The story does not end there: at least some useful rules have to be written.







But that's not all: PMD natively supports scopes and declarations. Each AST node has a scope associated with it: the body of a class, a function, a loop... The whole file, at worst! And every scope holds a list of the definitions (declarations) it directly contains. As elsewhere, the suggestion is to implement this by analogy with another language, for example Modelica (although at the time of writing the logic in my pull request is, frankly, raw). Scopes and declarations are filled in by a visitor, ScopeAndDeclarationFinder, [...] read-only AST [...].







 public class ModelicaHandler extends AbstractLanguageVersionHandler {
     // ...

     @Override
     public VisitorStarter getSymbolFacade() {
         return new VisitorStarter() {
             @Override
             public void start(Node rootNode) {
                 new SymbolFacade().initializeWith((ASTStoredDefinition) rootNode);
             }
         };
     }
 }
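The "every node has a scope, every scope has a list of declarations" structure can be sketched independently of PMD. The class below is not PMD's Scope or NameDeclaration interface; it is a hypothetical minimal version that only shows how a name is resolved by walking from the innermost scope outwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical minimal scope: a parent link plus the declarations made directly in it.
class ScopeSketch {
    private final ScopeSketch parent; // null for the file-level scope
    private final Map<String, String> declarations = new LinkedHashMap<>(); // name -> kind

    ScopeSketch(ScopeSketch parent) {
        this.parent = parent;
    }

    void declare(String name, String kind) {
        declarations.put(name, kind);
    }

    // Looks the name up in this scope first, then in the enclosing scopes.
    Optional<String> resolve(String name) {
        for (ScopeSketch s = this; s != null; s = s.parent) {
            String kind = s.declarations.get(name);
            if (kind != null) {
                return Optional.of(kind);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ScopeSketch file = new ScopeSketch(null);
        file.declare("Pendulum", "model");
        ScopeSketch body = new ScopeSketch(file);
        body.declare("g", "parameter");
        System.out.println(body.resolve("Pendulum")); // found in the enclosing file scope
        System.out.println(file.resolve("g"));        // not visible from the outer scope
    }
}
```

In a real language module the scopes are created by a visitor while it walks the AST, and each node keeps a reference to the scope it belongs to.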
      
      





Conclusion



PMD . , Ā«Ā» Clang Static Analyzer , . , CPD ( ), .







