The history of one small twelve-year project (my first honest, first-hand account of BIRMA.NET)

This project was born of a small idea that came to me sometime at the end of 2007 - an idea that was destined to find its final form only 12 years later (for the moment, at least, although in the author's opinion the current implementation is quite satisfactory).



It all started when, in the course of my official duties at the library, I noticed that the process of entering data from scanned tables of contents of book (and music) publications into the existing database could apparently be greatly simplified and automated, by exploiting the orderliness and repeatability of all the data required for input: the name of the author of an article (if we are talking about a collection of articles), the title of the article (or the subtitle shown in the table of contents), and the page number of the current table-of-contents entry. At first I was almost convinced that a system suitable for this task could easily be found on the Internet. When, to my surprise, no such project turned up, I decided to try to implement it on my own.



After a fairly short time, a first prototype was up and running, and I immediately began using it in my daily work, debugging it along the way on every example that came to hand. Fortunately, at my usual workplace, where I was by no means employed as a programmer, I still got away with the apparent "downtime" during which I worked hard on debugging my brainchild - an almost unthinkable thing in today's realities, with their daily reports on the work done. Polishing the program took no less than a year in total, but even after that the result could hardly be called completely successful: too many concepts that were not entirely clear to me had to be implemented from the start - optional elements that could be skipped; look-ahead handling of elements (so that previously matched elements could be substituted into search results); even my own attempt to implement something like regular expressions (with a syntax all its own).

I must say that before this I had abandoned programming for quite a while (about 8 years, if not more), so the new opportunity to apply my skills to an interesting and genuinely needed task completely captured my attention. It is not surprising that the resulting source code - given my lack of any intelligible approach to its design - quite quickly became an unimaginable mishmash of disparate pieces of C with some C++ elements and aspects of visual programming (I had originally decided to use Borland C++ Builder, "almost Delphi, but in C"). Still, all of it ultimately paid off in the automation of our library's daily work.



At the same time, I decided, just in case, to take training courses for professional software developers. I don't know whether it is possible to learn "to be a programmer" from scratch, but given the skills I already had at the time, I managed to pick up some of the more relevant technologies of the day: C#, Visual Studio for .NET development, and some of the technologies related to Java, HTML and SQL. The training took two years in total and served as the starting point for another project of mine, which eventually stretched out over several years - but that is a topic for a separate publication. Here it is only pertinent to note that I attempted to adapt the experience I already had on the project described here to create a full-fledged windowed application in C# and WinForms implementing the required functionality, and to make it the basis of my upcoming graduation project.

Over time, the idea began to seem to me worthy of being voiced at the annual conferences attended by representatives of various libraries, such as LIBCOM and CRIMEA. The idea, that is - by no means my implementation of it at the time. Back then I also hoped, among other things, that someone would rewrite it using more competent approaches. One way or another, by 2013 I decided to write up a report on my preliminary work and send it to the conference organizing committee with an application for a grant to participate. To my mild surprise, my application was accepted, and I began making improvements to the project to prepare it for presentation at the conference.



By that time, the project had received a new name, BIRMA, and had acquired various additional capabilities (anticipated rather than fully implemented) - all the details can be found in my report.



Frankly, the 2013 BIRMA was hard to call anything complete; frankly speaking, it was a very hastily thrown-together piece of craftsmanship. As far as the code was concerned, there were practically no special innovations, apart from a rather helpless attempt to create a kind of unified syntax for the parser, outwardly resembling the formatting language of IRBIS 64 (and, in fact, of ISIS as well, with parentheses as cyclic structures; why that seemed so cool to me at the time, I cannot say). The parser stumbled hopelessly over these whirlpools of brackets, since brackets of the corresponding type also played another role there, marking optional structures that could be skipped during parsing. Anyone who wants a closer acquaintance with BIRMA's then hard-to-imagine, unjustifiable syntax I again refer to my report of that time.



In general, apart from wrestling with my own parser, I have nothing more to say about the code of this version - except for the reverse conversion of the existing source code into C++ while preserving some typical features of .NET code (to be honest, it is hard to understand what exactly prompted me to move everything back - probably some foolish fear of keeping my source code secret, as if it were something equivalent to the secret recipe of Coca-Cola).



Perhaps this foolish decision also holds the reason for the difficulties of pairing the resulting DLL with the existing interface of the home-grown workstation for entering data into the electronic catalog (yes, I have not yet mentioned another important fact: from then on, all of the BIRMA engine's code was, as one would expect, separated from the interface and packaged in a corresponding DLL). Why I also needed to write a separate workstation for this purpose - one that in its appearance and its way of interacting with the user shamelessly copied the "Cataloguer" workstation of the IRBIS 64 system - is a separate question. In short: it lent due weight to my achievements of that time for the graduation project (otherwise the indigestible parser engine alone would somehow not have been enough). In addition, I then ran into some difficulties pairing the "Cataloguer" workstation with my own modules, implemented in both C++ and C#, as well as with addressing my engine directly.



In general, oddly enough, it was this rather clumsy prototype of the future BIRMA.NET that was destined to become my "workhorse" for the next four years. It cannot be said that during this time I made no attempts at all to find ways toward a new, more complete implementation of the long-standing idea. Among the planned innovations there should already have been nested cyclic sequences, which could in turn include optional elements - that is how I intended to realize the idea of universal templates for bibliographic descriptions of publications, and various other interesting things. In my practice at the time, however, all of this was in little demand, and the implementation I already had was quite sufficient for entering tables of contents. Moreover, the direction of our library's development began to drift more and more toward the digitization of museum archives, report generation and other activities of little interest to me, which in the end made me leave it altogether, giving way to those who would enjoy all this more.



Paradoxically, it was precisely after these dramatic events that the BIRMA project, which by then already possessed all the characteristic features of a typical long-term construction project, seemed to begin gaining its long-awaited new life! I had more free time for idle reflection, I again began scouring the World Wide Web in search of something similar (fortunately, I could now guess to look for all of this where it belongs, namely on GitHub), and somewhere at the beginning of this year I finally came across a corresponding product from the well-known company Salesforce under the unassuming name Gorp. Out of the box it could do almost everything I needed from such a parser engine - namely, intelligently isolate individual fragments from arbitrary text with a clear structure, while offering a fairly digestible interface for the end user, including such clear entities as a template, a pattern and an occurrence, and at the same time employing the familiar syntax of regular expressions, which becomes incomparably more readable thanks to its division into meaningful semantic groups for analysis.
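As a rough analogy (the code below is ordinary Java regex with names I invented for illustration, not Gorp's actual template language), splitting one large regular expression into named semantic groups is precisely what makes such masks readable:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mask for a table-of-contents line such as
// "Ivanov I.I. Some Article Title .... 25": each semantic group
// (author, title, page) is named instead of being buried in one
// monolithic expression.
public class TocMask {
    static final Pattern TOC_LINE = Pattern.compile(
        "(?<author>\\p{Lu}\\p{L}+\\s+(?:\\p{Lu}\\.){1,2})\\s+" + // surname + initials
        "(?<title>.+?)\\s*" +                                    // title (lazy, up to the dots)
        "\\.{2,}\\s*" +                                          // leader dots before the page
        "(?<page>\\d+)\\s*$");                                   // page number at line end

    // Returns {author, title, page} or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = TOC_LINE.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group("author"), m.group("title"), m.group("page") };
    }

    public static void main(String[] args) {
        String[] r = parse("Ivanov I.I. On Parsing Tables of Contents .... 25");
        System.out.println(String.join(" | ", r));
    }
}
```

Referring to fragments by name (`m.group("author")`) rather than by position is what keeps such masks maintainable as they grow.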



In general, I decided that this Gorp (I wonder what the name means - maybe "general oriented regular parser"?) was exactly what I had long been looking for. True, its immediate application to my own needs had one problem: the engine required too strict an adherence to the structural sequence of the source text. For reports such as log files (which the developers used as illustrative examples of the project) this is quite suitable, but for texts of scanned tables of contents it is hardly so. After all, a page with a table of contents can begin with the words "Table of Contents", "Contents" or some other preliminary description that we do not need at all in the results of the proposed parsing (and trimming it off manually every time is inconvenient). In addition, between individual repeating elements such as the author's name, the title and the page number, the page may contain a certain amount of garbage (for example, pictures, or simply random characters), which it would also be nice to be able to cut off. However, that last aspect was not yet so significant; because of the first one, though, the existing implementation could not start searching for the required structures at some specific place in the text - instead it simply processed the text from the very beginning, failed to find the specified patterns there, and... called it a day. Obviously, a revision was required that would at least allow some gaps between the repeating structures, and that made me sit down to work again.
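The problem can be sketched in plain Java regex terms (a hypothetical illustration, not Gorp code): a match that insists the expected structure starts at the very beginning of the text fails as soon as the page opens with a preamble, while scanning for the first occurrence succeeds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified entry mask: "Surname I. Title 5" (names are illustrative).
public class PreambleProblem {
    static final Pattern ENTRY =
        Pattern.compile("(\\p{Lu}\\p{L}+\\s+\\p{Lu}\\.)\\s+(.+?)\\s+(\\d+)");

    // "Strict" engine: the expected structure must start at position 0.
    public static boolean strictMatch(String text) {
        return ENTRY.matcher(text).lookingAt();
    }

    // "Relaxed" engine: scan the whole text for occurrences instead.
    public static List<String> findEntries(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            out.add(m.group(1) + " | " + m.group(2) + " | " + m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "Contents\n\nPetrov A. First Article 5\nSidorov B. Second Article 19";
        System.out.println(strictMatch(page));  // the "Contents" preamble breaks it
        System.out.println(findEntries(page));  // both entries are still found
    }
}
```

The revision described above amounts to moving the engine from the first behavior toward the second.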



Another problem was that the project itself was implemented in Java, while I planned - if I were to go on and build some means of interfacing this technology with the usual applications for entering data into existing databases (such as the IRBIS "Cataloguer") - to do so at the very least in C# and .NET. Not that Java itself is a bad language: I once even used it to implement a not-uninteresting windowed application replicating the functionality of a domestic programmable calculator (as part of a course project). And its syntax is very similar to that of C#, which is only a plus: the easier it would be for me to finish off an existing project. However, I had no desire to plunge again into that rather peculiar world of windowed (or rather desktop) Java technologies - in the end, the language itself was not "sharpened" for such use, and I did not at all crave a repetition of my previous experience. Perhaps it is precisely because C# paired with WinForms is much closer to Delphi, with which many of us once started. Fortunately, the right solution was found quite quickly: the IKVM.NET project, which makes it easy to translate existing Java programs into managed .NET code. True, the project itself had by then been abandoned by its authors, but its latest release still allowed me to carry out the necessary operations on the Gorp sources quite successfully.



So I made all the necessary changes and packaged everything into a DLL of the appropriate kind, which any .NET Framework project created in Visual Studio could easily "pick up". In the meantime, I created another layer for conveniently presenting the results returned by Gorp, in the form of data structures that are easy to process in a tabular representation (addressable both by rows and by columns: both by dictionary keys and by numeric indices). The utilities needed to process and display the results were themselves written quite quickly.
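Such a presentation layer might look roughly like this (a minimal sketch under my own naming; these are not the actual BIRMA.NET classes): rows are occurrences, columns are named fragments, and each cell is reachable both by dictionary key and by numeric index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

// Tabular wrapper for parser results: LinkedHashMap preserves column
// order, so key-based and index-based access stay consistent.
public class ResultTable {
    private final List<String> columns = new ArrayList<>();
    private final List<LinkedHashMap<String, String>> rows = new ArrayList<>();

    public ResultTable(String... columnNames) {
        for (String c : columnNames) columns.add(c);
    }

    public void addRow(String... values) {
        LinkedHashMap<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) row.put(columns.get(i), values[i]);
        rows.add(row);
    }

    // Access by dictionary key...
    public String get(int row, String column) { return rows.get(row).get(column); }

    // ...or by numeric column index.
    public String get(int row, int column) { return rows.get(row).get(columns.get(column)); }

    public int rowCount() { return rows.size(); }

    public static void main(String[] args) {
        ResultTable t = new ResultTable("author", "title", "page");
        t.addRow("Petrov A.", "First Article", "5");
        System.out.println(t.get(0, "title")); // by key
        System.out.println(t.get(0, 2));       // by index
    }
}
```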



Adapting templates to the new engine, in order to teach it to parse the existing samples of scanned table-of-contents texts, likewise caused no particular complications. In fact, I did not even have to consult my earlier drafts at all: I simply created all the necessary templates from scratch. Moreover, while the templates designed for the previous version of the system set a fairly narrow framework for the texts they could correctly parse, the new engine already made it possible to develop fairly universal templates suitable for several kinds of markup at once. I even tried to write one comprehensive template for any arbitrary table-of-contents text - although, of course, even with all the new possibilities opening up before me, including in particular the limited ability to implement those same nested repeating sequences (such as, say, the surnames and initials of several authors in a row), this turned out to be a utopia.



It is possible that in the future it will be feasible to implement a concept of meta-templates, which could check the source text against several of the available templates at once and then, based on the results, select the most suitable one via some intelligent algorithm. But for now I was more concerned with another question. A parser like Gorp, for all its versatility and despite the modifications I had made, was still inherently incapable of one seemingly simple thing that my own hand-written parser had been able to do from its very first version. Namely: to find and extract from the source text all fragments matching a mask specified in the right place within the template, while remaining entirely indifferent to what the text contains in the spaces between those fragments. So far I had only slightly improved the new engine, allowing it to search, from the current position, for all further repetitions of a given sequence of such masks, leaving open the possibility that runs of arbitrary characters enclosed between the detected repeating structures would go entirely unaccounted for in the parse. However, this did not make it possible to specify the next mask independently of the search results for the fragment matched by the previous one: the strictness of the described text structure still left no room for arbitrary inclusions of irregular characters.
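In Java regex terms, the improvement and its remaining limitation can be sketched as follows (an illustration under my own naming, not the modified Gorp code): repetitions of a strict sequence of masks can now be found anywhere in the text, so garbage *between* repetitions is tolerated, but garbage *inside* one repetition still breaks the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One "repetition" is a strict sequence of three masks:
// surname, initial, page number (simplified for illustration).
public class StrictSequence {
    static final Pattern SEQ =
        Pattern.compile("(\\p{Lu}\\p{L}+)\\s+(\\p{Lu}\\.)\\s+(\\d+)");

    public static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SEQ.matcher(text);
        while (m.find()) out.add(m.group(1) + " " + m.group(2) + " p." + m.group(3));
        return out;
    }

    public static void main(String[] args) {
        // Garbage between repetitions is skipped over...
        System.out.println(findAll("Petrov A. 5 @@@ Sidorov B. 19"));
        // ...but garbage between the masks of one repetition loses the entry.
        System.out.println(findAll("Petrov A. ### 5"));
    }
}
```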



And if, for the table-of-contents examples I had encountered, this problem did not yet seem so serious, then when I tried to apply the new parsing mechanism to an essentially similar task - parsing website content (that is, scraping) - its limitations showed themselves in all their obviousness. After all, it is quite simple to specify the necessary masks for the fragments of web markup between which the data we are looking for (and need to extract) should lie; but how do we make the parser immediately move on to the next similar fragment, regardless of all the HTML tags and attributes that may fill the gaps between them?



After a little thought, I decided to introduce a pair of service patterns, (%all_before) and (%all_after), which serve the obvious purpose of skipping everything that may be contained in the source text before any subsequent pattern (mask). Moreover, while (%all_before) simply discards all these arbitrary inclusions, (%all_after), on the contrary, attaches them to the desired fragment once the previous fragment has been passed. It sounds quite simple, but to implement this concept I had to comb through the Gorp sources once more and make the necessary modifications without breaking the already implemented logic. In the end I managed it (though even the very first, admittedly very buggy implementation of my own parser had been written faster still - in a couple of weeks). From then on, the system took on a truly universal form - no less than 12 years after the first attempts to make it work.
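The intended semantics of the two service patterns can be sketched with their regex analogues (illustrative code under my own naming, not the actual Gorp modification): (%all_before) behaves like a lazy skip whose contents are thrown away before the next mask, while (%all_after) keeps the skipped run attached to the extracted fragment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipPatterns {
    // (%all_before) analogue: skip anything up to the next mask
    // and discard the skipped prefix.
    static String allBefore(String text, String maskRegex) {
        Matcher m = Pattern.compile(maskRegex).matcher(text);
        return m.find() ? m.group() : null;
    }

    // (%all_after) analogue: the skipped run is kept and glued to
    // the fragment extracted by the next mask.
    static String allAfter(String text, String maskRegex) {
        Matcher m = Pattern.compile(maskRegex).matcher(text);
        return m.find() ? text.substring(0, m.end()) : null;
    }

    public static void main(String[] args) {
        System.out.println(allBefore("junk 25 tail", "\\d+")); // "25"
        System.out.println(allAfter("junk 25 tail", "\\d+"));  // "junk 25"
    }
}
```

In the real engine the skip must of course cooperate with the surrounding template logic rather than run as a standalone search, but the before/after distinction is the same.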



Of course, this is not yet the ultimate dream. One could, for instance, completely rewrite the Gorp template parser in C# using one of the available libraries for implementing free grammars. I think the code would be greatly simplified, and this would also get rid of the legacy in the form of the existing Java sources. But with the existing engine it is already quite possible to do various interesting things, including an attempt to implement the meta-templates I have already mentioned - not to speak of parsing various data from various websites (though I do not rule out that existing specialized software tools are better suited for this; I simply have no relevant experience with them).



Incidentally, this summer I received an e-mail invitation from a company that uses Salesforce technologies (the developer of the original Gorp) to interview for a job in Riga. Unfortunately, at the moment I am not ready for such a relocation.



If this material arouses some interest, then in the second part I will try to describe in more detail the technology for composing and subsequently parsing templates, using the example of the implementation employed in Salesforce's Gorp (my own additions, apart from the couple of service words already described, make practically no changes to the template syntax, so almost all of the documentation for the original Gorp system applies to my version as well).


