The story of one small project spanning twelve years (about BIRMA.NET, for the first time and, frankly, first-hand)

This project was born from a small idea that came to me sometime at the end of 2007, an idea destined to find its final form only 12 years later (final as of this moment, of course, although in the author's opinion the current implementation is quite satisfactory).

It all started when, in the course of my official duties at the library back then, I noticed that the process of entering data from the scanned text of tables of contents of book (and sheet-music) publications into the existing database could apparently be significantly simplified and automated, by taking advantage of the fact that all the data required for input is ordered and repetitive: the name of the author of an article (if we are talking about a collection of articles), the title of the article (or the subtitle reflected in the table of contents), and the page number of the current table-of-contents entry. At first I was practically convinced that a system suitable for this task could easily be found on the Internet. When, somewhat to my surprise, I failed to find such a project, I decided to try to implement it on my own.

After a fairly short time the first prototype was up and running, and I immediately began using it in my daily work, debugging it along the way on every example that came to hand. Fortunately, at my regular workplace, where I was by no means employed as a programmer, I could still get away with visible "downtime", during which I was intensively debugging my brainchild - an almost unthinkable thing in today's realities, with their daily reports on the work done during the day. Polishing the program took no less than a year in total, but even after that the result could hardly be called entirely successful: too many concepts had been laid down from the start that were not entirely clear how to implement - optional elements that could be skipped; look-ahead of elements (so that earlier elements could be substituted into the search results); even my own attempt at something like regular expressions (with a unique syntax of its own). I should say that before this I had pretty much given up programming (for about 8 years, if not more), so the new opportunity to apply my skills to an interesting and necessary task completely captured my attention. It is not surprising that the resulting source code - in the absence of any clear approach to its design on my part - quite quickly turned into an unimaginable hodgepodge of disparate pieces of C with some elements of C++ and aspects of visual programming (initially I had decided to use Borland C++ Builder - "almost Delphi, but in C"). Nevertheless, all of this ultimately bore fruit in automating the daily activities of our library.

Around the same time I decided, just in case, to take courses for training professional software developers. I do not know whether it is actually possible to learn "to be a programmer" there from scratch, but given the skills I already had at the time, I managed to get a reasonable grasp of technologies that were more relevant by then, such as C# and Visual Studio for development under .NET, as well as some technologies related to Java, HTML and SQL. The entire training took two years in total and served as the starting point for another project of mine, which eventually stretched over several years - but that is a topic for a separate publication. Here it is only worth noting that I attempted to adapt the work I already had on the project described here into a full-fledged windowed application in C# and WinForms implementing the necessary functionality, and to use it as the basis for my upcoming diploma project.

Over time, this idea began to seem to me worthy of being voiced at annual conferences such as "LIBKOM" and "CRIMEA", which bring together representatives of various libraries. The idea itself, that is, but not my implementation of it at the time. Back then I still hoped that someone else would rewrite it using more competent approaches. One way or another, by 2013 I decided to write a report on my preliminary work and send it to the conference organizing committee along with an application for a grant to participate. Somewhat to my surprise, my application was approved, and I began making improvements to the project to prepare it for presentation at the conference.

By that time the project had already received its new name, BIRMA, and acquired various additional capabilities (not so much fully implemented as planned); all the details can be found in my report.

To be honest, BIRMA 2013 was hard to call anything complete; frankly speaking, it was a very hacky contraption made in haste. In terms of code there were practically no special innovations at all, apart from a rather helpless attempt to create some kind of unified syntax for the parser, outwardly reminiscent of the IRBIS 64 formatting language (and, in fact, of the ISIS system as well), with parentheses as cyclic structures (for some reason, at the time I thought it looked pretty cool). The parser hopelessly stumbled over these rings of parentheses of the appropriate kind, since parentheses also performed another role: they marked optional structures that could be skipped during parsing. Anyone who wants a closer look at BIRMA's then hard-to-imagine, unjustified syntax I once again refer to my report from that time.

In general, apart from struggling with my own parser, I have nothing more to say about the code of this version, except for the reverse conversion of the existing sources into C++ while preserving some typical features of .NET code (to be honest, it is hard to understand what exactly prompted me to move everything back - probably some silly fear of keeping my source code secret, as if it were something equivalent to the secret recipe of Coca-Cola).

Perhaps this silly decision is also the reason for the difficulties in coupling the resulting DLL with the existing interface of a home-made workstation for entering data into the electronic catalog (yes, I have not yet mentioned another important fact: from then on, all the code of the BIRMA "engine" was, as it should be, separated from the interface part and packaged into a corresponding DLL). Why it was necessary to write a separate workstation for these purposes, which in any case shamelessly copied, in its appearance and its way of interacting with the user, the "Cataloguer" workstation of the IRBIS 64 system, is a separate question. In short: it gave the necessary solidity to my then developments for my graduation project (the indigestible parser engine alone was somehow not enough). In addition, I then ran into some difficulties in interfacing the Cataloguer workstation with my own modules, implemented in both C++ and C#, as well as in accessing my engine directly.

In general, oddly enough, it was this rather clumsy prototype of the future BIRMA.NET that was destined to become my "workhorse" for the next four years. It cannot be said that during this time I did not at least try to find ways toward a new, more complete implementation of the long-standing idea. Among other innovations there should already have been nested cyclic sequences that could include optional elements; this is how I was going to bring to life the idea of universal templates for bibliographic descriptions of publications, along with various other interesting things. However, in my practical work at the time all this was in little demand, and the implementation I already had was quite sufficient for entering tables of contents. In addition, the vector of our library's development began to drift more and more toward the digitization of museum archives, reporting and other activities of little interest to me, which in the end forced me to leave it for good, giving way to those who would be happier with all this.

Paradoxically, it was after these dramatic events that the BIRMA project, which by then already had all the characteristic features of a typical long-term construction project, seemed to begin to take on its long-awaited new life! I had more free time for idle thoughts, and I again began combing the World Wide Web in search of something similar (fortunately, by now I could guess to look for it not just anywhere, but on GitHub), and somewhere at the beginning of this year I finally came across a suitable product from the well-known company Salesforce under the unassuming name Gorp. By itself, it could do almost everything I needed from such a parser engine: namely, intelligently isolate individual fragments from arbitrary but clearly structured text, while offering the end user a fairly friendly interface, including such understandable entities as a pattern, a template and an occurrence, and at the same time using the familiar syntax of regular expressions, which becomes incomparably more readable thanks to the division into designated semantic groups for parsing.

In general, I decided that this Gorp (I wonder what the name means - perhaps some kind of "general oriented regular parser"?) was exactly what I had been looking for for a long time. True, its immediate application to my own needs ran into one problem: the engine required too strict adherence to the structural sequence of the source text. For some reports, such as log files (which the developers used as clear examples of the project's use), this is quite suitable, but for the same texts of scanned tables of contents it hardly is. After all, the same page with a table of contents may begin with the words "Table of Contents", "Contents" or any other preliminary descriptions that we do not need to place in the results of the intended analysis (and cutting them off manually every time is inconvenient as well). In addition, between individual repeating elements, such as the author's name, the title and the page number, the page may contain a certain amount of garbage (for example, drawings, or simply random characters), which it would also be nice to be able to cut off. However, this last aspect was not yet so significant, whereas because of the first one the existing implementation could not start looking for the necessary structures in the text from a certain place; instead it simply processed the text from the very beginning, failed to find the specified patterns there and... finished its job. Obviously, some tweaking was needed to at least allow some space between the repeating structures, and that got me back to work.

Another problem was that the project itself was implemented in Java, whereas if I planned in the future to implement some means of interfacing this technology with familiar applications for entering data into existing databases (such as IRBIS's "Cataloguer"), I should do so at the very least in C# and .NET. Not that Java itself is a bad language: I once even used it to implement an interesting windowed application that reproduced the functionality of a domestic programmable calculator (as part of a course project). And in terms of syntax it is very similar to the same C#. Well, that is only a plus: the easier it would be for me to finish off the existing project. However, I did not want to plunge again into that rather unusual world of windowed (or rather, desktop) Java technologies; after all, the language itself was not "tailored" for such use, and I did not at all crave a repetition of the previous experience. Perhaps this is precisely because C# in conjunction with WinForms is much closer to Delphi, with which many of us once started. Fortunately, the necessary solution was found quite quickly: the IKVM.NET project, which makes it easy to translate existing Java programs into managed .NET code. True, the project itself had already been abandoned by its authors by that time, but its latest release allowed me to quite successfully carry out the necessary work on the Gorp sources.
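
For reference, converting a ready-made Java library into a .NET assembly with IKVM.NET comes down to a single call of its static compiler, ikvmc. The file names below are placeholders rather than the actual artifacts, and the exact set of options may vary between IKVM releases:

    ikvmc -target:library -out:Gorp.dll gorp.jar

The resulting assembly can then be referenced from a Visual Studio project like any other .NET Framework library.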

So I made all the necessary changes and assembled it all into a DLL of the appropriate type, which could easily be "picked up" by any .NET Framework project created in Visual Studio. In the meantime, I wrote another layer for convenient presentation of the results returned by Gorp, in the form of data structures that would be convenient to process in a table view (based on both rows and columns, on both dictionary keys and numerical indexes). The utilities needed for processing and displaying the results were themselves written quite quickly.
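
To give an idea of what such a presentation layer might look like, here is a minimal hypothetical sketch in C#; the actual class and member names in BIRMA.NET are not published here, so everything below is invented for illustration:

    using System;
    using System.Collections.Generic;

    // A row keeps the extracted fragments both in column order and under their group names.
    public class ParsedRow
    {
        private readonly List<string> byIndex = new List<string>();
        private readonly Dictionary<string, string> byKey = new Dictionary<string, string>();

        public void Add(string column, string value)
        {
            byKey[column] = value;
            byIndex.Add(value);
        }

        public string this[int i] => byIndex[i];             // access by numerical index
        public string this[string column] => byKey[column];  // access by dictionary key
    }

    // The whole parsing result is then just a list of such rows.
    public class ParsedTable
    {
        public List<ParsedRow> Rows { get; } = new List<ParsedRow>();
    }

With a wrapper of this kind, the display utilities can walk the result like a table and refer to columns either positionally or by the names of the semantic groups defined in the template.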

Adapting the templates for the new engine, in order to teach it to parse the existing samples of scanned table-of-contents texts, did not cause any particular complications either. In fact, I did not even have to refer to my previous templates at all: I simply created all the necessary templates from scratch. Moreover, while the templates designed to work with the previous version of the system set a fairly narrow framework for the texts that could be correctly parsed with their help, the new engine already made it possible to develop fairly universal templates suitable for several types of markup at once. I even tried to write some kind of comprehensive template for any arbitrary table-of-contents text, although, of course, even with all the new possibilities opening up before me, including, in particular, the limited ability to implement nested repeating sequences (such as the surnames and initials of several authors in a row), this turned out to be a utopia.

Perhaps in the future it will be possible to implement a concept of meta-templates, which would check the source text against several of the available templates at once and then, based on the results obtained, select the most suitable one using some kind of intelligent algorithm. But at that point I was more concerned with another question. A parser like Gorp, for all its versatility and the modifications I had made, was still inherently incapable of one seemingly simple thing that my self-written parser had been able to do from its very first version: namely, to find and extract from the source text all the fragments matching a mask specified in the right place within the template, while being completely indifferent to what the text contains in the spaces between those fragments. So far I had only slightly improved the new engine, allowing it to search from the current position for all possible new repetitions of a given sequence of such masks, leaving room for sets of arbitrary characters, completely unaccounted for in the parsing, to appear between the detected repeating structures. However, this did not make it possible to set the next mask regardless of the result of the search for the previous fragment using its mask: the strictness of the described text structure still left no room for arbitrary inclusions of irregular characters.
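
Purely as an analogy in plain .NET regular expressions (this is not the actual Gorp or BIRMA template syntax, and the masks for author, title and page number are invented for the example), the improvement described above amounts to letting the engine pick up each repetition of a fixed sequence of masks wherever it occurs, while ignoring whatever garbage lies between the repetitions:

    using System;
    using System.Text.RegularExpressions;

    class TocDemo
    {
        static void Main()
        {
            // A scanned table of contents with a heading and some junk between entries.
            string text = "CONTENTS\nIvanov I. Introduction ..... 5\n***\nPetrov P. Chapter one ..... 17\n";

            // One repeating structure: author, title, page number.
            // Matches() finds every repetition from the current position onward,
            // so the heading and the junk between entries are simply skipped,
            // while the order author -> title -> page inside each entry stays strict.
            var entry = new Regex(@"(?<author>\p{L}+\s+\p{L}\.)\s+(?<title>.+?)\s*\.{2,}\s*(?<page>\d+)");

            foreach (Match m in entry.Matches(text))
                Console.WriteLine(m.Groups["author"].Value + " | " + m.Groups["title"].Value + " | " + m.Groups["page"].Value);
        }
    }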

And while for the table-of-contents examples that I came across this problem did not yet seem so serious, when I tried to apply the new parsing mechanism to the similar task of parsing the contents of a website (i.e. plain old scraping), its limitations showed up in all their obviousness. After all, it is quite easy to set the necessary masks for the fragments of web markup between which the data we are looking for (and need to extract) should be located, but how do we force the parser to move on immediately to the next similar fragment, despite all the possible tags and HTML attributes that may sit in the spaces between them?

After thinking a little, I decided to introduce a couple of service patterns, (%all_before) and (%all_after), which serve the obvious purpose of skipping everything that may be contained in the source text before whatever pattern (mask) follows them. Moreover, while (%all_before) simply discards all these arbitrary inclusions, (%all_after), on the contrary, allows them to be appended to the desired fragment after moving on from the previous fragment. It sounds quite simple, but to implement this concept I had to comb through the Gorp sources once again and make the necessary modifications without breaking the already implemented logic. In the end I managed to do it (although even the very first, albeit very buggy, implementation of my own parser had been written faster than that - in a couple of weeks). From then on, the system took on a truly universal form, no less than 12 years after the first attempts to make it work.
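
To make this a little more tangible, here is a rough sketch of how such a service pattern might sit in a table-of-contents template. Only (%all_before) and (%all_after) themselves are the additions described above; the mask names and the overall notation are invented purely for illustration and do not reproduce the real Gorp/BIRMA.NET template syntax:

    (%all_before)(author)(title)(page_number)    the "Contents" heading and anything else before the first entry is skipped and discarded
    (%all_after)(author)(title)(page_number)     the same inclusions are also skipped, but are kept and attached to the extracted fragment

In the web-scraping case from the previous paragraph the effect is the same: the parser can jump straight to the next fragment of interest, no matter which tags and attributes happen to lie in between.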

Of course, this is not the end of my dreams for it. Among other things, the Gorp template parser could be rewritten entirely in C#, using one of the available libraries for implementing free grammars. I think the code would be significantly simplified as a result, and it would let me get rid of the legacy in the form of the existing Java sources. But with the engine in its existing form it is also quite possible to do various interesting things, including an attempt to implement the meta-templates I have already mentioned, not to mention parsing various data from various websites (though I do not rule out that existing specialized software tools are better suited for this - I simply have not yet had the relevant experience with them).

Incidentally, this summer I received an e-mail invitation from a company that uses Salesforce technologies (the developer of the original Gorp) to take an interview for subsequent work in Riga. Unfortunately, at the moment I am not ready for such a relocation.

If this material arouses some interest, then in the second part I will try to describe in more detail the technology for compiling and subsequently parsing templates, using the example of the implementation employed in Gorp from Salesforce (my own additions, apart from the couple of service words already described, make virtually no changes to the template syntax itself, so almost all the documentation for the original Gorp system is suitable for my version as well).

