re2c 1.2

On Friday, August 2, re2c 1.2, a free lexical analyzer generator for C and C++, was released. re2c was written in 1993 by Peter Bumbulis as an experimental generator of very fast lexical analyzers. It differs from other generators in the speed of the generated code and in an unusually flexible user interface that makes it easy and efficient to embed lexers into an existing codebase. Since then, the project has been developed by the community and remains a platform for experiments and research in formal grammars and finite state machines.

Main innovations in version 1.2:

  • Added a new, simplified way to check for the end of input
    (the "EOF rule").
    It consists of the re2c:eof configuration,
    which lets you choose the terminating character,
    and a special $ rule that fires when the lexer
    has successfully reached the end of input.
    Historically, re2c has offered several ways to check for the
    end of input, which differ in scope, efficiency, and ease of
    use. The new method is meant to make code simpler to write
    while remaining efficient and widely applicable. The old methods
    still work and may be preferable in some cases.

  • Added the ability to include external files with the directive
    /*!include:re2c "file.re" */, where file.re
    is the name of the included file. re2c looks for files in the directory of the including file
    and in the list of paths given with the -I option.
    Included files can include other files.
    re2c also ships "standard" files in the include/ directory
    of the project; the expectation is that useful regular expression
    definitions will accumulate there, something like a standard library.
    So far, at users' request, one file with definitions of Unicode categories has been added.

  • Added the ability to generate header files with arbitrary
    content using the -t, --type-header option (or the corresponding
    configurations) and the new directives /*!header:re2c:on*/ and
    /*!header:re2c:off*/. This can be useful when
    re2c needs to generate definitions of variables, structures, and macros
    that are used in other translation units.

  • re2c now understands UTF-8 literals and character classes in regular expressions.
    By default, re2c parses an expression like "∀x ∃y" as a
    sequence of 1-byte ASCII characters e2 88 80 78 20 e2 88 83 79
    (hex codes), and users have to escape Unicode characters manually:
    "\u2200x \u2203y". This is very inconvenient and surprises many
    users (as the steady stream of bug reports shows). So
    re2c now provides the --input-encoding option,
    which changes this behavior so that "∀x ∃y" is parsed as
    2200 78 20 2203 79.

  • re2c now allows ordinary re2c blocks in -r, --reuse mode.
    This is convenient when the input file contains many blocks and only some of them
    need to be reused.

  • The format of warnings and error messages can now be set
    with the new --location-format option. The GNU format is displayed
    as filename:line:column: and the MSVC format as filename(line,column).
    This may come in handy for IDE users.
    A --verbose option has also been added; it prints a short message on success.

  • Flex "compatibility" mode has been improved: some parsing errors and
    cases of incorrect operator precedence (in rare situations) have been fixed.
    Historically, the -F, --flex-support option lets you write code
    in a mix of flex style and re2c style, which makes parsing somewhat harder.
    Flex compatibility mode is rarely used in new code,
    but re2c continues to support it for backward compatibility.

  • The character class subtraction operator / is now applied
    before the encoding is expanded, which lets it be used in more cases
    when a variable-length encoding (such as UTF-8) is used.

  • The output file is now created atomically: re2c first creates a temporary file
    and writes the result to it, and then renames the temporary file to the output
    file in a single operation.

  • The documentation has been extended and rewritten; in particular, there are new
    chapters about buffer refilling
    and about ways to check for the end of input.
    The new documentation is assembled into a
    comprehensive one-page manual
    with examples (the same sources are rendered into the manpage and the online documentation).
    Some attempts have been made to improve the readability of the site on phones.

  • From the developers' point of view, re2c has gained a more complete debugging
    subsystem. Debug code is now disabled in release builds and
    can be enabled with the --enable-debug configure option.
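
The idea behind the EOF rule from the first item can be sketched in plain C. The function below is a hand-written illustration, not code generated by re2c, and count_words is a hypothetical name. It assumes the input is followed by one terminating byte (0 here) and performs the bounds check only when it sees that byte, so a 0 inside the input is still treated as ordinary data.

```c
#include <assert.h>
#include <stddef.h>

/* Count space-separated words in buf[0..len).
 * Assumption: the caller guarantees that buf[len] holds the sentinel byte 0,
 * i.e. one extra byte past the input is readable. */
size_t count_words(const char *buf, size_t len)
{
    const char *cursor = buf;
    const char *limit = buf + len;   /* position of the sentinel */
    size_t words = 0;
    int in_word = 0;

    assert(*limit == '\0');
    for (;;) {
        char c = *cursor;
        /* The end-of-input check runs only when the sentinel byte is read. */
        if (c == '\0' && cursor == limit)
            return words;            /* genuine end of input */
        if (c == ' ') {
            in_word = 0;
        } else if (!in_word) {       /* an embedded 0 byte also lands here */
            in_word = 1;
            ++words;
        }
        ++cursor;
    }
}
```

The point of the design is that ordinary characters cost no bounds check at all; the comparison against the limit happens only for the one byte value chosen as the terminator.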
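
The two readings of "∀x ∃y" in the UTF-8 item above can be checked with a small decoder. This is a minimal C sketch that assumes well-formed UTF-8 input; utf8_to_code_points is a hypothetical helper and not part of re2c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode well-formed UTF-8 bytes in src[0..len) into code points.
 * Writes at most cap code points to out and returns how many were written.
 * Illustrative only: malformed or overlong sequences are not rejected. */
size_t utf8_to_code_points(const unsigned char *src, size_t len,
                           uint32_t *out, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < len && n < cap) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;                /* number of continuation bytes */
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        assert(i + extra < len);     /* the sequence must not be truncated */
        for (size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (uint32_t)(src[i + k] & 0x3F);
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Given the nine bytes e2 88 80 78 20 e2 88 83 79, the function yields the five code points 2200 78 20 2203 79, which matches the second reading described in the item.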
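
The write-then-rename pattern from the item about atomic output can be sketched as follows. This is not re2c's actual implementation: write_file_atomically is a hypothetical name, the fixed ".tmp" suffix is a simplification (a real tool would probably pick a unique temporary name), and the sketch assumes a POSIX system, where rename() replaces an existing target atomically.

```c
#include <assert.h>
#include <stdio.h>

/* Write text to path by way of a temporary file in the same directory.
 * Returns 0 on success and -1 on failure; on failure the temporary
 * file is removed and an existing file at path is left untouched. */
int write_file_atomically(const char *path, const char *text)
{
    char tmp[4096];
    FILE *f;
    int n, ok;

    assert(path != NULL && text != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                   /* path too long for the buffer */

    f = fopen(tmp, "w");
    if (f == NULL)
        return -1;
    ok = fputs(text, f) >= 0;
    if (fclose(f) != 0)              /* fclose flushes; errors can surface here */
        ok = 0;

    /* The final name appears only after the content is fully written. */
    if (ok && rename(tmp, path) == 0)
        return 0;
    remove(tmp);
    return -1;
}
```

With this approach a reader of the output path sees either the old file or the complete new one, never a partially written result.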

This release took a long time: almost a whole year.
Most of the time, as always, went into developing the theoretical framework and writing
the paper "Efficient POSIX Submatch Extraction on NFA".
The algorithms described in the paper are implemented in the experimental libre2c library
(building the library and the benchmarks is disabled by default and is enabled with the configure option
--enable-libs). The library is not meant to compete with existing
projects such as RE2, but to serve as a research platform for developing new
algorithms (which can then be used in re2c or in other projects).
It is also convenient for testing, benchmarking, and creating bindings for other languages.

The re2c developers thank everyone who helped make this release happen,
and the community in general for ideas, bug reports, patches, moral support, etc. ;]

Source: linux.org.ru
