Skip to content

Latest commit

 

History

History
602 lines (473 loc) · 29.1 KB

README.md

File metadata and controls

602 lines (473 loc) · 29.1 KB

BARF - Bison Awesomely Replaced, Flex

A suite of compiler tools by Victor Dods

This suite contains programs to supplant the useful, but dated, tools bison (LALR(1) parser generator) and flex (lexical scanner generator):

  • bpp - The BARF Preprocessor. A preprocessor vaguely similar to (but not intended to emulate) the C preprocessor, only with formal looping constructs, arrays and maps.
  • reflex - A lexical scanner generator. Produces arbitrary-language scanners using data-driven specifications. Currently "only" provides a cpp target.
  • trison - An LALR parser generator. Produces arbitrary-language parsers using data-driven specifications. Currently "only" provides a cpp target which is an NPDA parser (General LR parser).

Copyright and License

Copyright Victor Dods

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Licensing of Generated Code

The files generated using this tool suite (bpp, reflex, trison) are the property of the user of the tools. The generated code can therefore be licensed any way the user wishes, and thus the generated code is compatible with being used in any software (proprietary, open-source, etc). In other words, the code generated by the tools (bpp, reflex, trison) is not subject to the license (Apache 2.0) specified in the LICENSE.txt file.

Release Notes

This lists versions 2 and later only. Versioning system is non-strict semantic versioning. In particular, the major revision number certainly indicates backward-incompatible changes, and while the minor revision number usually indicates backward-compatible changes, there may be minor incompabilities. The patch revision number indicates bugfixes and minor changes.

Version Notes
2.7.0 Re-licensed barf under Apache 2.0. Note that this only affects barf internal source code, not the code generated by bpp, reflex, or trison (which never has been subject to any license).
2.6.0 Added barftest unit test suite. Improvements to build system and documentation, including verifying that .deb package creation works as expected. Fixed a problem with the metadev check target.
2.5.0 Added move semantics to the cpp trison target, allowing more efficient parser rule handling. Made a few other improvements, including requiring that the rejection_action directive is explicitly specified in reflex's cpp target, technically making this a backward incompatible change.
2.4.1 Fixed a bug in trison cpp npda target where %lookahead[![...]] was not being handled correctly.
2.4.0 Added ability to print token data in trison cpp target. Fixed a bug having to do with %lookahead in npda generation, and fixed a bug having to do with precedence level vs precedence index that could potentially have caused a crash.
2.3.0 Added ability to assign an id to %error tokens. Added better mechanism for handling parse errors. Added %lookahead directive. Added ability to have #line directives use relative paths. Fixed a few bugs.
2.2.0 Added a method to the trison.cpp target related to the default_max_allowable_lookahead_queue_size directive that should have been there in version 2.1.0. This constitutes a minor revision because a new API method was added.
2.1.0 Addition of default_max_allowable_lookahead_queue_size to trison.cpp target, which provides a tool for preventing infinite loops due to insufficient error-handling rules in a grammar.
2.0.0 Long-awaited release of BARF 2.0! Mainly involves major improvements to trison. In particular, trison.cpp target was refactored to be more robust (though it currently only uses an NPDA implementation).

Earlier Releases

While the first version of trison came about in 2006, official packages were only created in 2010. reflex is more or less the same as back then, but trison has improved drastically over time.

Dependencies

The only dependencies are:

  • Required : Doxygen : to generate documentation, and to generate dot graphs. This is a standard piece of software than can be installed using your system's package manager.

  • Optional (if BUILD_barftest is enabled) : lvd : Library of Victor Dods - Miscellaneously useful C++ code. This can be installed as follows:

    git clone https://github.com/vdods/lvd
    cd lvd
    mkdir build
    cd build
    cmake ..
    make package
    

    This produces a package (e.g. .deb) having a particular version number, which can be installed as:

    sudo dpkg -i lvd-X.Y.Z.deb
    

    where X.Y.Z is the appropriate version number. Later, this will be installable via PPA.

Building and Installing

From the project root, create a build directory (e.g. build), run cmake from within in, then build and install. Example commands:

git clone https://github.com/vdods/barf.git
cd barf
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr ..

Then the project can be built via

make

The above invocation of cmake specifies Release mode, which will generate Makefiles to build the tools with compiler optimizations enabled. The choice of /usr for CMAKE_INSTALL_PREFIX is so that the correct path is hardcoded into the reflex and trison binaries for where to find their targetspec files assuming they'll be installed system-wide (a local installation, e.g. by a user, would use the default -D CMAKE_INSTALL_PREFIX=/usr/local). Then build and install the tools in one of the two following ways:

As a .deb package:

make package
sudo dpkg -i barf-X.Y.Z.deb

where X.Y.Z is the version. Using the debian package manager makes it trivial to track the installed version, upgrade, or remove the package.

As an old-fashioned installation:

make install

On Linux (and probably Mac OS X) this will install the resources to ${CMAKE_INSTALL_PREFIX}/lib/barf which notably includes the barf targets dir, which contains the code templates for generating code from reflex or trison source. The tool binaries bpp, reflex, and trison will be installed to ${CMAKE_INSTALL_PREFIX}/bin.

To run cmake in interactive mode:

cd build
ccmake ..

The ccmake tool is a curses interface for cmake which is easy to use.

Using BARF in Another Project

If barf has been properly installed, the files necessary for cmake's find_package to work will be installed, such that an external project can incorporate BARF using the following code in its CMakeLists.txt file.

find_package(barf REQUIRED)

This defines the functions

reflex_add_source(SOURCE_FILE TARGET_NAME)
trison_add_source(SOURCE_FILE TARGET_NAME)

which when invoked create build rules for generating C++ code from reflex code or trison code respectively.

Examples

There are several reflex scanners and trison parsers within the BARF codebase itself which function as realistic examples.

Additional examples, presented as standalone programs, complete with cmake-based build systems, are presented under the examples directory.

Thanks

To Dan Palm for suggesting BARF (as Bison's Awesome Replacement Framework) as a name, and for being the first other person to attempt to develop anything using it. He created a scripting language called Steel for his game engine Stone Ring.

Development Notes

Running Unit Tests

If the BUILD_barftest option is enabled in cmake (done interactively in ccmake or specified via -D BUILD_barftest=ON option to cmake) then the bin/barftest binary will be built, and when run will execute a suite of unit tests, exercising a real scanner and parser that was taken from another of the author's projects.

bin/barftest

Will silently succeed if no tests fail, otherwise will indicate failed tests. A help message can be seen by supplying the --help option. More detailed test output can be had by using the --log-level=X option, e.g.

bin/barftest --log-level=TRC

will give very very verbose output.

bin/barftest --log-level=INF

will give verbose output, essentially only showing which tests are being run, the results, and the summary of the overall run.

A filter can be specified to run only a subset of tests, e.g.

bin/barftest /00/parser/09/func

will run all tests whose path starts with /00/parser/09/func. Note that the end of the filter string doesn't have to coincide with a / char.

Verifying Self-Hosting

The scanners and parsers in barf are implementing using barf itself. This self-hosting poses particular challenges to testing/verification when adding or changing features to reflex and trison. There are some make targets that assist in testing/verification:

  • dev targets : the normally-built binaries (build/bin/reflex and build/bin/trison) are used to generate the scanners and parsers of barf itself. These generated sources appear under the build/dev directory under their respective subdirectories.
  • metadev targets : derived versions of the reflex and trison binaries (build/bin/dev_reflex and build/bin/dev_trison) are built using the sources produced by the dev targets, which appear under the build/metadev directory under their respective subdirectories.

The most basic check is

make metadev_check

which builds and compares the generated dev and metadev scanners and parsers to ensure that the built binaries produce consistent and stable output. Success is when the produced dev and metadev subdirs of the build dir are identical. This build target should work even from a fresh build dir, and will build bpp, reflex, trison, dev_reflex, and dev_trison in the process.

There are individual targets for each checking particular metadev scanner or parser:

  • make diff_metadev_barf_commonlang_scanner
  • make diff_metadev_barf_preprocessor_parser
  • make diff_metadev_barf_preprocessor_scanner
  • make diff_metadev_barf_regex_parser
  • make diff_metadev_barf_targetspec_parser
  • make diff_metadev_reflex_parser
  • make diff_metadev_trison_parser as well as targets for building the categories of metadev targets:
  • make diff_all_metadev_parsers
  • make diff_all_metadev_scanners as well as one that builds all metadev targets:
  • make diff_all_metadev_thingies

The metadev targets can be built without comparing using the respective make metadev_* and make all_metadev_* targets.

The other check, which is more subtle and intended only for barf developers, and which is currently not functioning as intended (due to the presence of #line directives in the committed-to-repo scanner and parser sources but not in the dev targets), is

make diff_all_dev_thingies

This compares the generated dev scanners and parsers with the respective committed-to-repo sources, so that the specific changes in generated scanner and parser due to development work can be exactly quantified, or verified to produce no difference. As with all the many metadev make rules described above, there are analogous dev make rules.

CMake probably handles the dependencies of these targets correctly, but if you want to make absolutely sure that fresh sources have been generated and targets built, then run the following command before make metadev_check.

make clean clean_all_dev_and_metadev

To-dos

Some of these are super old and may no longer apply.

Bugs

  • make diff_all_dev_thingies is not functioning because of the presence of #line directives in the committed-to-repo scanner and parser sources, and their absence in the dev targets. This could be resolved by making the dev targets also use #line directives (as this affects the use of newlines in the generated sources), but applying a transformation to strip out these directives before diffing.

  • disallow empty reflex states

  • reflex: an unterminated regex causes a hang (while reflex is parsing the *.reflex source file):

    %state_machine BLAH
    :
        (
    ;
    
  • bug (in trison) -- lack of a parse error handler for:

    %terminal XXX %type "Blah"
    

    which should be

    %terminal XXX %type.cpp "Blah"
    
  • bug (in trison) -- lack of a parse error handler for:

    %nonterminal XXX %type
    

    where the string following %type is missing.

  • looks like trison doesn't catch EOF before ; in a nonterminal declaration

  • bug (in trison npda target) -- negated %lookahead directive (at end of reduction rule) isn't implemented correctly. For example,

    state 315:
        rule 23: expression <- IDENTIFIER . %lookahead[![':'|DEFINE|CT|RT|LOCAL|GLOBAL]]
    
        default:REDUCE rule 23
        ':':INSERT_LOOKAHEAD_ERROR
        CT:INSERT_LOOKAHEAD_ERROR
        RT:INSERT_LOOKAHEAD_ERROR
        LOCAL:INSERT_LOOKAHEAD_ERROR
        GLOBAL:INSERT_LOOKAHEAD_ERROR
        DEFINE:INSERT_LOOKAHEAD_ERROR
    

    this will reduce even when the lookahead is DEFINE, since the default reduce action logic doesn't check for these other transitions (which should each prevent the default action).

  • bug in CMakeLists.txt -- diff_all_dev_thingies doesn't work anymore (since moving generated code to generated subdirs.

BARF General To-dos

  • write man pages for reflex, trison, bpp (hopefully somehow get doxygen to do this)
  • figure out how to generate/install docs
  • guarantee that errors will be reported such that the line numbers never decrease.
  • bug: getting "file not found" error when a file exists but the permissions are such that the user can't read it.
  • add lots of error rules to all parser grammars
  • descriptions should use terms "finite-automaton-based scanners" and "pushdown-automaton-based parsers"
  • error when any output file is the same as an input file, or when any two output files are the same.
  • change comments to %/ and %* *% in primary sources?
  • move trison/reflex stuff into lib/trison and lib/reflex? and only keep main, options and main header file in app dir (so test apps can link the libs and run tests)? put stuff in Barf::Xxx namespace
  • do comprehensive deleting of stuff (this might be hard with all the throwing) THEN do memory leak checking (valgrind)
  • get rid of AST hierarchy overviews once doxygen docs are to some decent state
  • use Ast::Directive::DirectiveString in error messages instead of creating the error messages by hand -- make all directives derive from Ast::Directive
  • file locations should be "2d range" -- a starting line/column and an ending line/column (or starting line/column and a number of bytes following). nonterminal expressions can then be the union of the file locations of their component parts. this technique will be useful later in compiler design for debugging info.
  • add filename-sanitizing function -- i.e. to check for newlines and other stuff
  • save off FiLoc where each error panic starts, so that if it's a general parse error, at least some info can be printed.
  • Change doxygen comments to all /// style
  • Ensure all the different types of cleanup are performed, verify with valgrind.

CMake-related To-dos

  • Change shell commands (e.g. invocations of make, rm, diff) over to use corresponding cmake commands, such as file(REMOVE ...), etc.

Documentation To-dos

  • table of contents inside each doxygen_pages page
  • pyramidal documentation
  • simplify documentation (maybe leave out basic regex stuff, keep to essentials)

Commonlang To-dos

  • target directives (including rule handlers) should accept multiple targets ? like

    %target.cpp_nostl,cpp_withstl,cpp_noexceptions %{ blah %}
    

Core To-dos

  • make a Graph::Node::Index type (a struct or an enum with a value UINT32_UPPER_BOUND (to force it to be unsigned)) and Graph::Node::IndexArray vector type in barf_graph.h (replacing the ad-hoc vector used all over the place)
  • check standard on octal escape chars (up to 3 octal digits only?)
  • make EscapeChar (which does full char escaping) and use it in EscapeString? this would make the char scanning much easier
  • make repeated identical (or perhaps just very similar) warnings/errors suppressed (e.g. in preprocessor, having an error in a loop will cause that error to be emitted that many times)

Targetspec To-dos

  • allow using the value of previously declared directives as default values (this would be had for free if targetspec-declared, primary-source-defined codespec symbols are allowed to use codespec symbols -- see codespec section)
  • add targetspec versions (target-specific) -- so you can compile/link multiple coexisting versions of the same target but at different versions. do this by sticking a version number in the namespace of the generated baseclasses

Codespec To-dos

  • add comment/description element to each codespec symbol, so that you can print out a list of all available codespec symbols and their descriptions from the commandline for easier development (i.e. to the state machine ones so that you don't have to look through xxx_codespecsymbols.cpp for these)
  • make better error message (i.e. anything informative at all) when there's an error in an included codespec file.

Preprocessor To-dos

  • add parser error handler for when argument to is_defined is not an identifier
  • just make the preprocessor language better in general -- it's sort of crappy
  • add a printf function
  • add some way to use a different file location for errors/warnings
  • add comments (C++ and C style)
  • make _blah ids not be able to be defined in-file, they're reserved
  • add assignment operator; e.g. <{x=x+3}
  • add eval directive so you can parse and execute a block of text as a preprocessor body.
  • add 64 bit ints (better) and add whatever the negative-most 64-bit int is to the scanner (because the way negative ints are constructed in the parser screws this up)
  • maybe make the Textifier detect if a newline was just printed, so that it doesn't print unnecessary newlines when generating line directives
  • add boolean property for wether or not a particular symbol was referenced during preprocessing, so that warnings can be emitted if a symbol isn't used
  • Could add "indent" and "unindent" directives in order to better format code. Would need this to work inside "if" blocks.
  • Add ability to print list of dependencies (i.e. included files). This would depend on the include path as well as the preprocessor variable values themselves.

Regex To-dos

  • document all of this regex stuff in detail, since it's pretty complicated, and not completely standard
  • in graph printing, TT_INPUT_ATOM_RANGEs which include [ or ] as the ending char need to escape them
  • generally fix up the escaping of characters in display strings and graphs to escape in the context of the regex.
  • make a parse-string method which returns a Regex::RegularExpression and possibly throws to indicate errors.
  • move NodeData into separate file
  • improve the DFA generation so that when throwing an exception when there is a conditional conflict, it includes the file location that caused it.
  • add a pass to NFA generation where it optimizes unnecessary epsilon transitions out.
  • bracket expression transitions should use single-atom transitions when possible

BPP To-dos

  • add predefine (-D) option

Reflex To-dos

  • instead of always extracting the accepted string and passing it in, make it available through an accessor, so it can be constructed lazily. OR somehow just return pointers into the buffer for the accepted string, so that there is as little overhead as possible when running the scanner. Could do this with std::span (newer C++ standard)
  • maybe consistencize the codespec symbol names
  • bottom_of_scan_method_actions targetspec directive should indicate which std::string (analogous to accepted_string or rejected_string) is available for that code, in case something needs to be done with the string itself, instead of just knowing that no handler ran.

Reflex/Trison Common To-dos

  • use smallest integer type to store state and transition indices, etc. (in C++, this can be done easily with templates)

  • move the --with-line-directives option to be an optional targetspec directive

  • add targetspec directive for line directives referencing codespecs (the default should be to not refer to codespecs), making development easier

  • add option which prints the reflex/trison targets available in the current search path and which directory each resides in.

  • the rule handler code block type should be specifiable by the targetspec

  • specifying any %target.xxx directive seems to add the target; stop this

  • allow "%targets xxxx" in pre/postdefine

  • also add post-undefine option

  • warn about targetspec directive values that weren't referenced in the code generation (preprocessor) stage -- ones that don't appear at all in any codespec, not necessarily to indicate they weren't used in one particular case.

  • Ensure that commandline-specified values for scanner/parser directives override in-source values, so that scanner/parser generation can be controlled from the build system if needed, e.g. for unit testing.

  • Run preprocessor on code blocks in reflex and trison, so that e.g. stuff like this can work in a reflex/trison code block.

    Parser::ParserReturnCode parse (std::string const &s, double &parsed_value)
    {
        std::istringstream in(s);
        Parser parser;
        parser.attach_istream(in);
    <|if(is_defined(generate_debug_spew_code))
        parser.DebugSpew(true);
    <|end_if
        Parser::ParserReturnCode return_code = parser.Parse(&parsed_value);
        return return_code;
    }
    
  • Probably restructure the targets directory to be hierarchical, like

    targets
        reflex
            cpp
                <associated files>
        trison
            cpp
                <general use trison cpp files>
                dpda
                    <associated files
                npda
                    <associated files>
    

    and then make the include directive use that directory hierarchy. Probably require full paths in include directive, and disallow relative paths.

  • Make error messages such that KDevelop (and other IDEs) correctly parse them and clicking on the error message goes to the correct line in the correct source.

Trison To-dos

  • pass reference instead of double pointer to Parse

  • refactor the DPDA generation to be simpler and faster (difficult)

  • if possible, make the dpda.states file put the states' rules in order.

  • you shouldn't have to specify %terminal 'X' for ascii chars. you should just be able to use them (or maybe add as a directive)

  • check for reduction rule variable name collisions

  • decouple npda states from trison ASTs -- they should only exist in the graph itself.

  • add an unbounded grammar app/dev test case, and deal with unbounded grammars (since DPDAs can't be generated for them)

  • compactify lookaheads in dpda

  • make sure lookahead sequences are alphabetized, allowing for a binary search instead of a linear search (in the dpda target)

  • unit tests for NPDA

  • add some nice metrics to the NPDA parser -- make this part of the parser API

    • max lookahead count (done)
    • max memory used
    • some measurement of the "leanness" of the grammar, like how much branching was encountered, etc
  • in the NPDA dot graph, maybe move the self-transitions to just text lines in the state box itself, to avoid the mess that dot creates with self transitions.

  • maybe add some online DPDA generation in NPDA during runtime, optionally saving the tables to a file. The file should be able to be specified as "read-only". This would allow a balance between DPDA-generation for commonly used grammar pathways and the flexibility of NPDA parsers.

  • port NPDA code to offline DPDA generation (after NPDA has been fully unit tested)

  • NPDA parser test design notes

    • need to test a bunch of different grammars
      • grammars with bounded lookahead (1, 2, 3, and 0 if that's possible)
      • grammars with unbounded lookahead
      • try to come up with a worst case grammar
      • look on wikipedia and other websites for grammars
      • use actual grammars like from barf, steel, etc.
    • for each grammar, make sure each rule is exercised
    • for each grammar, make sure each combination of rules (i.e. binary operators) is exercised Unit test should be compiled into a single binary. Each Parser should have a distinct name, and the various inputs to the parsers should be stored in files. The outputs should be ASTs, and compared against "expected" ASTs which can be defined in code.
  • Add an error handling rule for trison:

    %nonterminal maybe_assigned_id %type "Ast::Id *"
    

    caused the generic error syntax error in %nonterminal directive because of missing .cpp

  • Probably compute epsilon closure of states during trison execution, so that the NPDA implementation doesn't have to implement it. This could also be used for the NPDA states file, so that it requires the reader to do less work.

  • Perhaps in NPDA states file, show epsilon closure and transitions from all states in that epsilon closure, so that the human reader doesn't have to do as much mental work.

  • Collapse some options together using arguments (e.g. the pairs of "do" and "don't" options).

  • Add symbols and strings/chars to the body section of codespec kate syntax highlighting file, so that at least some syntax highlighting is done in the body.

  • If "indent" and "unindent" directives were present in preprocessor, then Python targets would be possible (since they're indentation-sensitive). Would need this to work inside "if" blocks.

  • Make the npda states file grammar section print out the precedence levels and print the precedence for each rule (including default)

  • There appears to be a crash bug in trison where an undeclared terminal is used in the bracketed %error[...] syntax.

  • Should re-add the specification of which stage of each rule it is in the npda state file, probably as a tuple, like if there is

    rule 123 : blah <- A B C
    

    then there would be

    rule 123,0 : blah <- . A B C
    rule 123,1 : blah <- A . B C
    rule 123,2 : blah <- A B . C
    rule 123,3 : blah <- A B C .
    
  • Maybe make %error into a nonterminal (this is somewhat pedantic).

  • Probably just get rid of ResetForNewInput, favoring destroying and recreating the parser/scanner. At least examine if there's any reason to keep that feature.

  • Make an option in reflex and trison for printing a list of dependent targetspec and codespec files for a given source or target. I.e. the input file is parser.trison and it specifies targets target.cpp and target.fakelang, and then it prints target.cpp.targetspec, target.cpp.header.codespec, target.cpp.implementation.codespec, target.cpp.npda.header.codespec, target.cpp.npda.implementation.codespec, ..., target.fakelang.targetspec target.fakelang.codespec. This way, build processes can be more automated and require less hand-specification.

    Similarly, make an option where it prints the files that would be generated by a command (i.e. thing_parser.trison -> thing_parser.hpp thing_parser.cpp

  • Should probably make NPDA DAG generation forbid an INSERT_ERROR_TOKEN action when the lookahead is ERROR_. This would prevent one class of infinite loop.