Lexers
A lexer’s job is to take a stream of text and chop it up into a sequence of tokens: little strings categorized with type codes like `NUMBER` or `KEYWORD`. For more information on tokens and their characteristics, read Tokens.
I do not intend to cover many details of lexical analysis in this article; for a general overview of lexers, consult an introductory text on compilers. However, highlighting a few of the key differences between lexers and parsers should help clarify the documentation below for users who are new to ANTLR.
As described above, lexers chop raw language text into tokens. If a particular character sequence in the input cannot be matched against the lexical rules of the grammar, the lexer must handle the syntax error somehow and optionally report the error.
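The loop described above can be sketched in plain Ruby. This is an illustrative toy, not ANTLR’s implementation: it tries each rule pattern against the front of the input, emits a typed token on a match, and raises on input that no rule can match:

```ruby
# A minimal, hand-rolled illustration of what a lexer does:
# scan text, emit typed tokens, and fail on unmatched characters.
Token = Struct.new(:type, :text)

def tokenize(source)
  tokens = []
  scanner = source.dup
  until scanner.empty?
    case scanner
    when /\A\d+/    then tokens << Token.new(:NUMBER, $&)
    when /\A[a-z]+/ then tokens << Token.new(:KEYWORD, $&)
    when /\A\s+/    then nil  # skip whitespace, emit nothing
    else raise "no lexer rule matches: #{scanner[0].inspect}"
    end
    scanner = $'  # continue after the matched prefix
  end
  tokens
end

tokenize("if 42").map { |t| [t.type, t.text] }
# => [[:KEYWORD, "if"], [:NUMBER, "42"]]
```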
A parser’s task is then to take the stream of tokens extracted by the lexer and recognize larger syntactic structures. Parsers may have more than one point of entry. For example, a Python parser will probably have one entry rule for parsing full Python scripts and another for parsing a single statement read from an interactive prompt. Since Python restricts the content permitted in its `eval` statement, the parser may have an additional entry rule for strings passed to `eval`.
In contrast, lexers typically have a single point of entry. A grammar will define a number of different tokens that are possible within a language using lexer rules. ANTLR will derive an additional lexer rule that combines all of the lexical rules, invoking the most appropriate token rule based upon the next few characters in the input stream. Thus, while parsers can have any number of independent entry points, lexers automatically choose the most appropriate rule for the current input.
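To make the dispatch idea concrete, here is a toy sketch (not ANTLR’s generated code) of a combined `token!`-style entry point that peeks at the next character and calls whichever rule method matches:

```ruby
# Hypothetical illustration of a derived dispatch rule.
class ToyLexer
  def initialize(input)
    @input = input
    @pos = 0
  end

  # single public entry point: peek one character ahead and
  # dispatch to the matching rule method
  def token!
    char = @input[@pos]
    case char
    when /\d/    then digit!
    when /[a-z]/ then word!
    else raise "no lexer rule matches #{char.inspect}"
    end
  end

  private

  def digit!
    [:DIGIT, consume(/\A\d+/)]
  end

  def word!
    [:WORD, consume(/\A[a-z]+/)]
  end

  def consume(pattern)
    text = @input[@pos..-1][pattern]
    @pos += text.length
    text
  end
end

ToyLexer.new("7abc").token!  # => [:DIGIT, "7"]
```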
For a `combined` or `lexer` grammar named `Language`, `antlr4ruby` will generate a lexer class. In other ANTLR targets, the generated class is named `LanguageLexer`. However, this Ruby implementation generates a top-level module named `Language`, which serves as a general namespace for the various entities required by the code. The actual lexer class is simply named `Lexer`, so the full name of the generated lexer will be `Language::Lexer`. Consider this really simple lexer grammar:
```antlr
lexer grammar Digit;
options { language = Ruby; }

DIGIT: '0' .. '9';
```
An abbreviated form of the output generated by `antlr4ruby Digit.g` is shown below:
```ruby
# edited out code that ensures the antlr3 runtime library is required

module Digit
  # TokenData defines all of the token type integer values
  # as constants, which will be included in all
  # ANTLR-generated recognizers.
  const_defined?(:TokenData) or TokenData = ANTLR3::TokenScheme.new

  module TokenData
    # define the token constants
    define_tokens( :DIGIT => 4, :EOF => -1 )
  end

  class Lexer < ANTLR3::Lexer
    @grammar_home = Digit
    include TokenData

    begin
      generated_using( "Digit.g", "3.2.1-SNAPSHOT Dec 18, 2009 04:29:28", "1.6.3" )
    rescue NoMethodError => error
      error.name.to_sym == :generated_using or raise
    end

    RULE_NAMES   = ["DIGIT"].freeze
    RULE_METHODS = [:digit!].freeze

    def initialize(input=nil, options = {})
      super(input, options)
    end

    # - - - - - - - - - - - lexer rules - - - - - - - - - - - -

    # lexer rule digit! (DIGIT)
    # (in /home/kyle/lib/ruby/projects/antlr3/test/functional/lexer/basic/Digit.g)
    def digit!
      # edited out recognition logic
    end

    # main rule used to study the input at the current position,
    # and choose the proper lexer rule to call in order to
    # fetch the next token
    #
    # usually, you don't make direct calls to this method,
    # but instead use the next_token method, which will
    # build and emit the actual next token
    def token!
      # at line 1:10: DIGIT
      digit!
    end
  end # class Lexer < ANTLR3::Lexer
end
```
Thus, the generated code for a lexer creates the following named entities:
- `module Language` – where `Language` is the name of the input grammar
- `class Language::Lexer < ANTLR3::Lexer` – the lexer implementation
- `module Language::TokenData` – an `ANTLR3::TokenScheme` (subclass of `Module`), which is used to define token types and a token class
- `class Language::TokenData::Token < ANTLR3::CommonToken` – not apparent in the code above, this class is dynamically created along with `Language::TokenData`
A lexer must be provided with a text source to tokenize. More specifically, a lexer takes in a stream object, which feeds it characters from a text source, and produces a series of token objects. For example, below is an illustration of creating an instance of `Digit::Lexer` with a stream object:
```ruby
input = ANTLR3::StringStream.new( "123" )
lexer = Digit::Lexer.new( input )

input = ANTLR3::FileStream.new( 'numbers.txt' )
lexer = Digit::Lexer.new( input )
```
More often than not, the stream a lexer will operate upon will be an instance of `ANTLR3::StringStream` or `ANTLR3::FileStream`. Thus, the runtime library for this binding provides convenience casting for the most common cases. The `initialize` method of a lexer will automatically cast plain strings and file objects into appropriate stream objects.
```ruby
lexer = Digit::Lexer.new( "123" )

lexer = open( 'numbers.txt' ) do | f |
  Digit::Lexer.new( f )
end
```
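The casting behavior can be pictured with a small stand-in sketch. `StringStreamStub` and `cast_input` below are hypothetical illustrations of the duck typing involved, not the actual `antlr3` implementation:

```ruby
require 'stringio'

# StringStreamStub stands in for ANTLR3::StringStream; the real
# antlr3 classes are not reproduced here.
class StringStreamStub
  attr_reader :source
  def initialize(source)
    @source = source
  end
end

def cast_input(input)
  case input
  when StringStreamStub then input                 # already a stream
  when String           then StringStreamStub.new(input)
  else
    if input.respond_to?(:read)                    # File, StringIO, ...
      StringStreamStub.new(input.read)
    else
      raise ArgumentError, "cannot cast #{input.class} to a stream"
    end
  end
end

cast_input("123").source                  # => "123"
cast_input(StringIO.new("456")).source    # => "456"
```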
```ruby
require 'Digit'

lexer = Digit::Lexer.new( "123" )
lexer.next_token  # => DIGIT["1"] @ line 1 col 0 (0..0)
lexer.exhaust     # => [DIGIT["2"] @ line 1 col 1 (1..1), DIGIT["3"] @ line 1 col 2 (2..2)]
lexer.next_token  # => <EOF>

lexer.reset
lexer.map { | tk | tk.text.to_i }  # => [1, 2, 3]
```
```ruby
lexer  = Digit::Lexer.new( "123" )
tokens = ANTLR3::CommonTokenStream.new( lexer )
# `tokens' can be passed to the constructor of an ANTLR parser for parsing
```
In most usage scenarios, a developer does not need to invoke lexer rule methods directly; most lexer work is done by invoking `Lexer#next_token`. However, it is important to illustrate how a grammar’s identifiers are used in the output Ruby code. This is an area in which the `antlr3` package diverges from other ANTLR target conventions.
All ANTLR grammar rules are translated into method definitions. For parsers and tree parsers, the methods share the same name as the rule. ANTLR requires named lexer rules to start with a capital letter, so to avoid name conflicts with constants and to support style conventions in Ruby, lexer rule names are altered when they become method names in lexer classes.
Consider the following simple grammar:
```antlr
lexer grammar Variables;
options { language = Ruby; }

ID:           ( 'a' .. 'z' | '_' ) ( 'a' .. 'z' | 'A' .. 'Z' | '_' | '0' .. '9' )*;
WHOLE_NUMBER: ( '0' .. '9' )+;
SPACE:        ( ' ' | '\t' | '\n' | '\r' )+;
```
Below is an abridged skeleton of the code generated by `antlr4ruby`:
```ruby
module Variables
  #...
  class Lexer < ANTLR3::Lexer
    #...
    def id!
      # recognition logic for ID
    end

    def whole_number!
      # recognition logic for WHOLE_NUMBER
    end

    def space!
      # recognition logic for SPACE
    end

    def token!
      # the auto-generated rule used to pick the lexer rule
    end
  end
end
```
The method naming convention for lexer rule methods works as follows:

- reformat the name to use a lower-case convention:
  - `ALL_CAPS_AND_UNDERSCORES` names become `all_caps_and_underscores`
  - `CamelCase` names become `camel_case` names, as in Rails
- append an exclamation point
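The convention can be expressed as a small helper. This is an illustration of the mapping, not the code `antlr4ruby` itself uses:

```ruby
# Convert an ANTLR lexer rule name to a Ruby method name:
# split CamelCase boundaries, lower-case everything, append "!".
def lexer_rule_method_name(rule_name)
  snake = rule_name.
    gsub(/([a-z\d])([A-Z])/, '\1_\2').  # CamelCase -> Camel_Case
    downcase                            # ALL_CAPS and the rest -> lower
  snake + "!"                           # trailing bang avoids conflicts
end

lexer_rule_method_name("WHOLE_NUMBER")  # => "whole_number!"
lexer_rule_method_name("CamelCase")     # => "camel_case!"
```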
Clearly there’s a bit of opinionated design in this naming convention. Ruby does allow method names to begin with a capital letter, as in `Kernel#Array` or `Kernel#String`. So why aren’t lexer rule methods given the same name as the rule? Well:

- methods that begin with capital letters cannot be called privately without arguments or parentheses
- token types are implemented as integer constants sharing the lexer rule name

Thus, while reading the lexer code, it would be easy to mistake a reference to token type `X` for a call to rule method `X()`. All generated calls to method `X` would require extra parentheses, making the code a little messier. Furthermore, while Ruby does not impose a specific stylistic convention, the community generally adopts snake_case for method names.
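A short demonstration of the conflict: when a constant and a method share a capitalized name (the hypothetical `Demo` class below), a bare reference resolves to the constant, so the method can only be reached with explicit parentheses:

```ruby
# Why capitalized rule-method names would be awkward in Ruby:
class Demo
  DIGIT = 4           # token type constant, as in TokenData
  def DIGIT           # a hypothetical capitalized rule method
    "matched a digit"
  end

  def show
    [DIGIT, DIGIT()]  # bare name -> constant; parentheses -> method
  end
end

Demo.new.show  # => [4, "matched a digit"]
```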
The way other ANTLR language targets handle this name conflict is to define lexer rule `ID` with an extra leading lowercase `t`, making the method name `tID()`. As a developer, I’m not a fan of name mangling like that, so I chose to diverge from ANTLR conventions.
So why the exclamation point?
- It prevents lexer rule methods from conflicting with Ruby keywords. A lexer rule named `FOR` or `DEF` will be defined with method name `for!` or `def!`, avoiding potential syntax errors.
- It allows lexer rule methods to be referenced privately without parentheses, making code a little cleaner. While `id` may refer to a local variable or a method call, `id!` always refers to a method call.
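The disambiguation is easy to demonstrate. In the hypothetical class below, a local variable shadows the `id` method, but `id!` still unambiguously calls a method:

```ruby
class Record
  def id;  :method_id;  end
  def id!; :method_id!; end

  def inspect_names
    id = :local_variable     # shadows the `id` method
    [id, id!]                # `id` -> local; `id!` -> method call
  end
end

Record.new.inspect_names  # => [:local_variable, :method_id!]
```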
This naming convention for lexer rules could potentially cause a significant bug in the generated code in one particular situation. Consider this grammar:
```antlr
lexer grammar Strings;

FLOAT:   Integer '.' Integer;
INTEGER: Integer ( 'l' | 'L' )?;

fragment Integer: ( '0' .. '9' )+;
```
The method code is going to look something like:
```ruby
def float!
  # ... logic for FLOAT
end

def integer!
  # ... logic for INTEGER
end

def integer!
  # ... logic for Integer
end
```
Thus the fragment rule `Integer`’s method will overwrite the full rule method for `INTEGER`. I have seen this sort of convention used in a grammar, and, as a result of the lexer rule naming convention, it will break the generated lexer code. This shouldn’t be a common issue, as it isn’t sensible to have lexer rules `INTEGER` and `Integer` in the same grammar (`Integer` would make more sense as `DIGITS` in this example).
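The collision is easy to verify: both rule names map to the same method name, and Ruby silently keeps only the last definition of a redefined method. The `Collision` class below is an illustration of the effect, not generated code:

```ruby
# Both "INTEGER" and the fragment "Integer" collapse to "integer!"
["INTEGER", "Integer"].map { |rule| rule.downcase + "!" }
# => ["integer!", "integer!"]

class Collision
  def integer!; :full_rule; end      # method for INTEGER
  def integer!; :fragment_rule; end  # method for fragment Integer; overwrites the first
end

Collision.new.integer!  # => :fragment_rule
```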
Currently, `antlr4ruby` does not detect this problem to issue a warning, nor does it do anything to differentiate these names. While I may change the naming convention or find a way to have ANTLR issue a warning in the future, be aware that lexer rule names are effectively case-insensitive as far as code generation is concerned.