ohboyohboyohboy edited this page Sep 13, 2010 · 26 revisions

A lexer’s job is to take a stream of text and chop it up into a sequence of tokens — little strings categorized with type codes like NUMBER or KEYWORD. For more information on tokens and their characteristics, read Tokens.

For a general overview of lexers, check out:

How Lexers Differ from Parsers

I do not intend to cover many details of lexical analysis in this article. However, highlighting a few of the key differences between lexers and parsers should help clarify the documentation below for users who are new to ANTLR.

As described above, lexers chop raw language text into tokens. If a particular character sequence in the input cannot be matched against the lexical rules of the grammar, the lexer must handle the syntax error somehow and optionally report the error.

A parser’s task is then to take the stream of tokens extracted by the lexer and recognize any one of a number of larger syntactic structures. Parsers may have more than one point of entry. For example, a Python parser will probably have one entry rule for parsing full Python scripts. It will probably have another rule for parsing a single Python statement read from an interactive prompt. Since Python restricts eval to expressions, the parser may have an additional entry rule for strings passed to eval.

In contrast, lexers typically have a single point of entry. A grammar will define a number of different tokens that are possible within a language using lexer rules. ANTLR will derive an additional lexer rule that combines all of the lexical rules, invoking the most appropriate token rule based upon the next few characters in the input stream. Thus, while parsers can have any number of independent entry points, lexers automatically choose the most appropriate rule for the current input.

Lexer Code and Class Structure

For a combined or lexer grammar named Language, antlr4ruby will generate a lexer class. In other ANTLR targets, the generated class is named LanguageLexer. However, this ruby implementation generates a top-level module named Language, which serves as a general namespace for the various entities required by the code. The actual lexer class is simply named Lexer, so the full name of the generated lexer will be Language::Lexer. Consider this very simple lexer grammar:

lexer grammar Digit;

options {
  language = Ruby;
}

DIGIT: '0' .. '9';

An abbreviated form of the output generated by antlr4ruby Digit.g is shown below:

# edited out code that ensures the antlr3 runtime library is required

module Digit
  # TokenData defines all of the token type integer values
  # as constants, which will be included in all 
  # ANTLR-generated recognizers.
  const_defined?(:TokenData) or TokenData = ANTLR3::TokenScheme.new

  module TokenData

    # define the token constants
    define_tokens( :DIGIT => 4, :EOF => -1 )

  end


  class Lexer < ANTLR3::Lexer
    @grammar_home = Digit
    include TokenData

    begin
      generated_using( "Digit.g", "3.2.1-SNAPSHOT Dec 18, 2009 04:29:28", "1.6.3" )
    rescue NoMethodError => error
      error.name.to_sym == :generated_using or raise
    end

    RULE_NAMES   = ["DIGIT"].freeze
    RULE_METHODS = [:digit!].freeze

    def initialize(input=nil, options = {})
      super(input, options)
    end

    # - - - - - - - - - - - lexer rules - - - - - - - - - - - -
    # lexer rule digit! (DIGIT)
    # (in /home/kyle/lib/ruby/projects/antlr3/test/functional/lexer/basic/Digit.g)
    def digit!
       # edited out recognition logic
    end

    # main rule used to study the input at the current position,
    # and choose the proper lexer rule to call in order to
    # fetch the next token
    # 
    # usually, you don't make direct calls to this method,
    # but instead use the next_token method, which will
    # build and emit the actual next token
    def token!
      # at line 1:10: DIGIT
      digit!
    end
  end # class Lexer < ANTLR3::Lexer
end

Thus, the generated code for a lexer creates the following named entities:

  1. module Language – where Language is the name of the input grammar
  2. class Language::Lexer < ANTLR3::Lexer – the lexer implementation
  3. module Language::TokenData – an ANTLR3::TokenScheme (subclass of Module), which is used to define token types and a token class
  4. class Language::TokenData::Token < ANTLR3::CommonToken – not apparent in the code above, this class is dynamically created along with Language::TokenData
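To make the TokenData mechanics more concrete, here is a toy sketch of how a module can map symbolic token names to integer type constants, roughly what define_tokens does. This is an illustration only, not the actual antlr3 implementation (ToyTokenData is a made-up name):

```ruby
# Toy sketch of TokenData-style token definition: each symbolic
# token name becomes an integer constant defined on the module.
module ToyTokenData
  def self.define_tokens( map )
    map.each { | name, value | const_set( name, value ) }
  end

  define_tokens( :DIGIT => 4, :EOF => -1 )
end

ToyTokenData::DIGIT  # => 4
ToyTokenData::EOF    # => -1
```

The real ANTLR3::TokenScheme additionally builds a token class and reverse name lookup, but the constant definition shown above is the part that lets generated recognizers refer to token types by name.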

Instantiating Lexers

Providing an Input Stream

A lexer must be provided with a text source to tokenize. More specifically, a lexer takes in a stream object, which feeds it characters from a text source, and produces a series of token objects. For example, below is an illustration of creating an instance of Digit::Lexer with a stream object:

input = ANTLR3::StringStream.new( "123" )
lexer = Digit::Lexer.new( input )

input = ANTLR3::FileStream.new( 'numbers.txt' )
lexer = Digit::Lexer.new( input )

Providing Strings or Files Directly

More often than not, the stream a lexer will operate upon will be an instance of ANTLR3::StringStream or ANTLR3::FileStream. Thus, the runtime library for this binding provides convenience casting for the most common cases. The initialize method of a lexer will automatically cast plain strings and file objects into appropriate stream objects.

lexer = Digit::Lexer.new( "123" )

lexer =
  open( 'numbers.txt' ) do | f |
    Digit::Lexer.new( f )
  end

Fetching Tokens

require 'Digit'

lexer = Digit::Lexer.new( "123" )

lexer.next_token
# => DIGIT["1"] @ line 1 col 0 (0..0)

lexer.exhaust
# => [DIGIT["2"] @ line 1 col 1 (1..1), DIGIT["3"] @ line 1 col 2 (2..2)]

lexer.next_token
# => <EOF>

lexer.reset
lexer.map { | tk | tk.text.to_i } 
# => [1, 2, 3]
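The antlr3 runtime does the real work here, but the overall next_token behavior can be sketched in plain Ruby using only the standard library. In this sketch, ToyToken and toy_tokenize are stand-ins for illustration, not names from the runtime:

```ruby
require 'strscan'

# stand-in token struct -- the real lexer produces richer token objects
ToyToken = Struct.new( :type, :text, :start, :stop )

# conceptual sketch of what Digit::Lexer does to "123": repeatedly
# match one DIGIT character and wrap it in a token until end of input
def toy_tokenize( source )
  scanner = StringScanner.new( source )
  tokens = []
  until scanner.eos?
    start = scanner.pos
    text = scanner.scan( /[0-9]/ ) or raise "no viable token at position #{ start }"
    tokens << ToyToken.new( :DIGIT, text, start, start )
  end
  tokens
end

toy_tokenize( "123" ).map( &:text )  # => ["1", "2", "3"]
```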

Preparing a Lexer for Parser Input

lexer = Digit::Lexer.new( "123" )
tokens = ANTLR3::CommonTokenStream.new( lexer )
# `tokens' can be passed to the constructor of an ANTLR parser for parsing

Rule to Method Name Mapping

In most usage scenarios, a developer does not need to invoke lexer rule methods directly; most lexer work is done by invoking Lexer#next_token. However, it is important to illustrate how a grammar’s identifiers are used in the output ruby code. This is an area in which the antlr3 package diverges from other ANTLR target conventions.

All ANTLR grammar rules are translated into method definitions. For parsers and tree parsers, the methods share the same name as the rule. ANTLR requires named lexer rules to start with a capital letter. To avoid name conflicts with constants and to support style conventions in ruby, lexer rule names are altered when they become method names in lexer classes.

Consider the following simple grammar:

lexer grammar Variables;
options { language = Ruby; }

ID: ( 'a' .. 'z' | '_' ) ( 'a' .. 'z' | 'A' .. 'Z' | '_' | '0' .. '9' )*;
WHOLE_NUMBER: ( '0' .. '9' )+;
SPACE: ( ' ' | '\t' | '\n' | '\r' )+;

Below is an abridged skeleton of the code generated by antlr4ruby:

module Variables
#...
class Lexer < ANTLR3::Lexer
  #...
  def id!
    # recognition logic for ID
  end

  def whole_number!
    # recognition logic for WHOLE_NUMBER
  end

  def space!
    # recognition logic for SPACE
  end

  def token!
    # the auto-generated rule used to pick the lexer rule
  end
end

end
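The generated token! method essentially inspects lookahead characters to decide which lexer rule to invoke. As a hypothetical illustration (the real generated code uses the antlr3 runtime's decision machinery, not a case statement), the dispatch for the Variables grammar amounts to:

```ruby
# hypothetical sketch of token!'s job for the Variables grammar:
# peek at the next character and select the matching lexer rule
def pick_rule( lookahead )
  case lookahead
  when /[a-z_]/    then :id!
  when /[0-9]/     then :whole_number!
  when /[ \t\n\r]/ then :space!
  else raise "no viable alternative at #{ lookahead.inspect }"
  end
end

pick_rule( "x" )  # => :id!
pick_rule( "7" )  # => :whole_number!
pick_rule( " " )  # => :space!
```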

The method naming convention for lexer rule methods works as follows:

  1. reformat the name to use a lower case convention:
    1. ALL_CAPS_AND_UNDERSCORES names become all_caps_and_underscores
    2. CamelCase names become camel_case names, as in Rails
  2. append an exclamation point
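The two steps above can be sketched as a small Ruby function. This is a rough approximation for illustration (the actual conversion happens inside the code generator):

```ruby
# approximate rule-name -> method-name conversion:
# snake_case the rule name, then append '!'
def lexer_rule_method_name( rule_name )
  rule_name.gsub( /([a-z0-9])([A-Z])/, '\1_\2' ).downcase << '!'
end

lexer_rule_method_name( "WHOLE_NUMBER" )  # => "whole_number!"
lexer_rule_method_name( "CamelCase" )     # => "camel_case!"
```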

Rationale For Lexer Rule Naming Convention

Clearly there’s a bit of opinionated design in this naming convention. Ruby does allow method names to begin with a capital letter, as in Kernel#Array or Kernel#String. So why aren’t lexer rule methods given the same name as the rule? Well,

  1. methods that begin with capital letters cannot be called privately without arguments or parentheses.
  2. token types are implemented as integer constants sharing the lexer rule name.

Thus, while reading the lexer code, it’d be easy to confuse references to token type X with calls to rule method X(). All generated calls to method X would require extra parentheses, making the code a little messier. Furthermore, while ruby does not impose a specific stylistic convention, the community generally adopts snake_case for method names.
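The ambiguity is easy to demonstrate in plain Ruby (DIGIT here is just an illustrative name, not generated code): when a constant and a capitalized method share a name, a bare reference resolves to the constant, so the method can only be invoked with explicit parentheses.

```ruby
DIGIT = 4                 # a token-type style integer constant

def DIGIT                 # a capitalized method sharing the constant's name
  "method result"
end

DIGIT    # => 4           (bare reference resolves to the constant)
DIGIT()  # => "method result"  (parentheses force the method call)
```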

The way other ANTLR language targets handle the name conflict is to define lexer rule ID with an extra leading lowercase t, making the method name tID(). As a developer, I’m not a fan of name mangling like that, and thus I chose to diverge from ANTLR conventions.

So why the exclamation point?

  1. It prevents lexer rule methods from conflicting with ruby keywords. A lexer rule named FOR or DEF will be defined with method name for! or def!, avoiding potential syntax errors.
  2. It allows lexer rule methods to be referenced privately without parentheses, making code a little cleaner. While id may refer to a local variable or a method call, id! always refers to a method call.

Important Note About Lexer Rule Names

This naming convention for lexer rules could potentially cause a significant bug in the generated code in one particular situation. Consider this grammar:

lexer grammar Strings;

FLOAT:   Integer '.' Integer;
INTEGER: Integer ( 'l' | 'L' )?;

fragment Integer: ( '0' .. '9' )+;

The method code is going to look something like:

  def float!
    # ... logic for FLOAT
  end

  def integer!
    # ... logic for INTEGER
  end

  def integer!
    # ... logic for Integer
  end

Thus the fragment rule Integer’s method will overwrite the full rule method for INTEGER. I have seen this sort of convention used in real grammars, and, as a result of the lexer rule naming convention, it will break the generated lexer code. This shouldn’t be a common issue, as it’s not that sensible to have lexer rules “INTEGER” and “Integer” in the same grammar (“Integer” would make more sense as “DIGITS” in this example).
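Such collisions can be checked mechanically. Here is a hypothetical sketch (method_name_for approximates the generator's conversion) that flags rule names mapping to the same method name:

```ruby
# approximate the rule-name -> method-name mapping, then report
# any groups of rule names that collapse to a single method name
def method_name_for( rule_name )
  rule_name.gsub( /([a-z0-9])([A-Z])/, '\1_\2' ).downcase << '!'
end

rules = %w( FLOAT INTEGER Integer )
collisions = rules.group_by { | name | method_name_for( name ) }.
                   select { | _, group | group.size > 1 }

collisions  # => { "integer!" => ["INTEGER", "Integer"] }
```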

Currently, antlr4ruby does not detect this problem to issue a warning or do anything to differentiate these names. While I may change the naming convention or find a way to have ANTLR issue a warning in the future, be aware that lexer rules are effectively case insensitive as far as code generation is concerned.