Chapter 3: Format of the input file

The flexc++ input file consists of two sections, separated by a line containing `%%'. The section above %% contains option specifications and definitions; the section below %% contains the regular expressions (and their (optional) actions). The general layout of flexc++'s input file, therefore, looks like this:


definitions
%%
rules
    

Below, these sections are described in more detail.

3.1: Definitions section

Flexc++ supports command-line options and input-file directives controlling flexc++'s behavior. Directives are covered in the next section (3.1.1); options are covered in section 1.1.1.

The definitions section may also contain declarations of named regular expressions. A named regular expression looks like this:

name   pattern

Here, name is an identifier, which may also contain the hyphen (-). The `pattern' is a regular expression, see section 3.4. Patterns start at the first non-blank character following the name, and end at the line's last non-blank character. Consequently, a named regular expression itself cannot contain comment.

Finally, the definitions section may be used to declare mini-scanners (a.k.a. start conditions), cf. section 3.6. Mini-scanners are very useful for scanning small `sub-languages' in the language you want to scan. A commonly encountered example is the mini-scanner recognizing C style multi-line comment.

3.1.1: Directives

Some directives require arguments, which are usually provided following separating (but optional) = characters. Arguments of directives are text, surrounded by double quotes (strings). If a string must itself contain a double quote or a backslash, then precede these characters by a backslash. The exceptions are the %s and %x directives, which are immediately followed by name lists, consisting of identifiers separated by blanks. Here is an example of the definition of a directive:

    %class-name = "MyScanner"
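
The quoting rule for directive arguments can be sketched in C++. The helper quoteDirectiveArgument below is hypothetical (it is not part of flexc++); it only illustrates how raw text could be turned into a directive argument by preceding double quotes and backslashes with a backslash.

```cpp
#include <string>

// Hypothetical helper, not part of flexc++: turns raw text into a
// double-quoted directive argument, escaping " and \ with a backslash.
std::string quoteDirectiveArgument(std::string const &raw)
{
    std::string quoted = "\"";
    for (char ch : raw)
    {
        if (ch == '"' || ch == '\\')    // characters that must be escaped
            quoted += '\\';
        quoted += ch;
    }
    quoted += '"';
    return quoted;
}
```

Under these assumptions, the raw text MyScanner becomes "MyScanner", and a raw text containing a double quote or a backslash receives an extra backslash before each such character.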
        

The following directives are available:

3.2: Rules section

The rules section of the flexc++ input file contains a series of rules of the form:

pattern    action

The action is optional, and is separated from the pattern by spaces and/or tabs. It consists of a single-line C++-statement, or of a compound statement that may span several lines.

Alternatively, an action may consist of a vertical bar (`|'). In that case no blank is required between pattern and |. A vertical bar indicates that pattern uses the same action as the next rule.

3.3: Comment

Comment may be used almost everywhere in flexc++'s input file. Both traditional C-style multi-line comment (i.e., /* ... */) and C++ style end-of-line comment (i.e., // ...) can be used. Indentation is optional.

When comment is encountered outside of an action, flexc++ discards the comment, while all comment provided in the context of actions is copied verbatim to the generated source file.

Comment cannot be used when defining named regular expressions in the definitions section.

3.4: Patterns

The patterns in the input (see Rules Section 3.2) are written using an extended set of regular expressions. These are:

Inside of a character class all regular expression operators lose their special meaning except escape (\), the character class operator -, and, at the beginning of the class, ^. To use a closing bracket in a character class either start the character class as [] or as [^].

The operators used in specifying regular expressions have the following priorities (listed from lowest to highest):

Different from the lex-standard, but in line with most other regular expression engines, the interval operator is given higher precedence than concatenation. To require two repetitions of the word hello use (hello){2} rather than hello{2}, which to flexc++ is identical to the regular expression helloo.
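
The precedence rule can be tried out with any engine in which the interval operator binds tighter than concatenation. The sketch below uses C++'s std::regex (ECMAScript grammar) purely as such a stand-in; it is not flexc++'s own matcher, and the helper name fullMatch is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches all of `text`. std::regex is assumed
// here only as a stand-in engine where {n} binds tighter than concatenation.
bool fullMatch(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, fullMatch("hello{2}", "helloo") and fullMatch("(hello){2}", "hellohello") both hold, whereas fullMatch("hello{2}", "hellohello") does not.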

Named regular expressions have the same precedence as parenthesized regular expressions. So after


    WORD  xyz[a-zA-Z]
    %%
    {WORD}{2}
        
the input xyzaxyzb is matched, whereas xyzab isn't.
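
One way to picture this rule is to treat {WORD} as if its pattern were substituted between parentheses. The following C++ sketch does exactly that; expandNamed and matches are hypothetical helpers, and std::regex merely serves as a stand-in for checking the expanded pattern (flexc++ itself does not work this way internally as far as this text is concerned).

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical sketch: replace each {NAME} by its pattern wrapped in
// parentheses, mimicking the precedence of named regular expressions.
std::string expandNamed(std::string rule,
                        std::map<std::string, std::string> const &names)
{
    for (auto const &entry : names)
    {
        std::string const ref = "{" + entry.first + "}";
        std::string const repl = "(" + entry.second + ")";
        for (std::size_t pos = rule.find(ref); pos != std::string::npos;
             pos = rule.find(ref, pos + repl.size()))
            rule.replace(pos, ref.size(), repl);
    }
    return rule;
}

// Full match of `text` against `pattern` using std::regex as a stand-in.
bool matches(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

Expanding {WORD}{2} with WORD defined as xyz[a-zA-Z] gives (xyz[a-zA-Z]){2}, which matches xyzaxyzb but not xyzab.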

In addition to characters and ranges of characters, character classes can also contain predefined character sets. These consist of certain names between [: and :] delimiters. The predefined character sets are:

     
         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

These predefined sets designate sets of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] defines all characters for which isalnum returns true.

As an illustration, the following character classes are equivalent:

 
         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]
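
The equivalence of these four character classes can be checked against the standard C functions. The sketch below assumes the default "C" locale and an ASCII character set, and it only inspects the 7-bit characters; the function names are made up for this illustration.

```cpp
#include <cctype>

// True if ch belongs to the explicit class [a-zA-Z0-9] (ASCII assumed).
bool inExplicitClass(int ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
        || (ch >= '0' && ch <= '9');
}

// Checks, for all 7-bit characters, that the four formulations shown
// above select exactly the same characters.
bool classesAgree()
{
    for (int ch = 0; ch != 128; ++ch)
    {
        bool const alnum = std::isalnum(ch) != 0;           // [[:alnum:]]
        bool const alphaOrDigit =                           // [[:alpha:][:digit:]]
            std::isalpha(ch) != 0 || std::isdigit(ch) != 0;
        bool const alphaOrRange =                           // [[:alpha:][0-9]]
            std::isalpha(ch) != 0 || (ch >= '0' && ch <= '9');

        if (alnum != alphaOrDigit || alnum != alphaOrRange
            || alnum != inExplicitClass(ch))
            return false;
    }
    return true;
}
```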
    

Note that a negated character class like [^A-Z] matches a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This differs from the way many other regular expression engines treat negated character classes. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
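
The effect of including \n in a negated character class can be illustrated with a small C++ sketch. It uses std::regex as a stand-in engine (an assumption; it is not flexc++, although its negated classes also happen to match newlines), and the helper matchedPrefix is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the text matched by `pattern` at the very start of `text`,
// or an empty string if there is no match at that position.
std::string matchedPrefix(std::string const &pattern, std::string const &text)
{
    std::smatch result;
    std::regex const re(pattern);
    if (std::regex_search(text, result, re,
                          std::regex_constants::match_continuous))
        return result.str();
    return "";
}
```

For an input consisting of abc, a newline, def, and a double quote, the pattern [^"]* runs across the newline and stops only at the quote, whereas [^"\n]* stops at the end of the first line.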

Flexc++ allows negation of character class expressions by prepending ^ to the name of a predefined character set. Here are the negated predefined character sets:

                
         [:^alnum:] [:^alpha:] [:^blank:]
         [:^cntrl:] [:^digit:] [:^graph:]
         [:^lower:] [:^print:] [:^punct:]
         [:^space:] [:^upper:] [:^xdigit:]
    

The `{+}' operator computes the union of two character classes. For example, [a-z]{+}[0-9] is the same as [a-z0-9].

The `{-}' operator computes the difference of two character classes. For example, [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z] (which in this case, is just the single character `a').
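
Both operators are ordinary set operations on characters. As a minimal sketch (the names CharSet, charRange, classUnion and classDifference are invented for this illustration and say nothing about flexc++'s internals), they can be modeled with std::bitset:

```cpp
#include <bitset>

using CharSet = std::bitset<256>;

// Builds the set of characters in the inclusive range [first, last].
CharSet charRange(unsigned char first, unsigned char last)
{
    CharSet set;
    for (unsigned ch = first; ch <= last; ++ch)
        set.set(ch);
    return set;
}

// {+}: all characters that are in lhs, in rhs, or in both.
CharSet classUnion(CharSet const &lhs, CharSet const &rhs)
{
    return lhs | rhs;
}

// {-}: the characters of lhs that are not in rhs.
CharSet classDifference(CharSet const &lhs, CharSet const &rhs)
{
    return lhs & ~rhs;
}
```

In this model, classDifference(charRange('a', 'c'), charRange('b', 'z')) contains just the character a, and classUnion(charRange('a', 'z'), charRange('0', '9')) contains the 36 characters of [a-z0-9].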

A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern and, like / and $, cannot be grouped inside parentheses. A ^ that does not occur at the beginning of a rule, or a $ that does not occur at the end of a rule, loses its special properties and is treated as a normal character.

The following are invalid:

                
         foo/bar$
         <sc1>foo<sc2>bar
    
Note that the first of these can be rewritten `foo/bar\n'.
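
As an analogy, a rule with trailing context behaves much like a lookahead in other regular expression engines: the trailing part must be present, but it is not part of the matched text. The C++ sketch below models foo/bar\n with a std::regex lookahead; this is an illustration only, not how flexc++ implements trailing context, and the function name is made up.

```cpp
#include <regex>
#include <string>

// Models the rule foo/bar\n with a lookahead: returns the text matched at
// the start of `input`, or an empty string if the rule does not apply.
std::string matchFooBeforeBarNewline(std::string const &input)
{
    static std::regex const rule("foo(?=bar\\n)");
    std::smatch result;
    if (std::regex_search(input, result, rule,
                          std::regex_constants::match_continuous))
        return result.str();    // only "foo"; the lookahead is not consumed
    return "";
}
```

Given foobar followed by a newline the function returns foo; given foobar without the newline, or foobaz, it returns an empty string.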

If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special | action is explained below, see section 3.5):

                
         foo      |
         bar$     /* action goes here */
    
A comparable definition can be used to match a `foo' or a `bar'-at-the-beginning-of-a-line.

3.5: Actions

As described in Section 3.2, the second section of the flexc++ input file contains a series of rules: pairs of patterns and actions.

Specifications of patterns end at the first unescaped white space character; the action then starts at the first non-white space character. It usually contains C++ code, with two exceptions: the empty and the bar (|) action (see below). If the C++ code starts with a brace ({), the action can span multiple lines until the matching closing brace (}) is encountered. Flexc++ will correctly handle braces in strings and comments.

Actions can be empty (omitted). Empty actions discard the matched pattern. To avoid confusion it is advised to provide at least a simple comment stating that the matched input is ignored.

The bar action is an action containing only a single vertical bar (|). This tells flexc++ to use the action of the next rule. This can be repeated so the following rules all use the same action:


    a   |
    b   |
    c   std::cout << "Matched " << matched() << "\n";
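
C++ programmers may find it helpful to compare the bar action with case labels that share one statement in a switch. The following fragment is an analogy only (the function describe is invented for this sketch and has nothing to do with the code flexc++ generates):

```cpp
#include <string>

// Analogy only: three case labels share one statement, much like the
// rules a, b and c share one action through the bar action.
std::string describe(char ch)
{
    switch (ch)
    {
        case 'a':       // compare:  a   |
        case 'b':       // compare:  b   |
        case 'c':       // compare:  c   action
            return std::string("Matched ") + ch;
        default:
            return "";  // no rule applies
    }
}
```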
        
Actions can return an int value, which is usually interpreted as a token by the program calling the lexical scanning function lex, called by, e.g., a parser. When lex is called again it continues just beyond the last-matched point in the input stream.
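
The calling pattern can be sketched with a toy scanner. ToyScanner below is a hand-written stand-in, not a class generated by flexc++; its token values, and the convention that lex returns 0 once the input is exhausted, are assumptions made for this sketch. It shows how each call to lex resumes just beyond the previously matched text.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Token values invented for this sketch.
enum Token { END = 0, NUMBER = 256, WORD };

class ToyScanner
{
    std::string d_input;
    std::size_t d_pos = 0;      // just beyond the last-matched point
    std::string d_matched;

    public:
        explicit ToyScanner(std::string input) : d_input(std::move(input)) {}

        std::string const &matched() const { return d_matched; }

        // Returns the next token, resuming where the previous call stopped.
        int lex()
        {
            while (d_pos != d_input.size() && isSpace(d_input[d_pos]))
                ++d_pos;                    // like an empty action: ignored
            if (d_pos == d_input.size())
                return END;

            bool const digits = isDigit(d_input[d_pos]);
            std::size_t const begin = d_pos;
            while (d_pos != d_input.size() && !isSpace(d_input[d_pos])
                   && isDigit(d_input[d_pos]) == digits)
                ++d_pos;
            d_matched = d_input.substr(begin, d_pos - begin);
            return digits ? NUMBER : WORD;
        }

    private:
        static bool isSpace(char ch)
        {
            return std::isspace(static_cast<unsigned char>(ch)) != 0;
        }
        static bool isDigit(char ch)
        {
            return std::isdigit(static_cast<unsigned char>(ch)) != 0;
        }
};
```

A caller such as a parser would typically loop, e.g. while (int token = scanner.lex()), inspecting matched() for each token; for the input abc 42 x the toy scanner returns WORD, NUMBER, WORD and then END.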

3.6: Startconditions (Miniscanners)

Flexc++ enables a programmer to describe tokens with a set of regular expressions. Often a lexical scanner specification file uses multiple `languages': regular expressions, code, comment, C-type (double quote delimited) strings, etc.

For flexible handling of these sub-languages flexc++, like flex, offers start conditions, a.k.a. mini scanners. A start condition can be declared in the definition section of the lexer file:


%x  string
%%
...
    
A %x directive declares exclusive start conditions. Following %x a list (no commas) of start condition names is expected. Rules specified for exclusive start conditions only apply to that particular mini scanner. It is also possible to define inclusive mini scanners using %s. Rules not explicitly associated with a start condition (or explicitly associated with the (default) start condition StartCondition__::INITIAL) also apply to inclusive mini scanners.

A start condition is used in the rules section of the lexical scanner specification file as indicated in section 3.4. Here is a concrete example:


%x string
%%

\"              {
                    more();
                    begin(StartCondition__::string);
                }

<string>{
    \"          {
                    begin(StartCondition__::INITIAL);
                    return Token::STRING;
                }
    \\.|.       more();
}
    
This tells flexc++ that the double quote starts (begins) the StartCondition__::string mini scanner. The string mini scanner's rules then define what happens to double quoted strings: all its characters are collected, and eventually the string's content is returned by matched().

By default, scanners generated by flexc++ start in the StartCondition__::INITIAL start condition. When encountering a double quote, the scanner switches to the StartCondition__::string mini scanner. Now, only the rules that are defined for the string mini scanner are active. Once flexc++ encounters an unescaped double quote, it switches back to the StartCondition__::INITIAL mini scanner and returns Token::STRING to its caller, indicating that it has seen a C string.
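
The switching between the two start conditions can be mimicked by a hand-written state machine. The C++ sketch below is a model of the behavior described above, not the code flexc++ generates; the function scanStrings is hypothetical, it ignores all input outside of strings, and (unlike the . pattern) it does not treat newlines specially.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the text of every double-quoted string in `input`, quotes
// included, modeling the INITIAL and string start conditions.
std::vector<std::string> scanStrings(std::string const &input)
{
    enum class StartCondition { INITIAL, string };

    std::vector<std::string> strings;
    std::string matched;                        // text collected via more()
    StartCondition sc = StartCondition::INITIAL;

    for (std::size_t idx = 0; idx != input.size(); ++idx)
    {
        char const ch = input[idx];
        if (sc == StartCondition::INITIAL)
        {
            if (ch == '"')                      // rule: \"
            {
                matched = ch;
                sc = StartCondition::string;    // begin(StartCondition__::string)
            }
            continue;                           // other input: ignored here
        }

        matched += ch;
        if (ch == '\\' && idx + 1 != input.size())
            matched += input[++idx];            // rule: \\.  (escaped character)
        else if (ch == '"')                     // rule: \"   (closing quote)
        {
            strings.push_back(matched);         // return Token::STRING
            sc = StartCondition::INITIAL;       // begin(StartCondition__::INITIAL)
        }
    }
    return strings;
}
```

An escaped double quote inside a string does not end it, because the backslash and the character following it are collected together.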

3.6.1: Notation details

Instead of using a mini scanner compound statement, it is also possible to provide rules with explicit mini-scanner specifications (cf. section 3.4). Here is the string mini scanner once again, now using explicit mini-scanner specifications:

%x string
    
%%

\"              {
                    more();
                    begin(StartCondition__::string);
                }
<string>\"      {
                    begin(StartCondition__::INITIAL);
                    return Token::STRING;
                }
<string>\\.|.   more();
    

3.7: Members

The Scanner class offers the following members, which can be called from within actions (or by members called from those actions):

3.8: Handling input your own way

Assuming that the scanner class is called `Scanner', the class Input is nested within the class `ScannerBase'. The stream from which flexc++ retrieves characters is completely decoupled from the pattern-matching algorithm implemented in the ScannerBase class: the pattern-matching algorithm retrieves its next character from an object of the class Input. The default Input class usually provides all the required functionality, but users of flexc++ may optionally provide their own Input class.

In situations where the default Input implementation doesn't suffice, simply `roll your own': implement the interface shown below and use the %option input-interface and %option input-implementation options in the lexer file to include, respectively, your own class Input interface in the generated scannerbase.h file and your Input member function implementations in the generated lex.cc file.

When implementing your own class Input, the following public interface must at least be provided:


    class Input
    {
        public:
            Input();
            Input(std::istream *iStream);   // dynamically allocated iStream
            size_t get();                   // the next character
            size_t lineNr() const;          
            void reRead(size_t ch);         // push back 'ch' (if <= 0x100)
                                            // push back str from idx 'fmIdx'
            void reRead(std::string const &str, size_t fmIdx);

            void close();                 // delete dynamically allocated
    };
        
This interface may be augmented with additional members, but the shown interface is used by ScannerBase. Flexc++ places Input in ScannerBase's private interface and all communication with Input is handled by ScannerBase. Input's members must perform the following tasks: