To `tokenize' main.cc, execute:
a.out < main.cc
)
QUICK START: FLEXC++ and BISONC++
To interface flexc++ to the bisonc++(1) parser generator proceed as follows:
- Specify a grammar that can be processed by bisonc++(1). Assuming
that the scanner and parser are developed in, respectively, the
sub-directories scanner and parser, then a simple grammar
specification that can be used with the scanner developed in the previous
section is, e.g., write the file parser/grammar:
%scanner ../scanner/scanner.h
%scanner-token-function d_scanner.lex()
%token IDENTIFIER NUMBER CHAR
%%
startrule:
startrule tokenshow
|
tokenshow
;
tokenshow:
token
{
std::cout << "matched: " << d_scanner.matched() << '\n';
}
;
token:
IDENTIFIER
|
NUMBER
|
CHAR
;
- Write a scanner specification file. E.g.,
%%
[ \t\n]+ // skip white space chars.
[0-9]+ return Parser::NUMBER;
[[:alpha:]_][[:alpha:][:digit:]_]* return Parser::IDENTIFIER;
. return Parser::CHAR;
This causes the scanner to return Parser tokens to the generated
parser.
- Add the line
#include "../parser/Parserbase.h"
to the file scanner/scanner.ih
- Write a simple main function in the file main.cc. E.g.,
#include "parser/Parser.h"
int main(int argc, char **argv)
{
Parser parser;
parser.parse();
}
- Generate a scanner in the scanner subdirectory:
flexc++ lexer
- Generate a parser in the parser subdirectory:
bisonc++ grammar
- Compile all sources:
g++ --std=c++0x *.cc */*.cc
- Execute the program, providing it some source file to be processed:
a.out < main.cc
3. GENERATED FILES
Flexc++ generates four files from a well-formed input file:
- A file containing the implementation of the lex member function
and its support functions. By default this file is named lex.cc.
- A file containing the scanner's class interface. By default this file
is named scanner.h. The scanner class itself is generated once and is
thereafter `owned' by the programmer, who may change it ad-lib. Newly
added members (data members, function members) will survive future flexc++ runs
as flexc++ will never rewrite an existing scanner class interface file, unless
explicitly ordered to do so. (see also scanner.h(3flexc++)).
- A file containing the interface of the scanner class's base
class. The scanner class is publicly derived from this base class. It is used
to minimize the size of the scanner interface itself. The scanner base class
is `owned' by flexc++ and should never be hand-modified. By
default the scanner's base class is provided in the file
scannerbase.h. At each new flexc++ run this file is rewritten unless flexc++
is explicitly ordered not to do so (see also scannerbase.h(3flexc++)).
- A file containing the implementation header. This file should
contain includes and declarations that are only required when compiling the
members of the scanner class. By default this file is named
scanner.ih. This file, like the file containing the scanner class's
interface is never rewritten by flexc++ unless flexc++ is explicitly ordered to do
so (see also implementationheader(3flexc++)).
4. OPTIONS
If available, single letter options are listed between parentheses
following their associated long-option variants. Single letter options require
arguments if their associated long options require arguments as well.
- --baseclass-header=header (-b)
Use header as the pathname of the file containing the scanner
class's base class. Defaults to the name of the scanner class plus
base.h
- --baseclass-skeleton=skeleton (-C)
Use skeleton as the pathname of the file containing the
skeleton of the scanner class's base class. Its filename defaults
to flexc++base.h.
- --case-insensitive
Use this option to generate a scanner case insensitively
matching regular expressions. All regular expressions specified in
flexc++'s input file are interpreted case insensitively and the
resulting scanner object will case insensitively interpret its
input.
When this option is specified the resulting scanner does not
distinguish between the following rules:
First // initial F is transformed to f
first
FIRST // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched,
and flexc++ will issue warnings for the second and third rule about
rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First rule is matched for
all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the
matched text (as returned by the scanner's matched() member)
always contains the original, unmodified text. So, with the above
input matched() returns, respectively first, First, FIRST
and firST, while matching the rule First.
- --class-header=header (-c)
Use header as the pathname of the file containing the scanner
class. Defaults to the name of the scanner class plus the suffix
.h
- --class-name=class
Use class (rather than Scanner) as the name of the scanner
class. Unless overridden by other options generated files will be
given the (transformed to lower case) class* name instead of
scanner*.
- --class-skeleton=skeleton (-C)
Use skeleton as the pathname of the file containing the
skeleton of the scanner class. Its filename defaults to
flexc++.h.
- --construction (-K)
Write details about the lexical scanner to the file
`rules-file'.output. Details cover the used character ranges,
information about the regexes, the raw NFA states, and the final
DFAs.
- --debug (-d)
Provide lex and its support functions with debugging code,
showing the actual parsing process on the standard output
stream. When included, the debugging output is active by default,
but its activity may be controlled using the setDebug(bool
on-off) member. Note that #ifdef DEBUG macros are not used
anymore. By rerunning flexc++ without the --debug option an
equivalent scanner is generated not containing the debugging
code.
- --filenames=genericName (-f)
Generic name of generated files (header files, not the
lex-function source file, see the --lex-source option for
that). By default the header file names will be equal to the name
of the generated class.
- --force-class-header
By default the generated class header is not overwritten once it
has been created. This option can be used to force the
(re)writing of the file containing the scanner's class.
- --force-implementation-header
By default the generated implementation header is not overwritten
once it has been created. This option can be used to force the
(re)writing of the implementation header file.
- --help (-h)
Write basic usage information to the standard output stream and
terminate.
- --implementation-header=header (-i)
Use header as the pathname of the file containing the
implementation header. Defaults to the name of the generated
scanner class plus the suffix .ih. The implementation header
should contain all directives and declarations only used by
the implementations of the scanner's member functions. It is the
only header file that is included by the source file containing
lex()'s implementation . User defined implementation of
other class members may use the same convention, thus
concentrating all directives and declarations that are required
for the compilation of other source files belonging to the scanner
class in one header file.
- --implementation-skeleton=skeleton (-I)
Use skeleton as the pathname of the file containing the
skeleton of the implementation header. Its filename defaults to
flexc++.ih.
- --interactive
Generate an interactive scanner. An interactive scanner reads lines
from the input stream, and then returns the tokens encountered on
that line. The interactive scanner implemented by flexc++ only
predefines the Scanner(std::istream &in, std::ostream &out)
constructor, by default assuming that input is read from
std::cin.
- --lex-skeleton=skeleton (-L)
Use skeleton as the pathname of the file containing the
lex() member function's skeleton. Its filename defaults to
flexc++.cc.
- --lex-function-name=funname
Use funname rather than lex as the name of the member
function performing the lexical scanning.
- --lex-source=source (-l)
Define source as the name of the source file containing the
scanner member function lex. Defaults to lex.cc.
- --matched-rules (-'R')
The generated scanner will write the numbers of matched rules to
the standard output. It is implied by the --debug option.
Displaying the matched rules can be suppressed by calling the
generated scanner's member setDebug(false) (or, of course, by
re-generating the scanner without using specifying
--matched-rules).
- --max-depth=depth (-m)
Set the maximum inclusion depth of the lexical scanner's
specification files to depth. By default the maximum depth is
set to 10. When more than depth specification files are used
the scanner throws a Max stream stack size exceeded
std::length_error exception.
- --namespace=namespace (-n)
Define the scanner base class, the paser class and the scanner
implentations in the namespace namespace. By default
no namespace is defined. If this options is used the
implementation header will contain a commented out using
namespace declaration for the requested namespace.
- --no-baseclass-header
Do not write the file containing the scanner's base class interface
even if it doesn't yet exist. By default the file containing the
scanner's base class interface is (re)written each time flexc++ is
called.
- --no-lines
Do not put #line preprocessor directives in the file containing
the scanner's lex function. By default #line directives
are entered at the beginning of the action statements in the
generated lex.cc file, allowing the compiler and debuggers
to associate errors with lines in your grammar specification
file, rather than with the source file containing the lex
function itself.
- --no-lex-source
Do not write the file containing the scanner's predefined scanner
member functions, even if that file doesn't yet exist. By default
the file containing the scanner's lex member function is
(re)written each time flexc++ is called. This option
should normally be avoided, as this file contains parsing
tables which are altered whenever the grammar definition is
modified.
- --own-tokens (-T)
The tokens returned as well as the text matched when flexc++ reads
its input files(s) are shown when this option is used.
This option does not result in the generated program displaying
returned tokens and matched text. If that is what you want, use
the --print-tokens option.
- --print-tokens (-t)
The tokens returned as well as the text matched by the generated
lex function are displayed on the standard output stream, just
before returning the token to lex's caller. Displaying tokens
and matched text is suppressed again when the lex.cc file is
generated without using this option. The function showing the
tokens (ScannerBase::print__) is called from
Scanner::printTokens, which is defined in-line in
scanner.h. Calling ScannerBase::print__, therefore, can
also easily be controlled by an option controlled by the program
using the scanner object.
This option does not show the tokens returned and text matched
by flexc++ itself when reading its input s. If that is what
you want, use the --own-tokens option.
- --show-filenames (-F)
Write the names of the files that are generated to the
standard error stream.
- --skeleton-directory=directory (-S)
Specifies the directory containing the skeleton files to use. This
option can be overridden by the specific skeleton-specifying
options (-B -C, -H, and -I).
- --target-directory=directory
Specifies the directory where generated files should be written.
By default this is the directory of flexc++'s input file.
The --target-directory option does not affect files that were
explicitly named (either as option or as directive).
- --usage (-h)
Write basic usage information to the standard output stream and
terminate.
- --verbose(-V)
The verbose option generates on the standard output stream various
pieces of additional information, not covered by the
--construction and --show-filenames options.
- --version (-v)
Display flexc++'s version number and terminate.
5. INTERACTIVE SCANNERS
An interactive scanner is characterized by the fact that scanning is postponed
until an end-of-line character has been received, followed by reading all
information on the line, read so far. Flexc++ supports the --interactive
option (or the equivalent %interactive directive), generating an
interactive scanner. Here it is assumed that
Scanner is the name of the scanner class generated by flexc++.
The interactive scanner generated by flexc++ has the following characteristics:
- The Scanner class is derived privately from
std::istringstream and (as usual) publicly from ScannerBase.
- The istringstream base class is constructed by its default
constructor.
- The function lex's default implementation is removed from
scanner.h and is implemented in the generated lex.cc source
file. It performs the following tasks:
- If the token returned by the scanner is not equal to 0 it is
returned as then next token;
- Otherwise the next line is retrieved from the input stream
passed to the Scanner's constructor (by default std::cin).
If this fails, 0 is returned.
- A '\n' character is appended to the just read line, and the
scanner's std::istringstream base class object is
re-initialized with that line;
- The member lex__ returns the next token.
This implementation allows code calling Scanner::lex() to conclude, as
usual, that the input is exhausted when lex returns 0.
Here is an example of how such a scanner could be used:
// scanner generated with: 'flexc++ --interactive lexer' or with
// 'flexc++ lexer' if lexer contains the %interactive directive
int main()
{
Scanner scanner; // by default: read from std::cin
while (true)
{
cout << "? "; // prompt at each line
while (true) // process all the line's tokens
{
int token = scanner.lex();
if (token == '\n') // end of line: new prompt
break;
if (token == 0) // end of input: done
return 0;
// process other tokens
cout << scanner.matched() << '\n';
if (scanner.matched()[0] == 'q')
return 0;
}
}
}
6. SPECIFICATION FILE(S)
Flexc++ expects an input file containing directives and the regular
expressions that should be recognized by objects of the scanner class
generated by flexc++. In this man page the elements and organization of flexc++'s
input file is described.
Flexc++'s input file consists of two sections, separated from each other by
a line merely containing two consecutive percent characters:
%%
The section before this separator contains directives; the section
following this separator contains regular expressions and possibly actions to
perform when these regular expressions are matched by the object of the
scanner class generated by flexc++.
White space is usually ignored, as is comment, which may be of the
traditional C form (i.e., /*, followed by (possibly multi-line)
comment text, followed by */, and it may be C++ end-of-line comment:
two consecutive slashes (//) start the comment, which continues up to
the next newline character.
6.1. FILE SWITCHING
Flexc++'s input file may be split into multiple files. This allows for the
definition of logically separate elements of the specifications in different
files. Include directives must be specified on a line of their own. To
switch to another specification file the following stanza is used:
//include file-location
The //include directive starts in the line's first column. File
locations can be absolute or relative to the location of the file containing
the //include directive. White space characters following //include
and before the end of the line are ignored. The file specification may be
surrounded by double quotes, but these double quotes are not required and are
ignored (removed) if present. All remaining characters are expected to define
the name of the file where flexc++'s rules specifications continue. Once end of
file of a sub-file has been reached, processing continues at the line beyond
the //include directive of the previously scanned file. The end-of-file of
the file that was initially specified when flexc++ was called indicates the end
of flexc++'s rules specification.
6.2. DIRECTIVES
The first section of flexc++'s input file consists of directives. In
addition it may associate regular expressions with symbolic names, allowing
you to use these identifiers in the rules section. Each directive is defined
on a line of its own. When available, directives are overridden by flexc++
command line options.
Some directives require arguments, which are usually provided following
separating (but optional) = characters. Arguments of directives, are text,
surrounded by double quotes (strings). If a string must itself contain a
double quote or a backslash, then precede these characters by a backslash. The
exceptions are the %s and %x directives, which are immediately
followed by name lists, consisting of identifiers separated by blanks. Here
is an example of the definition of a directive:
%class-name = "MyScanner"
The following directives are available:
- %baseclass-header = "header"
Defines the pathname of the file containing the scanner class's
base class interface. Corresponding command-line option:
--baseclass-header.
- %case-insensitive
Generates a scanner case insensitively matching regular
expressions. All regular expressions specified in flexc++'s input
file are interpreted case insensitively and the resulting scanner
object will case insensitively interpret its input.
Corresponding command-line option: --cases-insensitive.
When this directive is specified the resulting scanner does not
distinguish between the following rules:
First // initial F is transformed to f
first
FIRST // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched,
and flexc++ will issue warnings for the second and third rule about
rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First rule is matched for
all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the
matched text (as returned by the scanner's matched() member)
always contains the original, unmodified text. So, with the above
input matched() returns, respectively first, First, FIRST
and firST, while matching the rule First.
- %class-header = "header"
Defines the pathname of the file containing the scanner class's
interface. Corresponding command-line option:
--class-header.
- %class-name = "class-name"
Declares the name of the scanner class generated by flexc++. This
directive corresponds to the %name directive used by
flex++(1). Contrary to flex++'s %name declaration,
class-name may appear anywhere in the first section of the
grammar specification file. It may be defined only once. If no
class-name is specified the default class name (Scanner)
is used. Corresponding command-line option:
--class-name.
- %debug
Provide lex and its support functions with debugging code,
showing the actual parsing process on the standard output
stream. When included, the debugging output is active by default,
but its activity may be controlled using the setDebug(bool
on-off) member. Note that no #ifdef DEBUG macros are used in
the generated code.
- %implementation-header = "header"
Defines the pathname of the file to contain the implementation
header. Corresponding command-line option:
--implementation-header.
- %input-implementation = "sourcefile"
Defines the pathname of the file containing the implementation of a
user-defined Input class.
- %input-interface = "interface"
Defines the pathname of the file containing the interface of a
user-defined Input class. See input(3flexc++) for
additional information about user-defined Input classes.
- %interactive
Generate an interactive scanner. An interactive scanner reads lines
from the input stream, and then returns the tokens encountered on
that line. The interactive scanner implemented by flexc++ only
predefines the Scanner(std::istream &in, std::ostream &out)
constructor, by default assuming that input is read from
std::cin. See also the INTERACTIVE SCANNER section in
flexc++(1).
- %lex-function-name = "funname"
Defines the name of the scanner class's member to perform the
lexical scanning. If this directive is omitted the default name
(lex) is used. Corresponding command-line option:
--lex-function-name.
- %lex-source = "source"
Defines the pathname of the file to contain the scanner member
lex. Corresponding command-line option: --lex-source.
- %no-lines
Do not put #line preprocessor directives in the file containing
the scanner's lex function. If omitted #line directives
are added to this file, unless overridden by the command line
options --lines and --no-lines.
- %namespace = "namespace"
Define the scanner class in the namespace namespace. By default
no namespace is used. If this options is used the
implementation header is provided with a commented out using
namespace declaration for the requested namespace. This
directive is overridden by the --namespace command-line
option.
- %print-tokens
This option results in the tokens as well as the matched text to be
displayed on the standard output stream, just before returning the
token to lex's caller. Displaying is suppressed again when the
lex.cc file is generated without using this directive. The
function showing the tokens (ScannerBase::print__) is called
from Scanner::print(), which is defined in-line in
scanner.h. Calling ScannerBase::print__, therefore, can
also easily be controlled by an option controlled by the program
using the scanner object.
This option does not show the tokens returned and text matched
by flexc++ itself when reading its input s. If that is what
you want, use the --own-tokens option.
- %s namelist
The %s directive is followed by a list of one or more
identifiers, separated by blanks. Each identifier is the name of
an inclusive mini scanner.
- %skeleton-directory = "path"
Use path rather than the default (e.g., /usr/share/flexc++)
path when looking for flexc++'s skeleton files. Corresponding
command-line option: --skeleton-directory.
- %target-directory = "path"
Generate files in path rather than in flexc++'s input file's
directory.
The %target-directory option does not affect files that were
explicitly named (either as option or as directive).
- %x namelist
The %x directive is followed by a list of one or more
identifiers, separated by blanks. Each identifier is the name of
an exclusive mini scanner.
6.3. MINI SCANNERS
Mini scanners come in two flavors: inclusive mini scanners and exclusive
mini scanners. The rules that apply to an inclusive mini scanner are the mini
scanner's own rules as well as the rules which apply to no mini scanners in
particular (i.e., the rules that apply to the default (or INITIAL) mini
scanner). Exclusive mini scanners only use the rules that were defined for
them.
To define an inclusive mini scanner use %s, followed by one
or more identifiers specifying the name(s) of the mini-scanner(s). To define
an exclusive mini scanner use %x, followed by or more identifiers
specifying the name(s) of the mini-scanner(s). The following example defines
the names of two mini scanners: string and comment:
%x string comment
Following this, rules defined in the context of the string mini
scanner (see below) will only be used when that mini scanner is active.
A flexc++ input file may contain multiple %s and %x
specifications.
6.4. DEFINITIONS
Definitions are of the form
identifier regular-expression
Each definition must be entered on a line of its own. Definitions
associate identifiers with regular expressions, allowing the use of
${identifier} as synonym for its regular expression in the rules section
of the flexc++ input file. One defined, the identifiers representing regular
expressions can also be used in subsequent definitions.
Example:
FIRST [A-Za-z_]
NAME {FIRST}[-A-Za-z0-9_]*
6.5. %% SEPARATOR
Following directives and definitions a line merely containing two consecutive
% characters is expected. Following this line the rules are defined. Rules
consist of regular expressions which should be recognized, possibly followed
by actions to be executed once a rule's regular expression has been matched.
6.6. REGULAR EXPRESSIONS
The regular expressions defined in flexc++'s rules files are matched against
the information passed to the scanner's lex function.
Regular expressions begin as the first non-blank character on a line. Comment
is interpreted as comment as long as it isn't part of the regular
expresssion. To define a regular expression starting with two slashes (at
least) the first slash can be escaped or double quoted. (E.g., "//".*
defines C++ comment to end-of-line).
Regular expressions end at the first blank character (to add a blank character,
e.g., a space character, to a regular expression, prefix it by a backslash or
put it in a double-quoted string).
Actions may be associated with regular expressions. At a match the action
that is associated with the regular expression is executed, after which
scanning continues when the lexical scanning function (e.g., lex) is
called again. Actions are not required, and regular expressions can be defined
without any actions at all. If such action-less regular expressions are
matched then the match is performed silently, after which processing
continues.
Flexc++ tries to match as many characters of the input file as possible (i.e.,
it uses `greedy matching'). Non-greedy matching is accomplished by a
combination of a scanner and parser and/or by using the `lookahead' operator
(/).
The following regular expression `building blocks' are available. More complex
regular expressions are created by combining them:
- x
-
the character `x'
- .
-
any character (byte) except newline
- [xyz]
-
a character class; in this case, the pattern matches either an `x',
a `y', or a `z'
- [abj-oZ]
-
a character class containing a range; matches an `a', a `b', any
letter from `j' through `o', or a `Z'
- [^A-Z]
- a negated character class, i.e., any character except
for those in the class. In this example, any non-capital character.
- "[xyz]\"foo"
-
text between double quotes matches the literal string: [xyz]"foo.
- \X
-
if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
interpretation of `\x' is matched. Otherwise, a literal `X' is matched
(this is used to escape operators such as `*').
- \0
-
a NUL character (ASCII code 0).
- \123
-
the character with octal value 123.
- \x2a
-
the character with hexadecimal value 2a.
- (r)
-
the regular expression `r'; parentheses are used to override
precedence (see below)
- {name}
-
the expansion of the `name' definition.
- r*
-
zero or more regular expressions `r'. This also matches the empty
string.
- r+
-
one or more regular expressions `r'.
- r?
-
zero or one regular expression `r'. This also matches the empty
string.
- rs
-
the regular expression `r' followed by the regular expression `s';
called concatenation
- r{m, n}
-
regular expression `r' at least m, but at most n times (1 <= m
<= n).
- r{m,}
-
regular expression `r' m or more times (1 <= m).
- r{m}
-
regular expression `r' exactly m times (1 <= m).
- r|s
-
either regular expression `r' or regular expression `s'
- r/s
-
regular expression `r' if it is followed by regular expression
`s'. The text matched by `s' is included when determining whether this
rule results in the longest match, but `s' is then returned to the input
before the rule's action (if defined) is executed.
- ^r
-
a regular expression `r' at the beginning of a line or file.
- r$
-
a regular expression `r', occurring at the end of a line. This
pattern is identical to `r/\n'.
- <s>r
-
a regular exprression `r' in start condition `s'
- <s1,s2,s3>r
-
a regular exprression `r' in start conditions s1, s2, or s3.
- <*>r
-
a regular exprression `r' in all start conditions.
- <<EOF>>
-
an end-of-file.
- <s1,s2><<EOF>>
-
an end-of-file when in start conditions s1 or s2
Inside a character class all regular expression operators lose their special
meanings, except for the escape character (\) and the character class
operators -, ]], and, at the beginning of the class, ^. To add a
closing bracket to a character class use []. To add a closing bracket to a
negated character class use [^]. Once a character class has started, all
subsequent character (ranges) are added to the set, until the final closing
bracket (]) has been reached.
The regular expressions listed above are grouped according to precedence, from
highest precedence at the top to lowest at the bottom. From lowest to highest
precedence, the operators are:
- |: the or-operator at the end of a line (instead of an action)
indicates that this expression's action is identical to the action of the next
rule.
- /: the look-ahead operator;
- |: the or-operator withn a regular expression;
- CHAR: individual elements of the regular expression: characters,
strings, quoted characters, escaped characters, character sets etc. are all
considered CHAR elements. Multiple CHAR elements can be combined by
enclosing them in parentheses (e.g., (abc)+ indicates sequences of abc
characters, like abcabcabc);
- *, ?, +, {: multipliers:
?: zero or one occurrence of the previous element;
+: one or more repetitions of the previous element;
*: zero or more repetitions of the previous element;
{...}: interval specification: a specified number of
repetitions of the previous element (see above for specific
forms of the interval specification)
- {+}, {-}: set operators ({+} computing the union of two sets,
{-} computing the difference of the left-hand side set
minus the elements in the right-hand side set);
The lex standard defines concatenation as having a higher precedence than the
interval expression. This is different from many other regular expression
engines, and flexc++ follows these latter engines, giving all `multiplication
operators' equal priority.
Name expansion has the same precedence as grouping (using parentheses to
influence the precedence of the other operators in the regular expression).
Since the name expansion is treated as a group in flexc++, it is not allowed to
use the lookahead operator in a name definition (a named pattern, defined in
the definition section).
Character classes can also contain character class expressions. These are
expressions enclosed inside [: and :] delimiters (which themselves
must appear between the [ and ] of the character class. Other elements
may occur inside the character class as well). The character class expressions
are:
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
Character class expressions designate a set of characters equivalent to
the corresponding standard C isXXX function. For example, [:alnum:]
designates those characters for which isalnum returns true - i.e., any
alphabetic or numeric character. For example, the following character classes
are all equivalent:
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:][0-9]]
[a-zA-Z0-9]
A negated character class such as the example [^A-Z] above will match a
newline unless \n (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class (e.g.,
[^A-Z\n]). This differs from the way many other regular expression tools
treat negated character classes, but unfortunately the inconsistency is
historically entrenched. Matching newlines means that a pattern like [^"]*
can match the entire input unless there's another quote in the input.
Flexc++ allows negation of character class expressions by prepending ^ to
the POSIX character class name.
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
The {-} operator computes the difference of two character classes. For
example, [a-c]{-}[b-z] represents all the characters in the class
[a-c] that are not in the class [b-z] (which in this case, is just the
single character a). The {-} operator is left associative, so
[abc]{-}[b]{-}[c] is the same as [a].
The {+} operator computes the union of two character classes. For example,
[a-z]{+}[0-9] is the same as [a-z0-9]. This operator is useful when
preceded by the result of a difference operation, as in,
[[:alpha:]]{-}[[:lower:]]{+}[q], which is equivalent to [A-Zq] in the
C locale.
A rule can have at most one instance of trailing context (the / operator
or the $ operator). The start condition, ^, and <<EOF>> patterns
can only occur at the beginning of a pattern, and cannot be surrounded by
parentheses. The characters ^ and $ only have their special properties
at, respectively, the beginning and end of regular expressions. In all other
cases they are treated as a normal characters.
6.7. SPECIFICATION EXAMPLE
%option debug
%x comment
NAME [[:alpha:]][_[:alnum:]]*
%%
"//".* // ignore
"/*" begin(comment);
<comment>.|\n // ignore
<comment>"*/" begin(INITIAL);
^a return 1;
a return 2;
a$ return 3;
{NAME} return 4;
.|\n // ignore
)
7. THE CLASS INTERFACE: SCANNER.H
By default, flexc++ generates a file scanner.h containing the initial
interface of the scanner class performing the lexical scan according to the
specifications given in flexc++'s input file. The name of the file that is
generated can easily be changed using flexc++'s --class-header
option. In this man-page we'll stick to using the default name.
The file scanner.h is generated only once, unless an explicit request
is made to rewrite it (using flexc++'s --force-class-header option).
The provided interface is very light-weight, primarily offering a link to
the scanner's base class (see scannerbase.h(3flexc++).
Many of the facilities offered by the scanner class are inherited from
the ScannerBase base class, and the reader should consult
scannerbase.h(3flexc++) for an overview of additional facilities
offered by the Scanner class.
7.1. NAMING CONVENTION
All symbols that are required by the generated scanner class end in two
consecutive underscore characters (e.g., executeAction__). These names
should not be redefined. As they are part of the Scanner and
ScannerBase class their scope is immediately clear and confusion with
identically named identifiers elsewhere is unlikely.
Some member functions do not use the underscore convention. These are the
scanner class's constructors, or names that are similar or equal to names that
have historically been used (e.g., length). Also, some functions are
offered offering hooks into the implementation (like preCode). The latter
category of function also have names that don't end in underscores.
7.2 CONSTRUCTORS
- explicit Scanner(std::istream &in = std::cin,
std::ostream &out = std::cout)
This constructor by default reads information from the standard input
stream and writes to the standard output stream. When the
Scanner object goes out of scope the input and output files are closed.
With interactive scanners input stream switching or stacking is not
available; switching output streams, however, is.
- Scanner(std::string const &infile, std::string const &outfile)
This constructor opens the input and output streams whose file names
were specified. When the Scanner object goes out of scope the input and
output files are closed. If outfile == "-" then the standard output stream
is used as the scanner's output medium; if outfile == "" then the
standard error stream is used as the scanner's output medium.
This constructor is not available with interactive scanners.
7.3. PUBLIC MEMBER FUNCTIONS
- int lex()
The lex function performs the lexical scanning of the input file
specified at construction time (but also see scannerbase.h(3flexc++) for
information about intermediate stream-switching facilities). It returns an
int representing the token associated with the matched regular
expression. The returned value 0 indicates end-of-file. Considering its
default implementation, it could be redefined by the user. Lex's default
implementation merely calls lex__:
inline int Scanner::lex()
{
return lex__();
}
Caveat: with interactive scanners the lex function is defined in
the generated lex.cc file. Once flexc++ has generated the scanner class
header file this scanner class header file isn't automatically rewritten by
flexc++. If, at some later stage, an interactive scanner must be generated, then
the inline lex implementation must be removed `by hand' from the scanner
class header file. Likewise, a lex member implementation (like the above)
must be provided `by hand' if a non-interactive scanner is required after
first having generated files implementing an interactive scanner.
7.4. PRIVATE MEMBER FUNCTIONS
- int lex__()
This function is used internally by lex and should not otherwise
be used.
- int executeAction__()
This function is used internally by lex and should not otherwise
be used.
- void preCode()
By default this function has an empty, inline implementation in
scanner.h. It can safely be replaced by a user-defined
implementation. This function is called by lex__, just before it starts to
match input characters against its rules: preCode is called by lex__
when lex__ is called and also after having executed the actions of a rule
which did not execute a return statement. The outline of lex__'s
implementation looks like this:
int Scanner::lex__()
{
...
preCode();
while (true)
{
size_t ch = get__(); // fetch next char
...
switch (actionType__(range)) // determine the action
{
... maybe return
}
... no return, continue scanning
preCode();
} // while
}
- void print()
When the --print-tokens or %print-tokens directive is used
this function is called to display, on the standard output stream,
the tokens returned and text matched by the scanner generated by
flexc++.
Displaying is suppressed when the lex.cc file is (re)generated
without using this directive. The function actually showing the
tokens (ScannerBase::print__) is called from print, which
is defined in-line in scanner.h. Calling
ScannerBase::print__, therefore, can also easily be controlled
by an option controlled by the program using the scanner object.
7.5. SCANNER CLASS HEADER EXAMPLE
#ifndef Scanner_H_INCLUDED_
#define Scanner_H_INCLUDED_
// $insert baseclass_h
#include "scannerbase.h"
class Scanner: public ScannerBase
{
public:
explicit Scanner(std::istream &in = std::cin,
std::ostream &out = std::cout);
Scanner(std::string const &infile, std::string const &outfile);
// $insert lexFunctionDecl
int lex();
private:
int lex__();
int executeAction__(size_t ruleNr);
void preCode(); // re-implement this function for code to be
// exec'ed before the pattern matching starts
};
inline void Scanner::preCode()
{
// optionally replace by your own code
}
inline Scanner::Scanner(std::istream &in, std::ostream &out)
:
ScannerBase(in, out)
{}
inline Scanner::Scanner(std::string const &infile,
std::string const &outfile)
:
ScannerBase(infile, outfile)
{}
// $insert inlineLexFunction
inline int Scanner::lex()
{
return lex__();
}
#endif // Scanner_H_INCLUDED_
8.1. THE SCANNER BASE CLASS()
By default, flexc++ generates a file scannerbase.h containing the
interface of the base class of the scanner class also generated by flexc++. The
name of the file that is generated can easily be changed using flexc++'s
--baseclass-header option. In this man-page we use the default name.
The file scanner.h is generated at each new flexc++ run. It contains no
user-serviceable or extensible parts. Rewriting can be prevented by specifying
flexc++'s --no-baseclass-header option.
8.2. PUBLIC ENUMS AND -TYPES
8.3. PROTECTED ENUMS AND -TYPES
- enum class ActionType__
This strongly typed enumeration is for internal use only.
- enum Leave__
This enumeration is for internal use only.
8.4. NO PUBLIC CONSTRUCTORS
There are no public constructors. ScannerBase is a base class for the
Scanner class generated by flexc++. ScannerBase only offers
protected constructors.
8.5. PUBLIC MEMBER FUNCTIONS
- bool debug() const
returns true if --debug or %debug was specified, otherwise
false.
- std::string const &filename() const
returns the name of the file currently processed by the scanner object.
- size_t length() const
returns the length of the text that was matched by lex. With
flex++ this function was called leng.
- size_t lineNr() const
returns the line number of the currently scanned line. This function is
always available (note: flex++ only offered a similar function
(called lineno) after using the %lineno option).
- std::string const &matched() const
returns the text matched by lex (note: flex++ offers a similar
member called YYText).
- void setDebug(bool onOff)
Switches on/off debugging output by providing the argument true
or false. Switching on debugging output only has visible effects
if the debug option was specified.
- void switchIstream(std::string const &infilename)
The currently processed input stream is closed, and processing
continues at the stream whose name is specified as the function's
argument. This is not a stack-operation: after processing
infilename processing does not return to the original stream.
This member is not available with interactive scanners.
- void switchOstream(std::ostream &out)
The currently processed output stream is closed, and
new output is written to out.
- void switchOstream(std::string const &outfilename)
The current output stream is closed, and output is written to
outfilename. If this file already exists, it is rewritten.
- void switchStreams(std::istream &in,
std::ostream &out = std::cout)
The currently processed input and output streams are closed, and
processing continues at in, writing output to out. This is
not a stack-operation: after processing in processing
does not return to the original stream.
This member is not available with interactive scanners.
- void switchStreams(std::string const &infilename,
std::string const &outfilename)
The currently processed input and output streams are closed, and
processing continues at the stream whose name is specified as the
function's first argument, writing output to the file whose name is
specified as the function's second argument. This latter file is
rewritten. This is not a stack-operation: after processing
infilename processing does not return to the original stream.
If outfilename == "-" then the standard output stream
is used as the scanner's output medium; if outfilename == "" then
the standard error stream is used as the scanner's output medium.
If outfilename == "-" then the standard output stream
is used as the scanner's output medium; if outfilename == "" then
the standard error stream is used as the scanner's output medium.
This member is not available with interactive scanners.
8.6. PROTECTED CONSTRUCTORS
8.7. PROTECTED MEMBER FUNCTIONS
All member functions ending in two underscore characters are for internal
use only and should not be called by user-defined members of the
Scanner class.
The following members, however, can safely be called by members of the
generated Scanner class:
- void accept(size_t nChars = 0)
accept(n) returns all but the first `nChars' characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. So, it matches `nChars' of
the characters in the input buffer, rescanning the rest. This function
effectively sets length's return value to nChars (note: with
flex++ this function was called less);
- void begin(StartCondition__ startCondition)
activate the regular expression rules associated with
StartCondition__ startCondition. As this enumeration is a strongly
typed enum the StartCondition__ scope must be specified as
well. E.g.,
begin(StartCondition__::INITIAL);
- void echo() const
The currently matched text (i.e., the text returned by the member
matched) is inserted into the scanner object's output stream;
- void leave(int retValue)
actions defined in the lexical scanner specification file may or may
not return. This frequently results in complicated or overlong
compound statements, blurring the readability of the specification
file. By encapsulating the actions in a member function readability is
enhanced. However, frequently a compound statement is still required,
as in:
regex-to-match {
if (int ret = memberFunction())
return ret;
}
The member leave removes the need for constructions like the
above. The member leave can be called from within member
functions encapsulating actions performed when a regular expression
has been matched. It ends lex, returning retValue to its
caller. The above rule can now be written like this:
regex-to-match memberFunction();
and memberFunction could be implemented as follows:
void memberFunction()
{
if (someCondition())
{ // any action, e.g.,
// switch mini-scanner
begin(StartCondition__::INITIAL);
leave(Parser::TOKENVALUE); // lex returns TOKENVALUE
// this point is never reached
}
pushStream(d_matched); // switch to the next stream
// lex continues
}
The member leave should only (indirectly) be called
(usually nested) from actions defined in the scanner's specification
s; calling leave outside of this context results in
undefined behavior.
- void more()
the matched text is kept and will be prefixed to the text that is
matched at the next lexical scan;
- std::ostream &out()
returns a reference to the scanner's output stream;
- bool popStream()
closes the currently processed input stream and continues to process
the most recently stacked input stream (removing it from the stack of
streams). If this switch was successfully performed true is
returned, otherwise (e.g., when the stream stack is empty) false
is returned;
- void push(size_t ch)
character ch is pushed back onto the input stream. I.e., it will be
the character that is retrieved at the next attempt to obtain a
character from the input stream;
- void push(std::string const &txt)
the characters in the string txt are pushed back onto the input
stream. I.e., they will be the characters that are retrieved at the
next attempt to obtain characters from the input stream. The
characters in txt are retrieved from the first character to the
last. So if txt == "hello" then the 'h' will be the character
that's retrieved next, followed by 'e', etc, until 'o';
- void pushStream(std::istream &curStream)
this function pushes curStream on the stream stack;
This member is not available with interactive scanners.
- void pushStream(std::string const &curName)
same, but the stream curName is opened first, and the resulting
istream is pushed on the stream stack;
This member is not available with interactive scanners.
- void redo(size_t nChars = 0)
this member acts like accept but its argument counts backward from
the end of the matched text. All but these nChars characters are
kept and the last nChar characters are rescanned. This function
effectively reduces length's return value by nChars;
- void setFilename(std::string const &name)
this function sets the name of the stream returned by filename to
name;
- void setMatched(std::string const &text)
this function stores text in the matched text buffer. Following a
call to this function matched returns text.
- StartCondition__ startCondition() const
returns the currently active start condition (mini scanner);
- std::vector<StreamStruct> const &streamStack() const
returns the vector of currently stacked input streams. The vector's
size equals 0 unless pushStream has been used. So flexc++'s input
file is not counted here. The StreamStruct is a struct only
having one accessible member: std::string const &pushedName, which
holds the name of the pushed stream. The vector is used internally
as a stack: the stream that was first pushed is found at index
position 0, the most recently pushed stream is found at
streamStack().back().
This member is not available with interactive scanners.
8.8. PROTECTED DATA MEMBERS
All protected data members are for internal use only, allowing lex__
to access them. All of them end in two underscore characters.
8.9. FLEX++ TO FLEXC++ MEMBERS
|
Flex++ (old) | | Flexc++ (new) |
|
lineno() | | lineNr() |
YYText() | | matched() |
less() | | accept() |
|
9.1 THE CLASS INPUT
Flexc++ generates a file scannerbase.h defining the scanner class's base
class, by default named ScannerBase (which is the name used in this
man-page). The base class ScannerBase contains a nested class Input
whose interface looks like this:
class Input
{
public:
Input();
Input(std::istream *iStream, size_t lineNr = 1);
size_t get();
size_t lineNr() const;
void reRead(size_t ch);
void reRead(std::string const &str, size_t fmIdx);
void close();
};
The members of this class are all required and offer a level in between
the operations of ScannerBase and flexc++'s actual input file that's being
processed.
By default, flexc++ provides an implementation for all of Input's
required members. Therefore, in most situations this man-page can safely be
ignored.
However, users may define and extend their own Input class and provide
flexc++'s base class with that Input class. To do so flexc++'s rules file must
contain the following two directives:
%input-implementation = "sourcefile"
%input-interface = "interface"
Here, interface is the name of a file containing the class Input's
interface. This interface is then inserted into ScannerBase's interface
instead of the default class Input's interface. This interface must at
least offer the above-mentioned members and constructors (their functions are
described below). The class may contain additional members if required by the
user-defined implementation. The implementation itself is expected in
sourcefile. The contents of this file are inserted in the generated
lex.cc file instead of Input's default implementation. The file
sourcefile should probably not have a .cc extension to prevent its
compilation by a program maintenance utility.
When the lexical scanner generated by flexc++ switches streams using the
//include directive (see the rules(3flexc++) man-page) the input
stream that's currently processed is pushed on an Input stack maintained
by ScannerBase, and processing continues at the file named at the
//include directive. Once the latter file has been processed, the
previously pushed stream is popped off the stack, and processing of the popped
stream continues. This implies that Input objects must be
`stack-able'. The required interface is designed to satisfy this requirement.
9.2. CONSTRUCTORS
9.3. REQUIRED PUBLIC MEMBER FUNCTIONS
- size_t get()
returns the next character to be processed by the lexical
scanner. Usually it will be the next character from the istream passed to
the Input class at construction time. It is never called by the
ScannerBase object for Input objects defined using Input's default
constructor. It should return 0x100 once istream's end-of-file has been
reached.
- size_t lineNr() const
should return the (1-based) number of the istream object passed to
the Input object. At construction time the istream has just been
opened and so at that point lineNr should return 1.
- void reRead(size_t ch)
if provided with a value smaller than 0x100 ch should be pushed
back onto the istream, where it becomes the character next to be
returned. Physically the character doesn't have to be pushed back. The default
implementation uses a deque onto which the character is pushed-front. Only
when this deque is exhausted characters are retrieved from the Input
object's istream.
- void reRead(std::string const &str, size_t fmIdx)
the characters in str from fmIdx until the string's final
character are pushed back onto the istream object so that the string's
first character is retrieved first and the string's last character is
retrieved last.
- void close()
the istream object initially passed to the Input object is
deleted by close, thereby not only freeing the stream's memory, but also
closing the stream if the stream in fact was an ifstream. Note that the
Input's destructor should not destroy the Input's istream
object.
FILES
Flexc++'s default skeleton files are in /usr/share/flexc++.
By default, flexc++ generates the following files:
- scanner.h: the header file containing the scanner class's
interface.
- scannerbase.h: the header file containing the interface of the
scanner class's base class.
- scanner.ih: the internal header file that is meant to be included
by the scanner class's source files (e.g., it is included by
lex.cc, see the next file), and that should contain all
declarations required for compiling the scanner class's sources.
- lex.cc: the source file implementing the scanner class member
function lex (and support functions), performing the lexical
scan.
SEE ALSO
bisonc++(1)
BUGS
- The priority of interval expressions ({...}) equals the priority
of other multiplicative operators (like *).
- All INITIAL rules apply to inclusive mini scanners, also those
INITIAL rules that were explicitly associated with the INITIAL mini
scanner.
ABOUT flexc++
Flexc++ was originally started as a programming project by Jean-Paul van
Oosten and Richard Berendsen in the 2007-2008 academic year. After graduating,
Richard left the project and moved to Amsterdam. Jean-Paul remained in
Groningen, and after on-and-off activities on the project, in close
cooperation with Frank B. Brokken, Frank undertook a rewrite of the project's
code around 2010. During the development of flexc++, the lookahead-operator
handling continuously threatened the completion of the project. By now, the
project has evolved to a level that we feel it's defensible to publish the
program, although we still tend to consider the program in its experimental
stage; it will remain that way until we decide to move its version from the
0.9x.xx series to the 1.xx.xx series.
COPYRIGHT
This is free software, distributed under the terms of the
GNU General Public License (GPL).
AUTHOR
Frank B. Brokken (f.b.brokken@rug.nl),
Jean-Paul van Oosten (j.p.van.oosten@rug.nl),
Richard Berendsen (richardberendsen@xs4all.nl) (until 2010).