src/usr.bin/lex/flex.1 - annotate

Annotation of src/usr.bin/lex/flex.1, Revision 1.14

1.14    ! tedu        1: .\"    $OpenBSD: flex.1,v 1.13 2003/06/04 17:34:44 millert Exp $
1.12      jmc         2: .\"
                      3: .\" Copyright (c) 1990 The Regents of the University of California.
                      4: .\" All rights reserved.
1.2       deraadt     5: .\"
1.12      jmc         6: .\" This code is derived from software contributed to Berkeley by
                      7: .\" Vern Paxson.
                      8: .\"
                      9: .\" The United States Government has rights in this work pursuant
                     10: .\" to contract no. DE-AC03-76SF00098 between the United States
                     11: .\" Department of Energy and the University of California.
                     12: .\"
                     13: .\" Redistribution and use in source and binary forms, with or without
1.13      millert    14: .\" modification, are permitted provided that the following conditions
                     15: .\" are met:
                     16: .\"
                     17: .\" 1. Redistributions of source code must retain the above copyright
                     18: .\"    notice, this list of conditions and the following disclaimer.
                     19: .\" 2. Redistributions in binary form must reproduce the above copyright
                     20: .\"    notice, this list of conditions and the following disclaimer in the
                     21: .\"    documentation and/or other materials provided with the distribution.
                     22: .\"
                     23: .\" Neither the name of the University nor the names of its contributors
                     24: .\" may be used to endorse or promote products derived from this software
                     25: .\" without specific prior written permission.
                     26: .\"
                     27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
                     28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
                     29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
                     30: .\" PURPOSE.
1.12      jmc        31: .\"
1.1       deraadt    32: .TH FLEX 1 "April 1995" "Version 2.5"
                     33: .SH NAME
                     34: flex \- fast lexical analyzer generator
                     35: .SH SYNOPSIS
                     36: .B flex
                     37: .B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
                     38: .B [\-\-help \-\-version]
                     39: .I [filename ...]
                     40: .SH OVERVIEW
                     41: This manual describes
                     42: .I flex,
                     43: a tool for generating programs that perform pattern-matching on text.  The
                     44: manual includes both tutorial and reference sections:
                     45: .nf
                     46:
                     47:     Description
                     48:         a brief overview of the tool
                     49:
                     50:     Some Simple Examples
                     51:
                     52:     Format Of The Input File
                     53:
                     54:     Patterns
                     55:         the extended regular expressions used by flex
                     56:
                     57:     How The Input Is Matched
                     58:         the rules for determining what has been matched
                     59:
                     60:     Actions
                     61:         how to specify what to do when a pattern is matched
                     62:
                     63:     The Generated Scanner
                     64:         details regarding the scanner that flex produces;
                     65:         how to control the input source
                     66:
                     67:     Start Conditions
                     68:         introducing context into your scanners, and
                     69:         managing "mini-scanners"
                     70:
                     71:     Multiple Input Buffers
                     72:         how to manipulate multiple input sources; how to
                     73:         scan from strings instead of files
                     74:
                     75:     End-of-file Rules
                     76:         special rules for matching the end of the input
                     77:
                     78:     Miscellaneous Macros
                     79:         a summary of macros available to the actions
                     80:
                     81:     Values Available To The User
                     82:         a summary of values available to the actions
                     83:
                     84:     Interfacing With Yacc
                     85:         connecting flex scanners together with yacc parsers
                     86:
                     87:     Options
                     88:         flex command-line options, and the "%option"
                     89:         directive
                     90:
                     91:     Performance Considerations
                     92:         how to make your scanner go as fast as possible
                     93:
                     94:     Generating C++ Scanners
                     95:         the (experimental) facility for generating C++
                     96:         scanner classes
                     97:
                     98:     Incompatibilities With Lex And POSIX
                     99:         how flex differs from AT&T lex and the POSIX lex
                    100:         standard
                    101:
                    102:     Diagnostics
                    103:         those error messages produced by flex (or scanners
                    104:         it generates) whose meanings might not be apparent
                    105:
                    106:     Files
                    107:         files used by flex
                    108:
                    109:     Deficiencies / Bugs
                    110:         known problems with flex
                    111:
                    112:     See Also
                    113:         other documentation, related tools
                    114:
                    115:     Author
                    116:         includes contact information
                    117:
                    118: .fi
                    119: .SH DESCRIPTION
                    120: .I flex
                    121: is a tool for generating
                    122: .I scanners:
1.9       millert   123: programs which recognize lexical patterns in text.
1.1       deraadt   124: .I flex
                    125: reads
                    126: the given input files, or its standard input if no file names are given,
                    127: for a description of a scanner to generate.  The description is in
                    128: the form of pairs
                    129: of regular expressions and C code, called
                    130: .I rules.  flex
                    131: generates as output a C source file,
                    132: .B lex.yy.c,
                    133: which defines a routine
                    134: .B yylex().
                    135: This file is compiled and linked with the
                    136: .B \-lfl
                    137: library to produce an executable.  When the executable is run,
                    138: it analyzes its input for occurrences
                    139: of the regular expressions.  Whenever it finds one, it executes
                    140: the corresponding C code.
                    141: .SH SOME SIMPLE EXAMPLES
                    142: .PP
                    143: First some simple examples to get the flavor of how one uses
                    144: .I flex.
                    145: The following
                    146: .I flex
                    147: input specifies a scanner which, whenever it encounters the string
                    148: "username", replaces it with the user's login name:
                    149: .nf
                    150:
                    151:     %%
                    152:     username    printf( "%s", getlogin() );
                    153:
                    154: .fi
                    155: By default, any text not matched by a
                    156: .I flex
                    157: scanner
                    158: is copied to the output, so the net effect of this scanner is
                    159: to copy its input file to its output with each occurrence
                    160: of "username" expanded.
                    161: In this input, there is just one rule.  "username" is the
                    162: .I pattern
                    163: and the "printf" is the
                    164: .I action.
                    165: The "%%" marks the beginning of the rules.
                    166: .PP
                    167: Here's another simple example:
                    168: .nf
                    169:
                    170:             int num_lines = 0, num_chars = 0;
                    171:
                    172:     %%
                    173:     \\n      ++num_lines; ++num_chars;
                    174:     .       ++num_chars;
                    175:
                    176:     %%
                    177:     main()
                    178:             {
                    179:             yylex();
                    180:             printf( "# of lines = %d, # of chars = %d\\n",
                    181:                     num_lines, num_chars );
                    182:             }
                    183:
                    184: .fi
                    185: This scanner counts the number of characters and the number
                    186: of lines in its input (it produces no output other than the
                    187: final report on the counts).  The first line
                    188: declares two globals, "num_lines" and "num_chars", which are accessible
                    189: both inside
                    190: .B yylex()
                    191: and in the
                    192: .B main()
                    193: routine declared after the second "%%".  There are two rules, one
                    194: which matches a newline ("\\n") and increments both the line count and
                    195: the character count, and one which matches any character other than
                    196: a newline (indicated by the "." regular expression).
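For comparison, the same two counts can be computed by a hand-written C loop.
This is only a sketch of what the two rules do, not the code that flex
generates; the names "struct counts" and "count_text" are hypothetical and
do not appear anywhere in flex.

```c
#include <stddef.h>

/* Hypothetical helper type holding the two counters from the example. */
struct counts {
    int num_lines;
    int num_chars;
};

/*
 * Mirror the two rules: a newline bumps both counters, any other
 * character bumps only num_chars.
 */
struct counts count_text(const char *text, size_t len)
{
    struct counts c = { 0, 0 };
    size_t i;

    for (i = 0; i < len; i++) {
        if (text[i] == '\n')
            ++c.num_lines;      /* first rule: \n */
        ++c.num_chars;          /* both rules count the character */
    }
    return c;
}
```

Calling count_text() on the six bytes of "ab\ncd\n" gives two lines and
six characters, which is what the scanner above would report for that input.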
                    197: .PP
                    198: A somewhat more complicated example:
                    199: .nf
                    200:
                    201:     /* scanner for a toy Pascal-like language */
                    202:
                    203:     %{
                    204:     /* need this for the calls to atoi() and atof() below */
                    205:     #include <stdlib.h>
                    206:     %}
                    207:
                    208:     DIGIT    [0-9]
                    209:     ID       [a-z][a-z0-9]*
                    210:
                    211:     %%
                    212:
                    213:     {DIGIT}+    {
                    214:                 printf( "An integer: %s (%d)\\n", yytext,
                    215:                         atoi( yytext ) );
                    216:                 }
                    217:
                    218:     {DIGIT}+"."{DIGIT}*        {
                    219:                 printf( "A float: %s (%g)\\n", yytext,
                    220:                         atof( yytext ) );
                    221:                 }
                    222:
                    223:     if|then|begin|end|procedure|function        {
                    224:                 printf( "A keyword: %s\\n", yytext );
                    225:                 }
                    226:
                    227:     {ID}        printf( "An identifier: %s\\n", yytext );
                    228:
                    229:     "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
                    230:
                    231:     "{"[^}\\n]*"}"     /* eat up one-line comments */
                    232:
                    233:     [ \\t\\n]+          /* eat up whitespace */
                    234:
                    235:     .           printf( "Unrecognized character: %s\\n", yytext );
                    236:
                    237:     %%
                    238:
                    239:     main( argc, argv )
                    240:     int argc;
                    241:     char **argv;
                    242:         {
                    243:         ++argv, --argc;  /* skip over program name */
                    244:         if ( argc > 0 )
                    245:                 yyin = fopen( argv[0], "r" );
                    246:         else
                    247:                 yyin = stdin;
1.7       aaron     248:
1.1       deraadt   249:         yylex();
                    250:         }
                    251:
                    252: .fi
                    253: This is the beginning of a simple scanner for a language like
                    254: Pascal.  It identifies different types of
                    255: .I tokens
                    256: and reports on what it has seen.
                    257: .PP
                    258: The details of this example will be explained in the following
                    259: sections.
                    260: .SH FORMAT OF THE INPUT FILE
                    261: The
                    262: .I flex
                    263: input file consists of three sections, separated by a line with just
                    264: .B %%
                    265: in it:
                    266: .nf
                    267:
                    268:     definitions
                    269:     %%
                    270:     rules
                    271:     %%
                    272:     user code
                    273:
                    274: .fi
                    275: The
                    276: .I definitions
                    277: section contains declarations of simple
                    278: .I name
                    279: definitions to simplify the scanner specification, and declarations of
                    280: .I start conditions,
                    281: which are explained in a later section.
                    282: .PP
                    283: Name definitions have the form:
                    284: .nf
                    285:
                    286:     name definition
                    287:
                    288: .fi
                    289: The "name" is a word beginning with a letter or an underscore ('_')
                    290: followed by zero or more letters, digits, '_', or '-' (dash).
1.8       aaron     291: The definition is taken to begin at the first non-whitespace character
1.1       deraadt   292: following the name and continuing to the end of the line.
                    293: The definition can subsequently be referred to using "{name}", which
                    294: will expand to "(definition)".  For example,
                    295: .nf
                    296:
                    297:     DIGIT    [0-9]
                    298:     ID       [a-z][a-z0-9]*
                    299:
                    300: .fi
                    301: defines "DIGIT" to be a regular expression which matches a
                    302: single digit, and
                    303: "ID" to be a regular expression which matches a letter
                    304: followed by zero-or-more letters-or-digits.
                    305: A subsequent reference to
                    306: .nf
                    307:
                    308:     {DIGIT}+"."{DIGIT}*
                    309:
                    310: .fi
                    311: is identical to
                    312: .nf
                    313:
                    314:     ([0-9])+"."([0-9])*
                    315:
                    316: .fi
                    317: and matches one-or-more digits followed by a '.' followed
                    318: by zero-or-more digits.
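The substitution described above can be sketched as a plain text rewrite in C.
This is an illustration under simplifying assumptions, not how flex itself
is implemented: the names "struct definition" and "expand_names" are
hypothetical, and the sketch does not treat quoted strings or character
classes specially.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry: one line of the definitions section. */
struct definition {
    const char *name;
    const char *text;
};

/*
 * Copy pattern to out, replacing each "{name}" that is found in defs
 * with "(definition)".  Everything else, including braces that do not
 * name a definition (such as "{2,5}"), is copied unchanged.
 * Returns 0 on success, -1 if out is too small.
 */
int expand_names(const char *pattern, const struct definition *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *repl = NULL;
        size_t i;

        if (close != NULL) {
            size_t namelen = (size_t)(close - pattern) - 1;

            for (i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == namelen &&
                    strncmp(defs[i].name, pattern + 1, namelen) == 0) {
                    repl = defs[i].text;
                    break;
                }
            }
        }

        if (repl != NULL) {
            /* Wrap the definition in parentheses, as the manual states. */
            int n = snprintf(out + used, outsize - used, "(%s)", repl);

            if (n < 0 || (size_t)n >= outsize - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *pattern++;
            out[used] = '\0';
        }
    }
    return 0;
}
```

The parentheses matter because they keep an operator that follows the
reference, such as '+' or '*', applied to the whole definition rather than
to its last element.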
                    319: .PP
                    320: The
                    321: .I rules
                    322: section of the
                    323: .I flex
                    324: input contains a series of rules of the form:
                    325: .nf
                    326:
                    327:     pattern   action
                    328:
                    329: .fi
                    330: where the pattern must be unindented and the action must begin
                    331: on the same line.
                    332: .PP
                    333: See below for a further description of patterns and actions.
                    334: .PP
                    335: Finally, the user code section is simply copied to
                    336: .B lex.yy.c
                    337: verbatim.
                    338: It is used for companion routines which call or are called
                    339: by the scanner.  The presence of this section is optional;
                    340: if it is missing, the second
                    341: .B %%
                    342: in the input file may be skipped, too.
                    343: .PP
                    344: In the definitions and rules sections, any
                    345: .I indented
                    346: text or text enclosed in
                    347: .B %{
                    348: and
                    349: .B %}
                    350: is copied verbatim to the output (with the %{}'s removed).
                    351: The %{}'s must appear unindented on lines by themselves.
                    352: .PP
                    353: In the rules section,
                    354: any indented or %{} text appearing before the
                    355: first rule may be used to declare variables
                    356: which are local to the scanning routine and (after the declarations)
                    357: code which is to be executed whenever the scanning routine is entered.
                    358: Other indented or %{} text in the rules section is still copied to the output,
                    359: but its meaning is not well-defined and it may well cause compile-time
                    360: errors (this feature is present for
                    361: .I POSIX
                    362: compliance; see below for other such features).
                    363: .PP
                    364: In the definitions section (but not in the rules section),
                    365: an unindented comment (i.e., a line
                    366: beginning with "/*") is also copied verbatim to the output up
                    367: to the next "*/".
                    368: .SH PATTERNS
                    369: The patterns in the input are written using an extended set of regular
                    370: expressions.  These are:
                    371: .nf
                    372:
                    373:     x          match the character 'x'
                    374:     .          any character (byte) except newline
                    375:     [xyz]      a "character class"; in this case, the pattern
                    376:                  matches either an 'x', a 'y', or a 'z'
                    377:     [abj-oZ]   a "character class" with a range in it; matches
                    378:                  an 'a', a 'b', any letter from 'j' through 'o',
                    379:                  or a 'Z'
                    380:     [^A-Z]     a "negated character class", i.e., any character
                    381:                  but those in the class.  In this case, any
                    382:                  character EXCEPT an uppercase letter.
                    383:     [^A-Z\\n]   any character EXCEPT an uppercase letter or
                    384:                  a newline
                    385:     r*         zero or more r's, where r is any regular expression
                    386:     r+         one or more r's
                    387:     r?         zero or one r's (that is, "an optional r")
                    388:     r{2,5}     anywhere from two to five r's
                    389:     r{2,}      two or more r's
                    390:     r{4}       exactly 4 r's
                    391:     {name}     the expansion of the "name" definition
                    392:                (see above)
                    393:     "[xyz]\\"foo"
                    394:                the literal string: [xyz]"foo
                    395:     \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                    396:                  then the ANSI-C interpretation of \\x.
                    397:                  Otherwise, a literal 'X' (used to escape
                    398:                  operators such as '*')
                    399:     \\0         a NUL character (ASCII code 0)
                    400:     \\123       the character with octal value 123
                    401:     \\x2a       the character with hexadecimal value 2a
                    402:     (r)        match an r; parentheses are used to override
                    403:                  precedence (see below)
                    404:
                    405:
                    406:     rs         the regular expression r followed by the
                    407:                  regular expression s; called "concatenation"
                    408:
                    409:
                    410:     r|s        either an r or an s
                    411:
                    412:
                    413:     r/s        an r but only if it is followed by an s.  The
                    414:                  text matched by s is included when determining
                    415:                  whether this rule is the "longest match",
                    416:                  but is then returned to the input before
                    417:                  the action is executed.  So the action only
                    418:                  sees the text matched by r.  This type
                    419:                  of pattern is called "trailing context".
                    420:                  (There are some combinations of r/s that flex
                    421:                  cannot match correctly; see notes in the
                    422:                  Deficiencies / Bugs section below regarding
                    423:                  "dangerous trailing context".)
                    424:     ^r         an r, but only at the beginning of a line (i.e.,
1.10      deraadt   425:                  just starting to scan, or right after a
1.1       deraadt   426:                  newline has been scanned).
                    427:     r$         an r, but only at the end of a line (i.e., just
                    428:                  before a newline).  Equivalent to "r/\\n".
                    429:
                    430:                Note that flex's notion of "newline" is exactly
                    431:                whatever the C compiler used to compile flex
                    432:                interprets '\\n' as; in particular, on some DOS
                    433:                systems you must either filter out \\r's in the
                    434:                input yourself, or explicitly use r/\\r\\n for "r$".
                    435:
                    436:
                    437:     <s>r       an r, but only in start condition s (see
                    438:                  below for discussion of start conditions)
                    439:     <s1,s2,s3>r
                    440:                same, but in any of start conditions s1,
                    441:                  s2, or s3
                    442:     <*>r       an r in any start condition, even an exclusive one.
                    443:
                    444:
                    445:     <<EOF>>    an end-of-file
                    446:     <s1,s2><<EOF>>
                    447:                an end-of-file when in start condition s1 or s2
                    448:
                    449: .fi
                    450: Note that inside of a character class, all regular expression operators
                    451: lose their special meaning except escape ('\\') and the character class
                    452: operators, '-', ']', and, at the beginning of the class, '^'.
                    453: .PP
                    454: The regular expressions listed above are grouped according to
                    455: precedence, from highest precedence at the top to lowest at the bottom.
                    456: Those grouped together have equal precedence.  For example,
                    457: .nf
                    458:
                    459:     foo|bar*
                    460:
                    461: .fi
                    462: is the same as
                    463: .nf
                    464:
                    465:     (foo)|(ba(r*))
                    466:
                    467: .fi
                    468: since the '*' operator has higher precedence than concatenation,
                    469: and concatenation higher than alternation ('|').  This pattern
                    470: therefore matches
                    471: .I either
                    472: the string "foo"
                    473: .I or
                    474: the string "ba" followed by zero-or-more r's.
                    475: To match "foo" or zero-or-more "bar"'s, use:
                    476: .nf
                    477:
                    478:     foo|(bar)*
                    479:
                    480: .fi
                    481: and to match zero-or-more "foo"'s-or-"bar"'s:
                    482: .nf
                    483:
                    484:     (foo|bar)*
                    485:
                    486: .fi
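The three groupings above can be checked with a small C helper. Note the
assumption: the sketch uses the POSIX regcomp() and regexec() routines with
extended regular expressions as a stand-in, not flex itself. For these
particular patterns the precedence of '*', concatenation and '|' is the same
in both. The helper name "matches_whole" is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if text matches the POSIX extended regular expression ere,
 * 0 if it does not, and -1 if ere fails to compile.  The caller wraps
 * the pattern as "^(...)$" so that the whole of text must match.
 */
int matches_whole(const char *ere, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, ere, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, "^(foo|bar*)$" accepts "barrr" but rejects "barbar",
"^(foo|(bar)*)$" accepts "barbar" but rejects "foofoo", and
"^((foo|bar)*)$" accepts "foobarfoo".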
                    487: .PP
                    488: In addition to characters and ranges of characters, character classes
                    489: can also contain character class
                    490: .I expressions.
                    491: These are expressions enclosed inside
                    492: .B [:
                    493: and
                    494: .B :]
                    495: delimiters (which themselves must appear between the '[' and ']' of the
                    496: character class; other elements may occur inside the character class, too).
                    497: The valid expressions are:
                    498: .nf
                    499:
                    500:     [:alnum:] [:alpha:] [:blank:]
                    501:     [:cntrl:] [:digit:] [:graph:]
                    502:     [:lower:] [:print:] [:punct:]
                    503:     [:space:] [:upper:] [:xdigit:]
                    504:
                    505: .fi
                    506: These expressions all designate a set of characters equivalent to
                    507: the corresponding standard C
                    508: .B isXXX
                    509: function.  For example,
                    510: .B [:alnum:]
                    511: designates those characters for which
                    512: .B isalnum()
                    513: returns true - i.e., any alphabetic or numeric character.
                    514: Some systems don't provide
                    515: .B isblank(),
                    516: so flex defines
                    517: .B [:blank:]
                    518: as a space or a tab.
                    519: .PP
                    520: For example, the following character classes are all equivalent:
                    521: .nf
                    522:
                    523:     [[:alnum:]]
1.4       deraadt   524:     [[:alpha:][:digit:]]
1.1       deraadt   525:     [[:alpha:]0-9]
                    526:     [a-zA-Z0-9]
                    527:
                    528: .fi
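The equivalence of these classes can be sketched with the standard C
classification functions. The function names below are hypothetical, and
the comparison with the explicit ranges assumes the default "C" locale and
an ASCII character set.

```c
#include <ctype.h>

/* [[:alnum:]]: characters for which isalnum() is true. */
int in_alnum(int c)
{
    return isalnum(c) != 0;
}

/* [[:alpha:][:digit:]]: the union of two class expressions. */
int in_alpha_or_digit(int c)
{
    return isalpha(c) != 0 || isdigit(c) != 0;
}

/* [a-zA-Z0-9]: explicit ranges (assumes ASCII and the "C" locale). */
int in_ranges(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}
```

Under those assumptions the three functions agree for every character code
from 0 through 127.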
                    529: If your scanner is case-insensitive (the
                    530: .B \-i
                    531: flag), then
                    532: .B [:upper:]
                    533: and
                    534: .B [:lower:]
                    535: are equivalent to
                    536: .B [:alpha:].
                    537: .PP
                    538: Some notes on patterns:
                    539: .IP -
                    540: A negated character class such as the example "[^A-Z]"
                    541: above
                    542: .I will match a newline
                    543: unless "\\n" (or an equivalent escape sequence) is one of the
                    544: characters explicitly present in the negated character class
                    545: (e.g., "[^A-Z\\n]").  This is unlike how many other regular
                    546: expression tools treat negated character classes, but unfortunately
                    547: the inconsistency is historically entrenched.
                    548: Matching newlines means that a pattern like [^"]* can match the entire
                    549: input unless there's another quote in the input.
                    550: .IP -
                    551: A rule can have at most one instance of trailing context (the '/' operator
                    552: or the '$' operator).  The start condition, '^', and "<<EOF>>" patterns
                    553: can only occur at the beginning of a pattern and, like '/' and '$',
                    554: cannot be grouped inside parentheses.  A '^' which does not occur at
                    555: the beginning of a rule or a '$' which does not occur at the end of
                    556: a rule loses its special properties and is treated as a normal character.
                    557: .IP
                    558: The following are illegal:
                    559: .nf
                    560:
                    561:     foo/bar$
                    562:     <sc1>foo<sc2>bar
                    563:
                    564: .fi
                    565: Note that the first of these can be written "foo/bar\\n".
                    566: .IP
                    567: The following will result in '$' or '^' being treated as a normal character:
                    568: .nf
                    569:
                    570:     foo|(bar$)
                    571:     foo|^bar
                    572:
                    573: .fi
                    574: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
                    575: could be used (the special '|' action is explained below):
                    576: .nf
                    577:
                    578:     foo      |
                    579:     bar$     /* action goes here */
                    580:
                    581: .fi
                    582: A similar trick will work for matching a foo or a
                    583: bar-at-the-beginning-of-a-line.
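Returning to the first note above, on negated character classes: the effect
of leaving the newline out of the class can be sketched with the standard C
function strcspn(). The two helper names are hypothetical, and the sketch
only measures how far such a pattern would extend from the start of a
string; it is not flex code.

```c
#include <string.h>

/*
 * Length of the text that [^"]* would match at the start of s:
 * every character up to, but not including, the first double quote.
 * Newlines are not excluded, so the span runs across line ends.
 */
size_t span_not_quote(const char *s)
{
    return strcspn(s, "\"");
}

/* The same for [^"\n]*: stop at a double quote or at a newline. */
size_t span_not_quote_or_newline(const char *s)
{
    return strcspn(s, "\"\n");
}
```

For a string consisting of "ab", a newline, "cd" and then a double quote,
the first function returns 5 (it runs across the newline) and the second
returns 2.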
                    584: .SH HOW THE INPUT IS MATCHED
                    585: When the generated scanner is run, it analyzes its input looking
                    586: for strings which match any of its patterns.  If it finds more than
                    587: one match, it takes the one matching the most text (for trailing
                    588: context rules, this includes the length of the trailing part, even
                    589: though it will then be returned to the input).  If it finds two
                    590: or more matches of the same length, the
                    591: rule listed first in the
                    592: .I flex
                    593: input file is chosen.
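The selection rule above (longest match wins, and ties go to the rule listed first) can be sketched in plain C.
This is only an illustration under simplifying assumptions: the patterns are literal strings rather than regular expressions, and the function name pick_rule is hypothetical and not part of flex, which uses generated tables instead of a loop like this.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustration only: choose among literal patterns the way the text
 * describes.  Returns the index of the winning pattern, or -1 if no
 * pattern matches at the start of input.
 */
static int
pick_rule(const char *const patterns[], size_t npatterns, const char *input)
{
	int best = -1;
	size_t best_len = 0;
	size_t i;

	for (i = 0; i < npatterns; i++) {
		size_t len = strlen(patterns[i]);

		if (len == 0 || strncmp(input, patterns[i], len) != 0)
			continue;	/* pattern does not match here */
		/* Strictly longer wins; on a tie the earlier rule is kept. */
		if (len > best_len) {
			best = (int)i;
			best_len = len;
		}
	}
	return best;
}
```

Because the comparison is strictly "longer than", a later pattern of the same length never displaces an earlier one, which mirrors the first-listed tie-break.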
                    594: .PP
                    595: Once the match is determined, the text corresponding to the match
                    596: (called the
                    597: .I token)
                    598: is made available in the global character pointer
                    599: .B yytext,
                    600: and its length in the global integer
                    601: .B yyleng.
                    602: The
                    603: .I action
                    604: corresponding to the matched pattern is then executed (a more
                    605: detailed description of actions follows), and then the remaining
                    606: input is scanned for another match.
                    607: .PP
                    608: If no match is found, then the
                    609: .I default rule
                    610: is executed: the next character in the input is considered matched and
                    611: copied to the standard output.  Thus, the simplest legal
                    612: .I flex
                    613: input is:
                    614: .nf
                    615:
                    616:     %%
                    617:
                    618: .fi
                    619: which generates a scanner that simply copies its input (one character
                    620: at a time) to its output.
                    621: .PP
                    622: Note that
                    623: .B yytext
                    624: can be defined in two different ways: either as a character
                    625: .I pointer
                    626: or as a character
                    627: .I array.
                    628: You can control which definition
                    629: .I flex
                    630: uses by including one of the special directives
                    631: .B %pointer
                    632: or
                    633: .B %array
                    634: in the first (definitions) section of your flex input.  The default is
                    635: .B %pointer,
                    636: unless you use the
                    637: .B -l
                    638: lex compatibility option, in which case
                    639: .B yytext
                    640: will be an array.
                    641: The advantage of using
                    642: .B %pointer
                    643: is substantially faster scanning and no buffer overflow when matching
                    644: very large tokens (unless you run out of dynamic memory).  The disadvantage
                    645: is that you are restricted in how your actions can modify
                    646: .B yytext
                    647: (see the next section), and calls to the
                    648: .B unput()
1.10      deraadt   649: function destroy the present contents of
1.1       deraadt   650: .B yytext,
                    651: which can be a considerable porting headache when moving between different
                    652: .I lex
                    653: versions.
                    654: .PP
                    655: The advantage of
                    656: .B %array
                    657: is that you can then modify
                    658: .B yytext
                    659: to your heart's content, and calls to
                    660: .B unput()
                    661: do not destroy
                    662: .B yytext
                    663: (see below).  Furthermore, existing
                    664: .I lex
                    665: programs sometimes access
                    666: .B yytext
                    667: externally using declarations of the form:
                    668: .nf
                    669:     extern char yytext[];
                    670: .fi
                    671: This definition is erroneous when used with
                    672: .B %pointer,
                    673: but correct for
                    674: .B %array.
                    675: .PP
                    676: .B %array
                    677: defines
                    678: .B yytext
                    679: to be an array of
                    680: .B YYLMAX
                    681: characters, which defaults to a fairly large value.  You can change
                    682: the size by simply #define'ing
                    683: .B YYLMAX
                    684: to a different value in the first section of your
                    685: .I flex
                    686: input.  As mentioned above, with
                    687: .B %pointer
                    688: yytext grows dynamically to accommodate large tokens.  While this means your
                    689: .B %pointer
                    690: scanner can accommodate very large tokens (such as matching entire blocks
                    691: of comments), bear in mind that each time the scanner must resize
                    692: .B yytext
                    693: it also must rescan the entire token from the beginning, so matching such
                    694: tokens can prove slow.
                    695: .B yytext
                    696: presently does
                    697: .I not
                    698: dynamically grow if a call to
                    699: .B unput()
                    700: results in too much text being pushed back; instead, a run-time error results.
                    701: .PP
                    702: Also note that you cannot use
                    703: .B %array
                    704: with C++ scanner classes
                    705: (the
                    706: .B c++
                    707: option; see below).
                    708: .SH ACTIONS
                    709: Each pattern in a rule has a corresponding action, which can be any
                    710: arbitrary C statement.  The pattern ends at the first non-escaped
                    711: whitespace character; the remainder of the line is its action.  If the
                    712: action is empty, then when the pattern is matched the input token
                    713: is simply discarded.  For example, here is the specification for a program
                    714: which deletes all occurrences of "zap me" from its input:
                    715: .nf
                    716:
                    717:     %%
                    718:     "zap me"
                    719:
                    720: .fi
                    721: (It will copy all other characters in the input to the output since
                    722: they will be matched by the default rule.)
                    723: .PP
                    724: Here is a program which compresses multiple blanks and tabs down to
                    725: a single blank, and throws away whitespace found at the end of a line:
                    726: .nf
                    727:
                    728:     %%
                    729:     [ \\t]+        putchar( ' ' );
                    730:     [ \\t]+$       /* ignore this token */
                    731:
                    732: .fi
                    733: .PP
                    734: If the action contains a '{', then the action extends until the balancing '}'
                    735: is found, and the action may cross multiple lines.
1.7       aaron     736: .I flex
1.1       deraadt   737: knows about C strings and comments and won't be fooled by braces found
                    738: within them, but also allows actions to begin with
                    739: .B %{
                    740: and will consider the action to be all the text up to the next
                    741: .B %}
                    742: (regardless of ordinary braces inside the action).
                    743: .PP
                    744: An action consisting solely of a vertical bar ('|') means "same as
                    745: the action for the next rule."  See below for an illustration.
                    746: .PP
                    747: Actions can include arbitrary C code, including
                    748: .B return
                    749: statements to return a value to whatever routine called
                    750: .B yylex().
                    751: Each time
                    752: .B yylex()
                    753: is called it continues processing tokens from where it last left
                    754: off until it either reaches
                    755: the end of the file or executes a return.
                    756: .PP
                    757: Actions are free to modify
                    758: .B yytext
                    759: except for lengthening it (adding
                    760: characters to its end--these will overwrite later characters in the
                    761: input stream).  This however does not apply when using
                    762: .B %array
                    763: (see above); in that case,
                    764: .B yytext
                    765: may be freely modified in any way.
                    766: .PP
                    767: Actions are free to modify
                    768: .B yyleng
                    769: except they should not do so if the action also includes use of
                    770: .B yymore()
                    771: (see below).
                    772: .PP
                    773: There are a number of special directives which can be included within
                    774: an action:
                    775: .IP -
                    776: .B ECHO
                    777: copies yytext to the scanner's output.
                    778: .IP -
                    779: .B BEGIN
                    780: followed by the name of a start condition places the scanner in the
                    781: corresponding start condition (see below).
                    782: .IP -
                    783: .B REJECT
                    784: directs the scanner to proceed on to the "second best" rule which matched the
                    785: input (or a prefix of the input).  The rule is chosen as described
                    786: above in "How the Input is Matched", and
                    787: .B yytext
                    788: and
                    789: .B yyleng
                    790: set up appropriately.
                    791: It may either be one which matched as much text
                    792: as the originally chosen rule but came later in the
                    793: .I flex
                    794: input file, or one which matched less text.
                    795: For example, the following will both count the
                    796: words in the input and call the routine special() whenever "frob" is seen:
                    797: .nf
                    798:
                    799:             int word_count = 0;
                    800:     %%
                    801:
                    802:     frob        special(); REJECT;
                    803:     [^ \\t\\n]+   ++word_count;
                    804:
                    805: .fi
                    806: Without the
                    807: .B REJECT,
                    808: any occurrences of "frob" in the input would not be counted as words, since the
                    809: scanner normally executes only one action per token.
                    810: Multiple
                    811: .B REJECT's
                    812: are allowed, each one finding the next best choice to the currently
                    813: active rule.  For example, when the following scanner scans the token
                    814: "abcd", it will write "abcdabcaba" to the output:
                    815: .nf
                    816:
                    817:     %%
                    818:     a        |
                    819:     ab       |
                    820:     abc      |
                    821:     abcd     ECHO; REJECT;
                    822:     .|\\n     /* eat up any unmatched character */
                    823:
                    824: .fi
                    825: (The first three rules share the fourth's action since they use
                    826: the special '|' action.)
                    827: .B REJECT
                    828: is a particularly expensive feature in terms of scanner performance;
                    829: if it is used in
                    830: .I any
                    831: of the scanner's actions it will slow down
                    832: .I all
                    833: of the scanner's matching.  Furthermore,
                    834: .B REJECT
                    835: cannot be used with the
                    836: .I -Cf
                    837: or
                    838: .I -CF
                    839: options (see below).
                    840: .IP
                    841: Note also that unlike the other special actions,
                    842: .B REJECT
                    843: is a
                    844: .I branch;
                    845: code immediately following it in the action will
                    846: .I not
                    847: be executed.
                    848: .IP -
                    849: .B yymore()
                    850: tells the scanner that the next time it matches a rule, the corresponding
                    851: token should be
                    852: .I appended
                    853: onto the current value of
                    854: .B yytext
                    855: rather than replacing it.  For example, given the input "mega-kludge"
                    856: the following will write "mega-mega-kludge" to the output:
                    857: .nf
                    858:
                    859:     %%
                    860:     mega-    ECHO; yymore();
                    861:     kludge   ECHO;
                    862:
                    863: .fi
                    864: First "mega-" is matched and echoed to the output.  Then "kludge"
                    865: is matched, but the previous "mega-" is still hanging around at the
                    866: beginning of
                    867: .B yytext
                    868: so the
                    869: .B ECHO
                    870: for the "kludge" rule will actually write "mega-kludge".
                    871: .PP
                    872: Two notes regarding use of
                    873: .B yymore().
                    874: First,
                    875: .B yymore()
                    876: depends on the value of
                    877: .I yyleng
                    878: correctly reflecting the size of the current token, so you must not
                    879: modify
                    880: .I yyleng
                    881: if you are using
                    882: .B yymore().
                    883: Second, the presence of
                    884: .B yymore()
                    885: in the scanner's action entails a minor performance penalty in the
                    886: scanner's matching speed.
                    887: .IP -
                    888: .B yyless(n)
                    889: returns all but the first
                    890: .I n
                    891: characters of the current token back to the input stream, where they
                    892: will be rescanned when the scanner looks for the next match.
                    893: .B yytext
                    894: and
                    895: .B yyleng
                    896: are adjusted appropriately (e.g.,
                    897: .B yyleng
                    898: will now be equal to
                    899: .I n
                    900: ).  For example, on the input "foobar" the following will write out
                    901: "foobarbar":
                    902: .nf
                    903:
                    904:     %%
                    905:     foobar    ECHO; yyless(3);
                    906:     [a-z]+    ECHO;
                    907:
                    908: .fi
                    909: An argument of 0 to
                    910: .B yyless
                    911: will cause the entire current input string to be scanned again.  Unless you've
                    912: changed how the scanner will subsequently process its input (using
                    913: .B BEGIN,
                    914: for example), this will result in an endless loop.
                    915: .PP
                    916: Note that
                    917: .B yyless
                    918: is a macro and can only be used in the flex input file, not from
                    919: other source files.
                    920: .IP -
                    921: .B unput(c)
                    922: puts the character
                    923: .I c
                    924: back onto the input stream.  It will be the next character scanned.
                    925: The following action will take the current token and cause it
                    926: to be rescanned enclosed in parentheses.
                    927: .nf
                    928:
                    929:     {
                    930:     int i;
1.14    ! tedu      931:     char *yycopy;
        !           932:
1.1       deraadt   933:     /* Copy yytext because unput() trashes yytext */
1.14    ! tedu      934:     if ((yycopy = strdup( yytext )) == NULL)
        !           935:         err(1, NULL);
1.1       deraadt   936:     unput( ')' );
                    937:     for ( i = yyleng - 1; i >= 0; --i )
                    938:         unput( yycopy[i] );
                    939:     unput( '(' );
                    940:     free( yycopy );
                    941:     }
                    942:
                    943: .fi
                    944: Note that since each
                    945: .B unput()
                    946: puts the given character back at the
                    947: .I beginning
                    948: of the input stream, pushing back strings must be done back-to-front.
                    949: .PP
                    950: An important potential problem when using
                    951: .B unput()
                    952: is that if you are using
                    953: .B %pointer
                    954: (the default), a call to
                    955: .B unput()
                    956: .I destroys
                    957: the contents of
                    958: .I yytext,
                    959: starting with its rightmost character and devouring one character to
                    960: the left with each call.  If you need the value of yytext preserved
                    961: after a call to
                    962: .B unput()
                    963: (as in the above example),
                    964: you must either first copy it elsewhere, or build your scanner using
                    965: .B %array
                    966: instead (see How The Input Is Matched).
                    967: .PP
                    968: Finally, note that you cannot put back
                    969: .B EOF
                    970: to attempt to mark the input stream with an end-of-file.
                    971: .IP -
                    972: .B input()
                    973: reads the next character from the input stream.  For example,
                    974: the following is one way to eat up C comments:
                    975: .nf
                    976:
                    977:     %%
                    978:     "/*"        {
                    979:                 register int c;
                    980:
                    981:                 for ( ; ; )
                    982:                     {
                    983:                     while ( (c = input()) != '*' &&
                    984:                             c != EOF )
                    985:                         ;    /* eat up text of comment */
                    986:
                    987:                     if ( c == '*' )
                    988:                         {
                    989:                         while ( (c = input()) == '*' )
                    990:                             ;
                    991:                         if ( c == '/' )
                    992:                             break;    /* found the end */
                    993:                         }
                    994:
                    995:                     if ( c == EOF )
                    996:                         {
                    997:                         error( "EOF in comment" );
                    998:                         break;
                    999:                         }
                   1000:                     }
                   1001:                 }
                   1002:
                   1003: .fi
                   1004: (Note that if the scanner is compiled using
                   1005: .B C++,
                   1006: then
                   1007: .B input()
                   1008: is instead referred to as
                   1009: .B yyinput(),
                   1010: in order to avoid a name clash with the
                   1011: .B C++
                   1012: stream by the name of
                   1013: .I input.)
                   1014: .IP -
                   1015: .B YY_FLUSH_BUFFER
                   1016: flushes the scanner's internal buffer
                   1017: so that the next time the scanner attempts to match a token, it will
                   1018: first refill the buffer using
                   1019: .B YY_INPUT
                   1020: (see The Generated Scanner, below).  This action is a special case
                   1021: of the more general
                   1022: .B yy_flush_buffer()
                   1023: function, described below in the section Multiple Input Buffers.
                   1024: .IP -
                   1025: .B yyterminate()
                   1026: can be used in lieu of a return statement in an action.  It terminates
                   1027: the scanner and returns a 0 to the scanner's caller, indicating "all done".
                   1028: By default,
                   1029: .B yyterminate()
                   1030: is also called when an end-of-file is encountered.  It is a macro and
                   1031: may be redefined.
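The back-to-front rule for unput() described above can be seen with a small stand-alone model of a pushback buffer.
This is a sketch, not flex's implementation: the type struct pushback, the functions pb_unput(), pb_input() and pb_unput_string(), and the PUSHBACK_MAX limit are all hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64		/* hypothetical capacity for this sketch */

struct pushback {
	char buf[PUSHBACK_MAX];
	size_t len;		/* number of pushed-back characters */
};

/* Put one character back; it becomes the next character read. */
static int
pb_unput(struct pushback *pb, char c)
{
	if (pb->len >= PUSHBACK_MAX)
		return -1;	/* no room left */
	pb->buf[pb->len++] = c;
	return 0;
}

/* Read the next character, or -1 when nothing is pushed back. */
static int
pb_input(struct pushback *pb)
{
	if (pb->len == 0)
		return -1;
	return (unsigned char)pb->buf[--pb->len];
}

/* Push back a whole string so that it is re-read in its original order. */
static int
pb_unput_string(struct pushback *pb, const char *s)
{
	size_t i = strlen(s);

	while (i > 0)
		if (pb_unput(pb, s[--i]) != 0)
			return -1;
	return 0;
}
```

Since each pushed-back character is read before any character pushed earlier, pb_unput_string() walks the string from its last character to its first, just as the parenthesizing action above does with yycopy.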
                   1032: .SH THE GENERATED SCANNER
                   1033: The output of
                   1034: .I flex
                   1035: is the file
                   1036: .B lex.yy.c,
                   1037: which contains the scanning routine
                   1038: .B yylex(),
                   1039: a number of tables used by it for matching tokens, and a number
                   1040: of auxiliary routines and macros.  By default,
                   1041: .B yylex()
                   1042: is declared as follows:
                   1043: .nf
                   1044:
                   1045:     int yylex()
                   1046:         {
                   1047:         ... various definitions and the actions in here ...
                   1048:         }
                   1049:
                   1050: .fi
                   1051: (If your environment supports function prototypes, then it will
                   1052: be "int yylex( void )".)  This definition may be changed by defining
                   1053: the "YY_DECL" macro.  For example, you could use:
                   1054: .nf
                   1055:
                   1056:     #define YY_DECL float lexscan( a, b ) float a, b;
                   1057:
                   1058: .fi
                   1059: to give the scanning routine the name
                   1060: .I lexscan,
                   1061: returning a float, and taking two floats as arguments.  Note that
                   1062: if you give arguments to the scanning routine using a
                   1063: K&R-style/non-prototyped function declaration, you must terminate
                   1064: the definition with a semi-colon (;).
                   1065: .PP
                   1066: Whenever
                   1067: .B yylex()
                   1068: is called, it scans tokens from the global input file
                   1069: .I yyin
                   1070: (which defaults to stdin).  It continues until it either reaches
                   1071: an end-of-file (at which point it returns the value 0) or
                   1072: one of its actions executes a
                   1073: .I return
                   1074: statement.
                   1075: .PP
                   1076: If the scanner reaches an end-of-file, the behavior of subsequent calls is undefined
                   1077: unless either
                   1078: .I yyin
                   1079: is pointed at a new input file (in which case scanning continues from
                   1080: that file), or
                   1081: .B yyrestart()
                   1082: is called.
                   1083: .B yyrestart()
                   1084: takes one argument, a
                   1085: .B FILE *
                   1086: pointer (which can be NULL, if you've set up
                   1087: .B YY_INPUT
                   1088: to scan from a source other than
                   1089: .I yyin),
                   1090: and initializes
                   1091: .I yyin
                   1092: for scanning from that file.  Essentially there is no difference between
                   1093: just assigning
                   1094: .I yyin
                   1095: to a new input file or using
                   1096: .B yyrestart()
                   1097: to do so; the latter is available for compatibility with previous versions
                   1098: of
                   1099: .I flex,
                   1100: and because it can be used to switch input files in the middle of scanning.
                   1101: It can also be used to throw away the current input buffer, by calling
                   1102: it with an argument of
                   1103: .I yyin;
                   1104: but it is better to use
                   1105: .B YY_FLUSH_BUFFER
                   1106: (see above).
                   1107: Note that
                   1108: .B yyrestart()
                   1109: does
                   1110: .I not
                   1111: reset the start condition to
                   1112: .B INITIAL
                   1113: (see Start Conditions, below).
                   1114: .PP
                   1115: If
                   1116: .B yylex()
                   1117: stops scanning due to executing a
                   1118: .I return
                   1119: statement in one of the actions, the scanner may then be called again and it
                   1120: will resume scanning where it left off.
                   1121: .PP
                   1122: By default (and for purposes of efficiency), the scanner uses
                   1123: block-reads rather than simple
                   1124: .I getc()
                   1125: calls to read characters from
                   1126: .I yyin.
                   1127: The nature of how it gets its input can be controlled by defining the
                   1128: .B YY_INPUT
                   1129: macro.
                   1130: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".  Its
                   1131: action is to place up to
                   1132: .I max_size
                   1133: characters in the character array
                   1134: .I buf
                   1135: and return in the integer variable
                   1136: .I result
                   1137: either the
                   1138: number of characters read or the constant YY_NULL (0 on Unix systems)
                   1139: to indicate EOF.  The default YY_INPUT reads from the
                   1140: global file-pointer "yyin".
                   1141: .PP
                   1142: A sample definition of YY_INPUT (in the definitions
                   1143: section of the input file):
                   1144: .nf
                   1145:
                   1146:     %{
                   1147:     #define YY_INPUT(buf,result,max_size) \\
                   1148:         { \\
                   1149:         int c = getchar(); \\
                   1150:         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
                   1151:         }
                   1152:     %}
                   1153:
                   1154: .fi
                   1155: This definition will change the input processing to occur
                   1156: one character at a time.
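As a further illustration of the YY_INPUT contract (fill buf with at most max_size characters, then store the count or YY_NULL in result), here is a sketch that reads in blocks from a string held in memory.
The variables my_src and my_src_pos are hypothetical names, not part of flex, and the fallback definition of YY_NULL is there only so the fragment compiles outside a generated scanner.
For real use, the yy_scan_string() family mentioned in this section is the supported way to scan in-memory text.

```c
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0	/* flex normally supplies this */
#endif

/* Hypothetical in-memory source; these names are not part of flex. */
static const char *my_src = "";
static size_t my_src_pos = 0;

/*
 * Copy up to max_size bytes of the remaining source into buf and
 * store the count in result, or YY_NULL at end of input.
 */
#define YY_INPUT(buf, result, max_size) \
	do { \
		size_t my_left = strlen(my_src + my_src_pos); \
		size_t my_n = my_left < (size_t)(max_size) ? \
		    my_left : (size_t)(max_size); \
		memcpy((buf), my_src + my_src_pos, my_n); \
		my_src_pos += my_n; \
		(result) = my_n ? (int)my_n : YY_NULL; \
	} while (0)
```

In a flex input file the macro would go between %{ and %} in the definitions section, like the getchar() example above.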
                   1157: .PP
                   1158: When the scanner receives an end-of-file indication from YY_INPUT,
                   1159: it then checks the
                   1160: .B yywrap()
                   1161: function.  If
                   1162: .B yywrap()
                   1163: returns false (zero), then it is assumed that the
                   1164: function has gone ahead and set up
                   1165: .I yyin
                   1166: to point to another input file, and scanning continues.  If it returns
                   1167: true (non-zero), then the scanner terminates, returning 0 to its
                   1168: caller.  Note that in either case, the start condition remains unchanged;
                   1169: it does
                   1170: .I not
                   1171: revert to
                   1172: .B INITIAL.
                   1173: .PP
                   1174: If you do not supply your own version of
                   1175: .B yywrap(),
                   1176: then you must either use
                   1177: .B %option noyywrap
                   1178: (in which case the scanner behaves as though
                   1179: .B yywrap()
                   1180: returned 1), or you must link with
                   1181: .B \-lfl
                   1182: to obtain the default version of the routine, which always returns 1.
                   1183: .PP
                   1184: Three routines are available for scanning from in-memory buffers rather
                   1185: than files:
                   1186: .B yy_scan_string(), yy_scan_bytes(),
                   1187: and
                   1188: .B yy_scan_buffer().
                   1189: See the discussion of them below in the section Multiple Input Buffers.
                   1190: .PP
                   1191: The scanner writes its
                   1192: .B ECHO
                   1193: output to the
                   1194: .I yyout
                   1195: global (by default, stdout), which may be redefined by the user simply
                   1196: by assigning it to some other
                   1197: .B FILE
                   1198: pointer.
                   1199: .SH START CONDITIONS
                   1200: .I flex
                   1201: provides a mechanism for conditionally activating rules.  Any rule
                   1202: whose pattern is prefixed with "<sc>" will only be active when
                   1203: the scanner is in the start condition named "sc".  For example,
                   1204: .nf
                   1205:
                   1206:     <STRING>[^"]*        { /* eat up the string body ... */
                   1207:                 ...
                   1208:                 }
                   1209:
                   1210: .fi
                   1211: will be active only when the scanner is in the "STRING" start
                   1212: condition, and
                   1213: .nf
                   1214:
                   1215:     <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                   1216:                 ...
                   1217:                 }
                   1218:
                   1219: .fi
                   1220: will be active only when the current start condition is
                   1221: either "INITIAL", "STRING", or "QUOTE".
                   1222: .PP
                   1223: Start conditions
                   1224: are declared in the definitions (first) section of the input
                   1225: using unindented lines beginning with either
                   1226: .B %s
                   1227: or
                   1228: .B %x
                   1229: followed by a list of names.
                   1230: The former declares
                   1231: .I inclusive
                   1232: start conditions, the latter
                   1233: .I exclusive
                   1234: start conditions.  A start condition is activated using the
                   1235: .B BEGIN
                   1236: action.  Until the next
                   1237: .B BEGIN
                   1238: action is executed, rules with the given start
                   1239: condition will be active and
                   1240: rules with other start conditions will be inactive.
                   1241: If the start condition is
                   1242: .I inclusive,
                   1243: then rules with no start conditions at all will also be active.
                   1244: If it is
                   1245: .I exclusive,
                   1246: then
                   1247: .I only
                   1248: rules qualified with the start condition will be active.
                   1249: A set of rules contingent on the same exclusive start condition
                   1250: describe a scanner which is independent of any of the other rules in the
                   1251: .I flex
                   1252: input.  Because of this,
                   1253: exclusive start conditions make it easy to specify "mini-scanners"
                   1254: which scan portions of the input that are syntactically different
                   1255: from the rest (e.g., comments).
                   1256: .PP
                   1257: If the distinction between inclusive and exclusive start conditions
                   1258: is still a little vague, here's a simple example illustrating the
                   1259: connection between the two.  The set of rules:
                   1260: .nf
                   1261:
                   1262:     %s example
                   1263:     %%
                   1264:
                   1265:     <example>foo   do_something();
                   1266:
                   1267:     bar            something_else();
                   1268:
                   1269: .fi
                   1270: is equivalent to
                   1271: .nf
                   1272:
                   1273:     %x example
                   1274:     %%
                   1275:
                   1276:     <example>foo   do_something();
                   1277:
                   1278:     <INITIAL,example>bar    something_else();
                   1279:
                   1280: .fi
                   1281: Without the
                   1282: .B <INITIAL,example>
                   1283: qualifier, the
                   1284: .I bar
                   1285: pattern in the second example wouldn't be active (i.e., couldn't match)
                   1286: when in start condition
                   1287: .B example.
                   1288: If we just used
                   1289: .B <example>
                   1290: to qualify
                   1291: .I bar,
                   1292: though, then it would only be active in
                   1293: .B example
                   1294: and not in
                   1295: .B INITIAL,
                   1296: while in the first example it's active in both, because in the first
                   1297: example the
                   1298: .B example
1.10      deraadt  1299: start condition is an
1.1       deraadt  1300: .I inclusive
                   1301: .B (%s)
                   1302: start condition.
                   1303: .PP
                   1304: Also note that the special start-condition specifier
                   1305: .B <*>
                   1306: matches every start condition.  Thus, the above example could also
                   1307: have been written:
                   1308: .nf
                   1309:
                   1310:     %x example
                   1311:     %%
                   1312:
                   1313:     <example>foo   do_something();
                   1314:
                   1315:     <*>bar    something_else();
                   1316:
                   1317: .fi
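.PP
The activation rules described above can be restated as a small predicate.
This is only a sketch of the semantics, not how flex implements them: the
bitmask encoding, the SC_ names, and rule_is_active are all made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: one bit per start condition. */
#define SC_INITIAL 0x1u
#define SC_EXAMPLE 0x2u
#define SC_ALL     (~0u)	/* stands for the <*> specifier */

/*
 * rule_scs:  start conditions named in the rule's <...> prefix,
 *            0 if the rule has no prefix, SC_ALL for <*>.
 * current:   the single bit of the condition the scanner is in.
 * inclusive: the set of conditions that were declared with %s.
 */
bool
rule_is_active(unsigned rule_scs, unsigned current, unsigned inclusive)
{
	/* The scanner is always in exactly one start condition. */
	assert(current != 0 && (current & (current - 1)) == 0);

	if (rule_scs != 0)
		return (rule_scs & current) != 0;

	/* Unqualified rule: active in INITIAL and in every %s condition. */
	return current == SC_INITIAL || (inclusive & current) != 0;
}
```

With "example" declared inclusive, the unqualified bar rule is active while
in that condition; with it declared exclusive, the rule is active there only
if it is qualified with the condition or with the catch-all specifier.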
                   1318: .PP
                   1319: The default rule (to
                   1320: .B ECHO
                   1321: any unmatched character) remains active in start conditions.  It
                   1322: is equivalent to:
                   1323: .nf
                   1324:
                   1325:     <*>.|\\n     ECHO;
                   1326:
                   1327: .fi
                   1328: .PP
                   1329: .B BEGIN(0)
                   1330: returns to the original state where only the rules with
                   1331: no start conditions are active.  This state can also be
                   1332: referred to as the start-condition "INITIAL", so
                   1333: .B BEGIN(INITIAL)
                   1334: is equivalent to
                   1335: .B BEGIN(0).
                   1336: (The parentheses around the start condition name are not required but
                   1337: are considered good style.)
                   1338: .PP
                   1339: .B BEGIN
                   1340: actions can also be given as indented code at the beginning
                   1341: of the rules section.  For example, the following will cause
                   1342: the scanner to enter the "SPECIAL" start condition whenever
                   1343: .B yylex()
                   1344: is called and the global variable
                   1345: .I enter_special
                   1346: is true:
                   1347: .nf
                   1348:
                   1349:             int enter_special;
                   1350:
                   1351:     %x SPECIAL
                   1352:     %%
                   1353:             if ( enter_special )
                   1354:                 BEGIN(SPECIAL);
                   1355:
                   1356:     <SPECIAL>blahblahblah
                   1357:     ...more rules follow...
                   1358:
                   1359: .fi
                   1360: .PP
                   1361: To illustrate the uses of start conditions,
                   1362: here is a scanner which provides two different interpretations
                   1363: of a string like "123.456".  By default it will treat it as
                   1364: three tokens, the integer "123", a dot ('.'), and the integer "456".
                   1365: But if the string is preceded earlier in the line by the string
                   1366: "expect-floats"
                   1367: it will treat it as a single token, the floating-point number
                   1368: 123.456:
                   1369: .nf
                   1370:
                   1371:     %{
                   1372:     #include <stdlib.h>
                   1373:     %}
                   1374:     %s expect
                   1375:
                   1376:     %%
                   1377:     expect-floats        BEGIN(expect);
                   1378:
                   1379:     <expect>[0-9]+"."[0-9]+      {
                   1380:                 printf( "found a float, = %f\\n",
                   1381:                         atof( yytext ) );
                   1382:                 }
                   1383:     <expect>\\n           {
                   1384:                 /* that's the end of the line, so
                   1385:                  * we need another "expect-floats"
                   1386:                  * before we'll recognize any more
                   1387:                  * numbers
                   1388:                  */
                   1389:                 BEGIN(INITIAL);
                   1390:                 }
                   1391:
                   1392:     [0-9]+      {
                   1393:                 printf( "found an integer, = %d\\n",
                   1394:                         atoi( yytext ) );
                   1395:                 }
                   1396:
                   1397:     "."         printf( "found a dot\\n" );
                   1398:
                   1399: .fi
                   1400: Here is a scanner which recognizes (and discards) C comments while
                   1401: maintaining a count of the current input line.
                   1402: .nf
                   1403:
                   1404:     %x comment
                   1405:     %%
                   1406:             int line_num = 1;
                   1407:
                   1408:     "/*"         BEGIN(comment);
                   1409:
                   1410:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1411:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1412:     <comment>\\n             ++line_num;
                   1413:     <comment>"*"+"/"        BEGIN(INITIAL);
                   1414:
                   1415: .fi
                   1416: This scanner goes to a bit of trouble to match as much
                   1417: text as possible with each rule.  In general, when attempting to write
1.10      deraadt  1418: a high-speed scanner try to match as much as possible in each rule, as
1.1       deraadt  1419: it's a big win.
                   1420: .PP
1.10      deraadt  1421: Note that start-condition names are really integer values and
1.1       deraadt  1422: can be stored as such.  Thus, the above could be extended in the
                   1423: following fashion:
                   1424: .nf
                   1425:
                   1426:     %x comment foo
                   1427:     %%
                   1428:             int line_num = 1;
                   1429:             int comment_caller;
                   1430:
                   1431:     "/*"         {
                   1432:                  comment_caller = INITIAL;
                   1433:                  BEGIN(comment);
                   1434:                  }
                   1435:
                   1436:     ...
                   1437:
                   1438:     <foo>"/*"    {
                   1439:                  comment_caller = foo;
                   1440:                  BEGIN(comment);
                   1441:                  }
                   1442:
                   1443:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1444:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1445:     <comment>\\n             ++line_num;
                   1446:     <comment>"*"+"/"        BEGIN(comment_caller);
                   1447:
                   1448: .fi
                   1449: Furthermore, you can access the current start condition using
                   1450: the integer-valued
                   1451: .B YY_START
                   1452: macro.  For example, the above assignments to
                   1453: .I comment_caller
                   1454: could instead be written
                   1455: .nf
                   1456:
                   1457:     comment_caller = YY_START;
                   1458:
                   1459: .fi
                   1460: Flex provides
                   1461: .B YYSTATE
                   1462: as an alias for
                   1463: .B YY_START
                   1464: (since that is what's used by AT&T
                   1465: .I lex).
                   1466: .PP
                   1467: Note that start conditions do not have their own name-space; %s's and %x's
                   1468: declare names in the same fashion as #define's.
                   1469: .PP
                   1470: Finally, here's an example of how to match C-style quoted strings using
                   1471: exclusive start conditions, including expanded escape sequences (but
                   1472: not including checking for a string that's too long):
                   1473: .nf
                   1474:
                   1475:     %x str
                   1476:
                   1477:     %%
                   1478:             char string_buf[MAX_STR_CONST];
                   1479:             char *string_buf_ptr;
                   1480:
                   1481:
                   1482:     \\"      string_buf_ptr = string_buf; BEGIN(str);
                   1483:
                   1484:     <str>\\"        { /* saw closing quote - all done */
                   1485:             BEGIN(INITIAL);
                   1486:             *string_buf_ptr = '\\0';
                   1487:             /* return string constant token type and
                   1488:              * value to parser
                   1489:              */
                   1490:             }
                   1491:
                   1492:     <str>\\n        {
                   1493:             /* error - unterminated string constant */
                   1494:             /* generate error message */
                   1495:             }
                   1496:
                   1497:     <str>\\\\[0-7]{1,3} {
                   1498:             /* octal escape sequence */
                   1499:             unsigned int result;
                   1500:
                   1501:             (void) sscanf( yytext + 1, "%o", &result );
                   1502:
                   1503:             if ( result > 0xff )
                   1504:                     ; /* error, constant is out-of-bounds */
                   1505:
                   1506:             *string_buf_ptr++ = result;
                   1507:             }
                   1508:
                   1509:     <str>\\\\[0-9]+ {
                   1510:             /* generate error - bad escape sequence; something
                   1511:              * like '\\48' or '\\0777777'
                   1512:              */
                   1513:             }
                   1514:
                   1515:     <str>\\\\n  *string_buf_ptr++ = '\\n';
                   1516:     <str>\\\\t  *string_buf_ptr++ = '\\t';
                   1517:     <str>\\\\r  *string_buf_ptr++ = '\\r';
                   1518:     <str>\\\\b  *string_buf_ptr++ = '\\b';
                   1519:     <str>\\\\f  *string_buf_ptr++ = '\\f';
                   1520:
                   1521:     <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];
                   1522:
                   1523:     <str>[^\\\\\\n\\"]+        {
                   1524:             char *yptr = yytext;
                   1525:
                   1526:             while ( *yptr )
                   1527:                     *string_buf_ptr++ = *yptr++;
                   1528:             }
                   1529:
                   1530: .fi
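.PP
The octal-escape step can also be tried outside a scanner.
The following is a minimal sketch in plain C, assuming the matched text is a
backslash followed by octal digits; decode_octal_escape is a hypothetical
helper name, and returning -1 is just one way to report the out-of-bounds
case that the rule above leaves as a comment.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Decode an escape made of a backslash and one to three octal digits.
 * Returns 0 and stores the byte value in *out on success, or -1 if the
 * text is not such an escape or the value does not fit in one byte.
 */
int
decode_octal_escape(const char *text, unsigned char *out)
{
	unsigned int result;

	assert(text != NULL && out != NULL);

	if (text[0] != '\\')
		return -1;

	/* Skip the backslash; %o stores into an unsigned int. */
	if (sscanf(text + 1, "%o", &result) != 1)
		return -1;

	if (result > 0xff)
		return -1;	/* three digits can reach 0777 == 511 */

	*out = (unsigned char)result;
	return 0;
}
```

A rule action could call such a helper on yytext and append the byte to the
string buffer only when it returns 0.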
                   1531: .PP
                   1532: Often, such as in some of the examples above, you wind up writing a
                   1533: whole bunch of rules all preceded by the same start condition(s).  Flex
                   1534: makes this a little easier and cleaner by introducing a notion of
                   1535: start condition
                   1536: .I scope.
                   1537: A start condition scope is begun with:
                   1538: .nf
                   1539:
                   1540:     <SCs>{
                   1541:
                   1542: .fi
                   1543: where
                   1544: .I SCs
                   1545: is a list of one or more start conditions.  Inside the start condition
                   1546: scope, every rule automatically has the prefix
                   1547: .I <SCs>
                   1548: applied to it, until a
                   1549: .I '}'
                   1550: which matches the initial
                   1551: .I '{'.
                   1552: So, for example,
                   1553: .nf
                   1554:
                   1555:     <ESC>{
                   1556:         "\\\\n"   return '\\n';
                   1557:         "\\\\r"   return '\\r';
                   1558:         "\\\\f"   return '\\f';
                   1559:         "\\\\0"   return '\\0';
                   1560:     }
                   1561:
                   1562: .fi
                   1563: is equivalent to:
                   1564: .nf
                   1565:
                   1566:     <ESC>"\\\\n"  return '\\n';
                   1567:     <ESC>"\\\\r"  return '\\r';
                   1568:     <ESC>"\\\\f"  return '\\f';
                   1569:     <ESC>"\\\\0"  return '\\0';
                   1570:
                   1571: .fi
                   1572: Start condition scopes may be nested.
                   1573: .PP
                   1574: Three routines are available for manipulating stacks of start conditions:
                   1575: .TP
                   1576: .B void yy_push_state(int new_state)
                   1577: pushes the current start condition onto the top of the start condition
                   1578: stack and switches to
                   1579: .I new_state
                   1580: as though you had used
                   1581: .B BEGIN new_state
                   1582: (recall that start condition names are also integers).
                   1583: .TP
                   1584: .B void yy_pop_state()
                   1585: pops the top of the stack and switches to it via
                   1586: .B BEGIN.
                   1587: .TP
                   1588: .B int yy_top_state()
                   1589: returns the top of the stack without altering the stack's contents.
                   1590: .PP
                   1591: The start condition stack grows dynamically and so has no built-in
                   1592: size limitation.  If memory is exhausted, program execution aborts.
                   1593: .PP
                   1594: To use start condition stacks, your scanner must include a
                   1595: .B %option stack
                   1596: directive (see Options below).
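.PP
The behavior of the three stack routines can be modeled in a few lines of C.
This is a sketch of the semantics described above, not the code flex
generates; struct sc_stack and the sc_ function names are invented for the
illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the start condition stack. */
struct sc_stack {
	int	 current;	/* active start condition, as set by BEGIN */
	int	*saved;		/* saved[depth - 1] is the top of the stack */
	size_t	 depth;
	size_t	 cap;
};

/* Like yy_push_state(): save the current condition, then switch. */
void
sc_push(struct sc_stack *s, int new_state)
{
	if (s->depth == s->cap) {
		size_t ncap = s->cap ? s->cap * 2 : 8;
		int *p = realloc(s->saved, ncap * sizeof(*p));

		if (p == NULL)
			abort();	/* out of memory: execution aborts */
		s->saved = p;
		s->cap = ncap;
	}
	s->saved[s->depth++] = s->current;
	s->current = new_state;
}

/* Like yy_pop_state(): switch to the condition saved on top. */
void
sc_pop(struct sc_stack *s)
{
	assert(s->depth > 0);
	s->current = s->saved[--s->depth];
}

/* Like yy_top_state(): report the top without changing the stack. */
int
sc_top(const struct sc_stack *s)
{
	assert(s->depth > 0);
	return s->saved[s->depth - 1];
}
```

Starting from a zeroed structure, two pushes followed by two pops return the
model to its original condition, which is the pattern a scanner uses to enter
and leave nested constructs.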
                   1597: .SH MULTIPLE INPUT BUFFERS
                   1598: Some scanners (such as those which support "include" files)
                   1599: require reading from several input streams.  As
                   1600: .I flex
                   1601: scanners do a large amount of buffering, one cannot control
                   1602: where the next input will be read from by simply writing a
                   1603: .B YY_INPUT
                   1604: which is sensitive to the scanning context.
                   1605: .B YY_INPUT
                   1606: is only called when the scanner reaches the end of its buffer, which
                   1607: may be a long time after scanning a statement such as an "include"
                   1608: which requires switching the input source.
                   1609: .PP
                   1610: To negotiate these sorts of problems,
                   1611: .I flex
                   1612: provides a mechanism for creating and switching between multiple
                   1613: input buffers.  An input buffer is created by using:
                   1614: .nf
                   1615:
                   1616:     YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
                   1617:
                   1618: .fi
                   1619: which takes a
                   1620: .I FILE
                   1621: pointer and a size and creates a buffer associated with the given
                   1622: file and large enough to hold
                   1623: .I size
                   1624: characters (when in doubt, use
                   1625: .B YY_BUF_SIZE
                   1626: for the size).  It returns a
                   1627: .B YY_BUFFER_STATE
                   1628: handle, which may then be passed to other routines (see below).  The
                   1629: .B YY_BUFFER_STATE
                   1630: type is a pointer to an opaque
                   1631: .B struct yy_buffer_state
                   1632: structure, so you may safely initialize YY_BUFFER_STATE variables to
                   1633: .B ((YY_BUFFER_STATE) 0)
                   1634: if you wish, and also refer to the opaque structure in order to
                   1635: correctly declare input buffers in source files other than that
                   1636: of your scanner.  Note that the
                   1637: .I FILE
                   1638: pointer in the call to
                   1639: .B yy_create_buffer
                   1640: is only used as the value of
                   1641: .I yyin
                   1642: seen by
                   1643: .B YY_INPUT;
                   1644: if you redefine
                   1645: .B YY_INPUT
                   1646: so it no longer uses
                   1647: .I yyin,
                   1648: then you can safely pass a nil
                   1649: .I FILE
                   1650: pointer to
                   1651: .B yy_create_buffer.
                   1652: You select a particular buffer to scan from using:
                   1653: .nf
                   1654:
                   1655:     void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
                   1656:
                   1657: .fi
                   1658: This switches the scanner's input buffer so subsequent tokens will
                   1659: come from
                   1660: .I new_buffer.
                   1661: Note that
                   1662: .B yy_switch_to_buffer()
                   1663: may be used by yywrap() to set things up for continued scanning, instead
                   1664: of opening a new file and pointing
                   1665: .I yyin
                   1666: at it.  Note also that switching input sources via either
                   1667: .B yy_switch_to_buffer()
                   1668: or
                   1669: .B yywrap()
                   1670: does
                   1671: .I not
                   1672: change the start condition.
                   1673: .nf
                   1674:
                   1675:     void yy_delete_buffer( YY_BUFFER_STATE buffer )
                   1676:
                   1677: .fi
                   1678: is used to reclaim the storage associated with a buffer.  (
                   1679: .B buffer
                   1680: can be nil, in which case the routine does nothing.)
                   1681: You can also clear the current contents of a buffer using:
                   1682: .nf
                   1683:
                   1684:     void yy_flush_buffer( YY_BUFFER_STATE buffer )
                   1685:
                   1686: .fi
                   1687: This function discards the buffer's contents,
                   1688: so the next time the scanner attempts to match a token from the
                   1689: buffer, it will first fill the buffer anew using
                   1690: .B YY_INPUT.
                   1691: .PP
                   1692: .B yy_new_buffer()
                   1693: is an alias for
                   1694: .B yy_create_buffer(),
                   1695: provided for compatibility with the C++ use of
                   1696: .I new
                   1697: and
                   1698: .I delete
                   1699: for creating and destroying dynamic objects.
                   1700: .PP
                   1701: Finally, the
                   1702: .B YY_CURRENT_BUFFER
                   1703: macro returns a
                   1704: .B YY_BUFFER_STATE
                   1705: handle to the current buffer.
                   1706: .PP
                   1707: Here is an example of using these features for writing a scanner
                   1708: which expands include files (the
                   1709: .B <<EOF>>
                   1710: feature is discussed below):
                   1711: .nf
                   1712:
                   1713:     /* the "incl" state is used for picking up the name
                   1714:      * of an include file
                   1715:      */
                   1716:     %x incl
                   1717:
                   1718:     %{
                   1719:     #define MAX_INCLUDE_DEPTH 10
                   1720:     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
                   1721:     int include_stack_ptr = 0;
                   1722:     %}
                   1723:
                   1724:     %%
                   1725:     include             BEGIN(incl);
                   1726:
                   1727:     [a-z]+              ECHO;
                   1728:     [^a-z\\n]*\\n?        ECHO;
                   1729:
                   1730:     <incl>[ \\t]*      /* eat the whitespace */
                   1731:     <incl>[^ \\t\\n]+   { /* got the include file name */
                   1732:             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                   1733:                 {
                   1734:                 fprintf( stderr, "Includes nested too deeply\\n" );
                   1735:                 exit( 1 );
                   1736:                 }
                   1737:
                   1738:             include_stack[include_stack_ptr++] =
                   1739:                 YY_CURRENT_BUFFER;
                   1740:
                   1741:             yyin = fopen( yytext, "r" );
                   1742:
                   1743:             if ( ! yyin )
                   1744:                 error( ... );
                   1745:
                   1746:             yy_switch_to_buffer(
                   1747:                 yy_create_buffer( yyin, YY_BUF_SIZE ) );
                   1748:
                   1749:             BEGIN(INITIAL);
                   1750:             }
                   1751:
                   1752:     <<EOF>> {
                   1753:             if ( --include_stack_ptr < 0 )
                   1754:                 {
                   1755:                 yyterminate();
                   1756:                 }
                   1757:
                   1758:             else
                   1759:                 {
                   1760:                 yy_delete_buffer( YY_CURRENT_BUFFER );
                   1761:                 yy_switch_to_buffer(
                   1762:                      include_stack[include_stack_ptr] );
                   1763:                 }
                   1764:             }
                   1765:
                   1766: .fi
                   1767: Three routines are available for setting up input buffers for
                   1768: scanning in-memory strings instead of files.  All of them create
                   1769: a new input buffer for scanning the string, and return a corresponding
                   1770: .B YY_BUFFER_STATE
                   1771: handle (which you should delete with
                   1772: .B yy_delete_buffer()
                   1773: when done with it).  They also switch to the new buffer using
                   1774: .B yy_switch_to_buffer(),
                   1775: so the next call to
                   1776: .B yylex()
                   1777: will start scanning the string.
                   1778: .TP
                   1779: .B yy_scan_string(const char *str)
                   1780: scans a NUL-terminated string.
                   1781: .TP
                   1782: .B yy_scan_bytes(const char *bytes, int len)
                   1783: scans
                   1784: .I len
                   1785: bytes (including possibly NUL's)
                   1786: starting at location
                   1787: .I bytes.
                   1788: .PP
                   1789: Note that both of these functions create and scan a
                   1790: .I copy
                   1791: of the string or bytes.  (This may be desirable, since
                   1792: .B yylex()
                   1793: modifies the contents of the buffer it is scanning.)  You can avoid the
                   1794: copy by using:
                   1795: .TP
                   1796: .B yy_scan_buffer(char *base, yy_size_t size)
                   1797: which scans in place the buffer starting at
                   1798: .I base,
                   1799: consisting of
                   1800: .I size
                   1801: bytes, the last two bytes of which
                   1802: .I must
                   1803: be
                   1804: .B YY_END_OF_BUFFER_CHAR
                   1805: (ASCII NUL).
                   1806: These last two bytes are not scanned; thus, scanning
                   1807: consists of
                   1808: .B base[0]
                   1809: through
                   1810: .B base[size-3],
                   1811: inclusive.
                   1812: .IP
                   1813: If you fail to set up
                   1814: .I base
                   1815: in this manner (i.e., forget the final two
                   1816: .B YY_END_OF_BUFFER_CHAR
                   1817: bytes), then
                   1818: .B yy_scan_buffer()
                   1819: returns a nil pointer instead of creating a new input buffer.
                   1820: .IP
                   1821: The type
                   1822: .B yy_size_t
                   1823: is an integral type to which you can cast an integer expression
                   1824: reflecting the size of the buffer.
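.PP
Getting the two trailing bytes right is the easy part to forget.
As a sketch, the helper below builds a buffer with the layout described
above: the data, followed by two NUL bytes.
make_scan_buffer is a hypothetical name and is not part of flex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes from src into a newly allocated buffer that ends in
 * two NUL bytes.  On success returns the buffer and stores its total
 * size (len + 2) in *size; returns NULL if allocation fails.
 */
char *
make_scan_buffer(const char *src, size_t len, size_t *size)
{
	char *base;

	assert(src != NULL && size != NULL);

	base = malloc(len + 2);
	if (base == NULL)
		return NULL;

	memcpy(base, src, len);
	base[len] = '\0';	/* first end-of-buffer byte */
	base[len + 1] = '\0';	/* second end-of-buffer byte */
	*size = len + 2;
	return base;
}
```

In a scanner, the returned pointer and size would then be passed to
yy_scan_buffer(), with the size cast to yy_size_t and the result checked
for a nil pointer.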
                   1825: .SH END-OF-FILE RULES
                   1826: The special rule "<<EOF>>" indicates
                   1827: actions which are to be taken when an end-of-file is
                   1828: encountered and yywrap() returns non-zero (i.e., indicates
                   1829: no further files to process).  The action must finish
                   1830: by doing one of four things:
                   1831: .IP -
                   1832: assigning
                   1833: .I yyin
                   1834: to a new input file (in previous versions of flex, after doing the
                   1835: assignment you had to call the special action
                   1836: .B YY_NEW_FILE;
                   1837: this is no longer necessary);
                   1838: .IP -
                   1839: executing a
                   1840: .I return
                   1841: statement;
                   1842: .IP -
                   1843: executing the special
                   1844: .B yyterminate()
                   1845: action;
                   1846: .IP -
                   1847: or, switching to a new buffer using
                   1848: .B yy_switch_to_buffer()
                   1849: as shown in the example above.
                   1850: .PP
                   1851: <<EOF>> rules may not be used with other
                   1852: patterns; they may only be qualified with a list of start
                   1853: conditions.  If an unqualified <<EOF>> rule is given, it
                   1854: applies to
                   1855: .I all
                   1856: start conditions which do not already have <<EOF>> actions.  To
                   1857: specify an <<EOF>> rule for only the initial start condition, use
                   1858: .nf
                   1859:
                   1860:     <INITIAL><<EOF>>
                   1861:
                   1862: .fi
                   1863: .PP
                   1864: These rules are useful for catching things like unclosed comments.
                   1865: An example:
                   1866: .nf
                   1867:
                   1868:     %x quote
                   1869:     %%
                   1870:
                   1871:     ...other rules for dealing with quotes...
                   1872:
                   1873:     <quote><<EOF>>   {
                   1874:              error( "unterminated quote" );
                   1875:              yyterminate();
                   1876:              }
                   1877:     <<EOF>>  {
                   1878:              if ( *++filelist )
                   1879:                  yyin = fopen( *filelist, "r" );
                   1880:              else
                   1881:                  yyterminate();
                   1882:              }
                   1883:
                   1884: .fi
                   1885: .SH MISCELLANEOUS MACROS
                   1886: The macro
                   1887: .B YY_USER_ACTION
                   1888: can be defined to provide an action
                   1889: which is always executed prior to the matched rule's action.  For example,
                   1890: it could be #define'd to call a routine to convert yytext to lower-case.
                   1891: When
                   1892: .B YY_USER_ACTION
                   1893: is invoked, the variable
                   1894: .I yy_act
                   1895: gives the number of the matched rule (rules are numbered starting with 1).
                   1896: Suppose you want to profile how often each of your rules is matched.  The
                   1897: following would do the trick:
                   1898: .nf
                   1899:
                   1900:     #define YY_USER_ACTION ++ctr[yy_act];
                   1901:
                   1902: .fi
                   1903: where
                   1904: .I ctr
                   1905: is an array to hold the counts for the different rules.  Note that
                   1906: the macro
                   1907: .B YY_NUM_RULES
                   1908: gives the total number of rules (including the default rule, even if
                   1909: you use
                   1910: .B \-s),
                   1911: so a correct declaration for
                   1912: .I ctr
                   1913: is:
                   1914: .nf
                   1915:
                   1916:     int ctr[YY_NUM_RULES];
                   1917:
                   1918: .fi
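                         .PP
                         As a rough sketch of how these pieces might fit together (the two rules
                         and the
                         .I main()
                         routine are illustrative assumptions, not anything that
                         .I flex
                         supplies), a scanner that reports how often each rule matched could
                         look like this.  The array is given one spare element on the assumption
                         that rule numbers run from 1 through
                         .B YY_NUM_RULES,
                         and the macro body ends in a semicolon because it is expanded
                         directly in front of each action:
                         .nf

                             %{
                             #include <stdio.h>
                             /* one spare slot, since rules are numbered from 1 */
                             int ctr[YY_NUM_RULES + 1];
                             #define YY_USER_ACTION ++ctr[yy_act];
                             %}
                             %option noyywrap
                             %%
                             [0-9]+    ;   /* rule 1: discard numbers */
                             [a-z]+    ;   /* rule 2: discard words */
                             %%
                             int main()
                                 {
                                 int i;

                                 yylex();
                                 /* anything unmatched was counted under the default rule */
                                 for ( i = 1; i <= YY_NUM_RULES; ++i )
                                     printf( "rule %d matched %d times\\n", i, ctr[i] );
                                 return 0;
                                 }

                         .fi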
                   1919: .PP
                   1920: The macro
                   1921: .B YY_USER_INIT
                   1922: may be defined to provide an action which is always executed before
                   1923: the first scan (and before the scanner's internal initializations are done).
                   1924: For example, it could be used to call a routine to read
                   1925: in a data table or open a logging file.
                   1926: .PP
                   1927: The macro
                   1928: .B yy_set_interactive(is_interactive)
                   1929: can be used to control whether the current buffer is considered
                   1930: .I interactive.
                   1931: An interactive buffer is processed more slowly,
                   1932: but must be used when the scanner's input source is indeed
                   1933: interactive to avoid problems due to waiting to fill buffers
                   1934: (see the discussion of the
                   1935: .B \-I
                   1936: flag below).  A non-zero value
1.7       aaron    1937: in the macro invocation marks the buffer as interactive; a zero
1.1       deraadt  1938: value marks it as non-interactive.  Note that use of this macro overrides
                   1939: .B %option always-interactive
                   1940: or
                   1941: .B %option never-interactive
                   1942: (see Options below).
                   1943: .B yy_set_interactive()
                   1944: must be invoked prior to beginning to scan the buffer that is
                   1945: (or is not) to be considered interactive.
                   1946: .PP
                   1947: The macro
                   1948: .B yy_set_bol(at_bol)
                   1949: can be used to control whether the current buffer's scanning
                   1950: context for the next token match is done as though at the
                   1951: beginning of a line.  A non-zero macro argument makes rules anchored with
1.10      deraadt  1952: \&'^' active, while a zero argument makes '^' rules inactive.
1.1       deraadt  1953: .PP
                   1954: The macro
                   1955: .B YY_AT_BOL()
                   1956: returns true if the next token scanned from the current buffer
                   1957: will have '^' rules active, false otherwise.
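                         .PP
                         For instance, a scanner might want the text following a ';' to be
                         treated as though it began a new line.  A minimal sketch, in which
                         TOK_DIRECTIVE is a hypothetical token name assumed to be defined
                         elsewhere:
                         .nf

                             ";"         yy_set_bol( 1 );  /* next match: '^' rules active */
                             ^"#"[a-z]+  return TOK_DIRECTIVE;

                         .fi
                         With these two rules, a "#name" directive is recognized both at the
                         true start of a line and immediately after a ';'.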
                   1958: .PP
                   1959: In the generated scanner, the actions are all gathered in one large
                   1960: switch statement and separated using
                   1961: .B YY_BREAK,
                   1962: which may be redefined.  By default, it is simply a "break", to separate
1.10      deraadt  1963: each rule's action from the following rules.
1.1       deraadt  1964: Redefining
                   1965: .B YY_BREAK
                   1966: allows, for example, C++ users to
                   1967: #define YY_BREAK to do nothing (while being very careful that every
                   1968: rule ends with a "break" or a "return"!) and so avoid the
                   1969: unreachable-statement warnings that arise when a rule's action ends with
                   1970: "return", leaving the
                   1971: .B YY_BREAK
                   1972: that follows it inaccessible.
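                         .PP
                         A minimal sketch of such a redefinition follows.  TOK_WORD and its
                         value are hypothetical, chosen only for illustration.  The final
                         catch-all rule keeps input from reaching the default rule, whose
                         action is not under the user's control:
                         .nf

                             %{
                             #define YY_BREAK      /* empty: each action must end itself */
                             #define TOK_WORD 257  /* hypothetical token code */
                             %}
                             %%
                             [a-z]+    return TOK_WORD;
                             .|\\n      break;

                         .fi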
                   1973: .SH VALUES AVAILABLE TO THE USER
                   1974: This section summarizes the various values available to the user
                   1975: in the rule actions.
                   1976: .IP -
                   1977: .B char *yytext
                   1978: holds the text of the current token.  It may be modified but not lengthened
                   1979: (you cannot append characters to the end).
                   1980: .IP
                   1981: If the special directive
                   1982: .B %array
                   1983: appears in the first section of the scanner description, then
                   1984: .B yytext
                   1985: is instead declared
                   1986: .B char yytext[YYLMAX],
                   1987: where
                   1988: .B YYLMAX
                   1989: is a macro definition that you can redefine in the first section
                   1990: if you don't like the default value (generally 8KB).  Using
                   1991: .B %array
                   1992: results in somewhat slower scanners, but the value of
                   1993: .B yytext
                   1994: becomes immune to calls to
                   1995: .I input()
                   1996: and
                   1997: .I unput(),
                   1998: which potentially destroy its value when
                   1999: .B yytext
                   2000: is a character pointer.  The opposite of
                   2001: .B %array
                   2002: is
                   2003: .B %pointer,
                   2004: which is the default.
                   2005: .IP
                   2006: You cannot use
                   2007: .B %array
                   2008: when generating C++ scanner classes
                   2009: (the
                   2010: .B \-+
                   2011: flag).
                   2012: .IP -
                   2013: .B int yyleng
                   2014: holds the length of the current token.
                   2015: .IP -
                   2016: .B FILE *yyin
                   2017: is the file which by default
                   2018: .I flex
                   2019: reads from.  It may be redefined but doing so only makes sense before
                   2020: scanning begins or after an EOF has been encountered.  Changing it in
                   2021: the midst of scanning will have unexpected results since
                   2022: .I flex
                   2023: buffers its input; use
                   2024: .B yyrestart()
                   2025: instead.
                   2026: Once scanning terminates because an end-of-file
                   2027: has been seen, you can point
                   2028: .I yyin
                   2029: at the new input file and then call the scanner again to continue scanning.
                   2030: .IP -
                   2031: .B void yyrestart( FILE *new_file )
                   2032: may be called to point
                   2033: .I yyin
                   2034: at the new input file.  The switch-over to the new file is immediate
                   2035: (any previously buffered-up input is lost).  Note that calling
                   2036: .B yyrestart()
                   2037: with
                   2038: .I yyin
                   2039: as an argument thus throws away the current input buffer and continues
                   2040: scanning the same input file.
                   2041: .IP -
                   2042: .B FILE *yyout
                   2043: is the file to which
                   2044: .B ECHO
                   2045: actions are done.  It can be reassigned by the user.
                   2046: .IP -
                   2047: .B YY_CURRENT_BUFFER
                   2048: returns a
                   2049: .B YY_BUFFER_STATE
                   2050: handle to the current buffer.
                   2051: .IP -
                   2052: .B YY_START
                   2053: returns an integer value corresponding to the current start
                   2054: condition.  You can subsequently use this value with
                   2055: .B BEGIN
                   2056: to return to that start condition.
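                         .PP
                         As an illustration of saving and restoring a start condition with
                         .B YY_START,
                         the following sketch skips C-style comments and then resumes whatever
                         start condition was active when the comment began.  The names
                         .I comment
                         and
                         .I comment_caller
                         are assumptions made for this example only:
                         .nf

                             %x comment
                             %{
                             int comment_caller;   /* start condition to go back to */
                             %}
                             %%
                             "/*"            {
                                             comment_caller = YY_START;
                                             BEGIN(comment);
                                             }
                             <comment>"*/"   BEGIN(comment_caller);
                             <comment>.|\\n   ;   /* discard the comment text */

                         .fi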
                   2057: .SH INTERFACING WITH YACC
                   2058: One of the main uses of
                   2059: .I flex
                   2060: is as a companion to the
                   2061: .I yacc
                   2062: parser-generator.
                   2063: .I yacc
                   2064: parsers expect to call a routine named
                   2065: .B yylex()
                   2066: to find the next input token.  The routine is supposed to
                   2067: return the type of the next token as well as putting any associated
                   2068: value in the global
                   2069: .B yylval.
                   2070: To use
                   2071: .I flex
                   2072: with
                   2073: .I yacc,
                   2074: one specifies the
                   2075: .B \-d
                   2076: option to
                   2077: .I yacc
                   2078: to instruct it to generate the file
                   2079: .B y.tab.h
                   2080: containing definitions of all the
                   2081: .B %tokens
                   2082: appearing in the
                   2083: .I yacc
                   2084: input.  This file is then included in the
                   2085: .I flex
                   2086: scanner.  For example, if one of the tokens is "TOK_NUMBER",
                   2087: part of the scanner might look like:
                   2088: .nf
                   2089:
                   2090:     %{
                   2091:     #include "y.tab.h"
                   2092:     %}
                   2093:
                   2094:     %%
                   2095:
                   2096:     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
                   2097:
                   2098: .fi
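                         .PP
                         A typical build might then look roughly like the following.  The file
                         names parse.y and scan.l are placeholders, and the sketch assumes that
                         the system's
                         .I yacc
                         and
                         .I lex
                         libraries are available to supply default versions of
                         .I main(),
                         .I yyerror(),
                         and
                         .I yywrap():
                         .nf

                             yacc -d parse.y      # writes y.tab.c and y.tab.h
                             flex scan.l          # writes lex.yy.c
                             cc -o parser y.tab.c lex.yy.c -ly -ll

                         .fi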
                   2099: .SH OPTIONS
                   2100: .I flex
                   2101: has the following options:
                   2102: .TP
                   2103: .B \-b
                   2104: Generate backing-up information to
                   2105: .I lex.backup.
                   2106: This is a list of scanner states which require backing up
                   2107: and the input characters on which they do so.  By adding rules one
                   2108: can remove backing-up states.  If
                   2109: .I all
                   2110: backing-up states are eliminated and
                   2111: .B \-Cf
                   2112: or
                   2113: .B \-CF
                   2114: is used, the generated scanner will run faster (see the
                   2115: .B \-p
                   2116: flag).  Only users who wish to squeeze every last cycle out of their
                   2117: scanners need worry about this option.  (See the section on Performance
                   2118: Considerations below.)
                   2119: .TP
                   2120: .B \-c
                   2121: is a do-nothing, deprecated option included for POSIX compliance.
                   2122: .TP
                   2123: .B \-d
                   2124: makes the generated scanner run in
                   2125: .I debug
                   2126: mode.  Whenever a pattern is recognized and the global
                   2127: .B yy_flex_debug
                   2128: is non-zero (which is the default),
                   2129: the scanner will write to
                   2130: .I stderr
                   2131: a line of the form:
                   2132: .nf
                   2133:
                   2134:     --accepting rule at line 53 ("the matched text")
                   2135:
                   2136: .fi
                   2137: The line number refers to the location of the rule in the file
                   2138: defining the scanner (i.e., the file that was fed to flex).  Messages
                   2139: are also generated when the scanner backs up, accepts the
                   2140: default rule, reaches the end of its input buffer (or encounters
                   2141: a NUL; at this point, the two look the same as far as the scanner's concerned),
                   2142: or reaches an end-of-file.
                   2143: .TP
                   2144: .B \-f
                   2145: specifies
                   2146: .I fast scanner.
                   2147: No table compression is done and stdio is bypassed.
                   2148: The result is large but fast.  This option is equivalent to
                   2149: .B \-Cfr
                   2150: (see below).
                   2151: .TP
                   2152: .B \-h
                   2153: generates a "help" summary of
                   2154: .I flex's
                   2155: options to
1.7       aaron    2156: .I stdout
1.1       deraadt  2157: and then exits.
                   2158: .B \-?
                   2159: and
                   2160: .B \-\-help
                   2161: are synonyms for
                   2162: .B \-h.
                   2163: .TP
                   2164: .B \-i
                   2165: instructs
                   2166: .I flex
                   2167: to generate a
                   2168: .I case-insensitive
                   2169: scanner.  The case of letters given in the
                   2170: .I flex
                   2171: input patterns will
                   2172: be ignored, and tokens in the input will be matched regardless of case.  The
                   2173: matched text given in
                   2174: .I yytext
                   2175: will have the preserved case (i.e., it will not be folded).
                   2176: .TP
                   2177: .B \-l
                   2178: turns on maximum compatibility with the original AT&T
                   2179: .I lex
                   2180: implementation.  Note that this does not mean
                   2181: .I full
                   2182: compatibility.  Use of this option costs a considerable amount of
                   2183: performance, and it cannot be used with the
                   2184: .B \-+, \-f, \-F, \-Cf,
                   2185: or
                   2186: .B \-CF
                   2187: options.  For details on the compatibilities it provides, see the section
                   2188: "Incompatibilities With Lex And POSIX" below.  This option also results
                   2189: in the name
                   2190: .B YY_FLEX_LEX_COMPAT
                   2191: being #define'd in the generated scanner.
                   2192: .TP
                   2193: .B \-n
                   2194: is another do-nothing, deprecated option included only for
                   2195: POSIX compliance.
                   2196: .TP
                   2197: .B \-p
                   2198: generates a performance report to stderr.  The report
                   2199: consists of comments regarding features of the
                   2200: .I flex
                   2201: input file which will cause a serious loss of performance in the resulting
                   2202: scanner.  If you give the flag twice, you will also get comments regarding
                   2203: features that lead to minor performance losses.
                   2204: .IP
                   2205: Note that the use of
                   2206: .B REJECT,
                   2207: .B %option yylineno,
                   2208: and variable trailing context (see the Deficiencies / Bugs section below)
                   2209: entails a substantial performance penalty; use of
                   2210: .I yymore(),
                   2211: the
                   2212: .B ^
                   2213: operator,
                   2214: and the
                   2215: .B \-I
                   2216: flag entail minor performance penalties.
                   2217: .TP
                   2218: .B \-s
                   2219: causes the
                   2220: .I default rule
                   2221: (that unmatched scanner input is echoed to
                   2222: .I stdout)
                   2223: to be suppressed.  If the scanner encounters input that does not
                   2224: match any of its rules, it aborts with an error.  This option is
                   2225: useful for finding holes in a scanner's rule set.
                   2226: .TP
                   2227: .B \-t
                   2228: instructs
                   2229: .I flex
                   2230: to write the scanner it generates to standard output instead
                   2231: of
                   2232: .B lex.yy.c.
                   2233: .TP
                   2234: .B \-v
                   2235: specifies that
                   2236: .I flex
                   2237: should write to
                   2238: .I stderr
                   2239: a summary of statistics regarding the scanner it generates.
                   2240: Most of the statistics are meaningless to the casual
                   2241: .I flex
                   2242: user, but the first line identifies the version of
                   2243: .I flex
                   2244: (same as reported by
                   2245: .B \-V),
                   2246: and the next line the flags used when generating the scanner, including
                   2247: those that are on by default.
                   2248: .TP
                   2249: .B \-w
                   2250: suppresses warning messages.
                   2251: .TP
                   2252: .B \-B
                   2253: instructs
                   2254: .I flex
                   2255: to generate a
                   2256: .I batch
                   2257: scanner, the opposite of
                   2258: .I interactive
                   2259: scanners generated by
                   2260: .B \-I
                   2261: (see below).  In general, you use
                   2262: .B \-B
                   2263: when you are
                   2264: .I certain
                   2265: that your scanner will never be used interactively, and you want to
                   2266: squeeze a
                   2267: .I little
                   2268: more performance out of it.  If your goal is instead to squeeze out a
                   2269: .I lot
                   2270: more performance, you should be using the
                   2271: .B \-Cf
                   2272: or
                   2273: .B \-CF
                   2274: options (discussed below), which turn on
                   2275: .B \-B
                   2276: automatically anyway.
                   2277: .TP
                   2278: .B \-F
                   2279: specifies that the
                   2280: .ul
                   2281: fast
                   2282: scanner table representation should be used (and stdio
                   2283: bypassed).  This representation is
                   2284: about as fast as the full table representation
                   2285: .B (\-f),
                   2286: and for some sets of patterns will be considerably smaller (and for
                   2287: others, larger).  In general, if the pattern set contains both "keywords"
                   2288: and a catch-all, "identifier" rule, such as in the set:
                   2289: .nf
                   2290:
                   2291:     "case"    return TOK_CASE;
                   2292:     "switch"  return TOK_SWITCH;
                   2293:     ...
                   2294:     "default" return TOK_DEFAULT;
                   2295:     [a-z]+    return TOK_ID;
                   2296:
                   2297: .fi
                   2298: then you're better off using the full table representation.  If only
                   2299: the "identifier" rule is present and you then use a hash table or some such
                   2300: to detect the keywords, you're better off using
                   2301: .B \-F.
                   2302: .IP
                   2303: This option is equivalent to
                   2304: .B \-CFr
                   2305: (see below).  It cannot be used with
                   2306: .B \-+.
                   2307: .TP
                   2308: .B \-I
                   2309: instructs
                   2310: .I flex
                   2311: to generate an
                   2312: .I interactive
                   2313: scanner.  An interactive scanner is one that only looks ahead to decide
                   2314: what token has been matched if it absolutely must.  It turns out that
                   2315: always looking one extra character ahead, even if the scanner has already
                   2316: seen enough text to disambiguate the current token, is a bit faster than
                   2317: only looking ahead when necessary.  But scanners that always look ahead
                   2318: give dreadful interactive performance; for example, when a user types
                   2319: a newline, it is not recognized as a newline token until they enter
                   2320: .I another
                   2321: token, which often means typing in another whole line.
                   2322: .IP
                   2323: .I Flex
                   2324: scanners default to
                   2325: .I interactive
                   2326: unless you use the
                   2327: .B \-Cf
                   2328: or
                   2329: .B \-CF
                   2330: table-compression options (see below).  That's because if you're looking
                   2331: for high-performance you should be using one of these options, so if you
                   2332: didn't,
                   2333: .I flex
                   2334: assumes you'd rather trade off a bit of run-time performance for intuitive
                   2335: interactive behavior.  Note also that you
                   2336: .I cannot
                   2337: use
                   2338: .B \-I
                   2339: in conjunction with
                   2340: .B \-Cf
                   2341: or
                   2342: .B \-CF.
                   2343: Thus, this option is not really needed; it is on by default for all those
                   2344: cases in which it is allowed.
                   2345: .IP
                   2346: You can force a scanner to
                   2347: .I not
                   2348: be interactive by using
                   2349: .B \-B
                   2350: (see above).
                   2351: .TP
                   2352: .B \-L
                   2353: instructs
                   2354: .I flex
                   2355: not to generate
                   2356: .B #line
                   2357: directives.  Without this option,
                   2358: .I flex
                   2359: peppers the generated scanner
                   2360: with #line directives so error messages in the actions will be correctly
                   2361: located with respect to either the original
                   2362: .I flex
                   2363: input file (if the errors are due to code in the input file), or
                   2364: .B lex.yy.c
                   2365: (if the errors are
                   2366: .I flex's
                   2367: fault -- you should report these sorts of errors to the email address
                   2368: given below).
                   2369: .TP
                   2370: .B \-T
                   2371: makes
                   2372: .I flex
                   2373: run in
                   2374: .I trace
                   2375: mode.  It will generate a lot of messages to
                   2376: .I stderr
                   2377: concerning
                   2378: the form of the input and the resultant non-deterministic and deterministic
                   2379: finite automata.  This option is mostly for use in maintaining
                   2380: .I flex.
                   2381: .TP
                   2382: .B \-V
                   2383: prints the version number to
                   2384: .I stdout
                   2385: and exits.
                   2386: .B \-\-version
                   2387: is a synonym for
                   2388: .B \-V.
                   2389: .TP
                   2390: .B \-7
                   2391: instructs
                   2392: .I flex
                   2393: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
                   2394: characters in its input.  The advantage of using
                   2395: .B \-7
                   2396: is that the scanner's tables can be up to half the size of those generated
                   2397: using the
                   2398: .B \-8
                   2399: option (see below).  The disadvantage is that such scanners often hang
                   2400: or crash if their input contains an 8-bit character.
                   2401: .IP
                   2402: Note, however, that unless you generate your scanner using the
                   2403: .B \-Cf
                   2404: or
                   2405: .B \-CF
                   2406: table compression options, use of
                   2407: .B \-7
                   2408: will save only a small amount of table space, and make your scanner
                   2409: considerably less portable.
                   2410: .I Flex's
                   2411: default behavior is to generate an 8-bit scanner unless you use
                   2412: .B \-Cf
                   2413: or
                   2414: .B \-CF,
                   2415: in which case
                   2416: .I flex
                   2417: defaults to generating 7-bit scanners unless your site was always
                   2418: configured to generate 8-bit scanners (as will often be the case
                   2419: with non-USA sites).  You can tell whether flex generated a 7-bit
                   2420: or an 8-bit scanner by inspecting the flag summary in the
                   2421: .B \-v
                   2422: output as described above.
                   2423: .IP
                   2424: Note that if you use
                   2425: .B \-Cfe
                   2426: or
                   2427: .B \-CFe
                   2428: (those table compression options, but also using equivalence classes as
                   2429: discussed below), flex still defaults to generating an 8-bit
                   2430: scanner, since usually with these compression options full 8-bit tables
                   2431: are not much more expensive than 7-bit tables.
                   2432: .TP
                   2433: .B \-8
                   2434: instructs
                   2435: .I flex
                   2436: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
                   2437: characters.  This flag is only needed for scanners generated using
                   2438: .B \-Cf
                   2439: or
                   2440: .B \-CF,
                   2441: as otherwise flex defaults to generating an 8-bit scanner anyway.
                   2442: .IP
                   2443: See the discussion of
                   2444: .B \-7
                   2445: above for flex's default behavior and the tradeoffs between 7-bit
                   2446: and 8-bit scanners.
                   2447: .TP
                   2448: .B \-+
                   2449: specifies that you want flex to generate a C++
                   2450: scanner class.  See the section on Generating C++ Scanners below for
                   2451: details.
1.7       aaron    2452: .TP
1.1       deraadt  2453: .B \-C[aefFmr]
                   2454: controls the degree of table compression and, more generally, trade-offs
                   2455: between small scanners and fast scanners.
                   2456: .IP
                   2457: .B \-Ca
                   2458: ("align") instructs flex to trade off larger tables in the
                   2459: generated scanner for faster performance because the elements of
                   2460: the tables are better aligned for memory access and computation.  On some
                   2461: RISC architectures, fetching and manipulating longwords is more efficient
                   2462: than with smaller-sized units such as shortwords.  This option can
                   2463: double the size of the tables used by your scanner.
                   2464: .IP
                   2465: .B \-Ce
                   2466: directs
                   2467: .I flex
                   2468: to construct
                   2469: .I equivalence classes,
                   2470: i.e., sets of characters
                   2471: which have identical lexical properties (for example, if the only
                   2472: appearance of digits in the
                   2473: .I flex
                   2474: input is in the character class
                   2475: "[0-9]" then the digits '0', '1', ..., '9' will all be put
                   2476: in the same equivalence class).  Equivalence classes usually give
                   2477: dramatic reductions in the final table/object file sizes (typically
                   2478: a factor of 2-5) and are pretty cheap performance-wise (one array
                   2479: look-up per character scanned).
                   2480: .IP
                   2481: .B \-Cf
                   2482: specifies that the
                   2483: .I full
                   2484: scanner tables should be generated -
                   2485: .I flex
                   2486: should not compress the
1.10      deraadt  2487: tables by taking advantage of similar transition functions for
1.1       deraadt  2488: different states.
                   2489: .IP
                   2490: .B \-CF
                   2491: specifies that the alternate fast scanner representation (described
                   2492: above under the
                   2493: .B \-F
                   2494: flag)
                   2495: should be used.  This option cannot be used with
                   2496: .B \-+.
                   2497: .IP
                   2498: .B \-Cm
                   2499: directs
                   2500: .I flex
                   2501: to construct
                   2502: .I meta-equivalence classes,
                   2503: which are sets of equivalence classes (or characters, if equivalence
                   2504: classes are not being used) that are commonly used together.  Meta-equivalence
                   2505: classes are often a big win when using compressed tables, but they
                   2506: have a moderate performance impact (one or two "if" tests and one
                   2507: array look-up per character scanned).
                   2508: .IP
                   2509: .B \-Cr
                   2510: causes the generated scanner to
                   2511: .I bypass
                   2512: use of the standard I/O library (stdio) for input.  Instead of calling
                   2513: .B fread()
                   2514: or
                   2515: .B getc(),
                   2516: the scanner will use the
                   2517: .B read()
                   2518: system call, resulting in a performance gain which varies from system
                   2519: to system, but in general is probably negligible unless you are also using
                   2520: .B \-Cf
                   2521: or
                   2522: .B \-CF.
                   2523: Using
                   2524: .B \-Cr
                   2525: can cause strange behavior if, for example, you read from
                   2526: .I yyin
                   2527: using stdio prior to calling the scanner (because the scanner will miss
                   2528: whatever text your previous reads left in the stdio input buffer).
                   2529: .IP
                   2530: .B \-Cr
                   2531: has no effect if you define
                   2532: .B YY_INPUT
                   2533: (see The Generated Scanner above).
                   2534: .IP
                   2535: A lone
                   2536: .B \-C
                   2537: specifies that the scanner tables should be compressed but neither
                   2538: equivalence classes nor meta-equivalence classes should be used.
                   2539: .IP
                   2540: The options
                   2541: .B \-Cf
                   2542: or
                   2543: .B \-CF
                   2544: and
                   2545: .B \-Cm
                   2546: do not make sense together - there is no opportunity for meta-equivalence
                   2547: classes if the table is not being compressed.  Otherwise the options
                   2548: may be freely mixed, and are cumulative.
                   2549: .IP
                   2550: The default setting is
                   2551: .B \-Cem,
                   2552: which specifies that
                   2553: .I flex
                   2554: should generate equivalence classes
                   2555: and meta-equivalence classes.  This setting provides the highest
                   2556: degree of table compression.  You can obtain
                   2557: faster-executing scanners at the cost of larger tables, with
                   2558: the following generally being true:
                   2559: .nf
                   2560:
                   2561:     slowest & smallest
                   2562:           -Cem
                   2563:           -Cm
                   2564:           -Ce
                   2565:           -C
                   2566:           -C{f,F}e
                   2567:           -C{f,F}
                   2568:           -C{f,F}a
                   2569:     fastest & largest
                   2570:
                   2571: .fi
                   2572: Note that scanners with the smallest tables are usually generated and
                   2573: compiled the quickest, so
                   2574: during development you will usually want to use the default, maximal
                   2575: compression.
                   2576: .IP
                   2577: .B \-Cfe
                   2578: is often a good compromise between speed and size for production
                   2579: scanners.
                   2580: .TP
                   2581: .B \-ooutput
                   2582: directs flex to write the scanner to the file
                   2583: .B output
                   2584: instead of
                   2585: .B lex.yy.c.
                   2586: If you combine
                   2587: .B \-o
                   2588: with the
                   2589: .B \-t
                   2590: option, then the scanner is written to
                   2591: .I stdout
                   2592: but its
                   2593: .B #line
                   2594: directives (see the
                   2595: .B \-L
                   2596: option above) refer to the file
                   2597: .B output.
                   2598: .TP
                   2599: .B \-Pprefix
                   2600: changes the default
                   2601: .I "yy"
                   2602: prefix used by
                   2603: .I flex
1.6       aaron    2604: for all globally visible variable and function names to instead be
1.1       deraadt  2605: .I prefix.
                   2606: For example,
                   2607: .B \-Pfoo
                   2608: changes the name of
                   2609: .B yytext
                   2610: to
                   2611: .B footext.
                   2612: It also changes the name of the default output file from
                   2613: .B lex.yy.c
                   2614: to
                   2615: .B lex.foo.c.
                   2616: Here are all of the names affected:
                   2617: .nf
                   2618:
                   2619:     yy_create_buffer
                   2620:     yy_delete_buffer
                   2621:     yy_flex_debug
                   2622:     yy_init_buffer
                   2623:     yy_flush_buffer
                   2624:     yy_load_buffer_state
                   2625:     yy_switch_to_buffer
                   2626:     yyin
                   2627:     yyleng
                   2628:     yylex
                   2629:     yylineno
                   2630:     yyout
                   2631:     yyrestart
                   2632:     yytext
                   2633:     yywrap
                   2634:
                   2635: .fi
                   2636: (If you are using a C++ scanner, then only
                   2637: .B yywrap
                   2638: and
                   2639: .B yyFlexLexer
                   2640: are affected.)
                   2641: Within your scanner itself, you can still refer to the global variables
                   2642: and functions using either version of their name; but externally, they
                   2643: have the modified name.
                   2644: .IP
                   2645: This option lets you easily link together multiple
                   2646: .I flex
                   2647: programs into the same executable.  Note, though, that using this
                   2648: option also renames
                   2649: .B yywrap(),
                   2650: so you now
                   2651: .I must
                   2652: either
1.6       aaron    2653: provide your own (appropriately named) version of the routine for your
1.1       deraadt  2654: scanner, or use
                   2655: .B %option noyywrap,
                   2656: as linking with
                   2657: .B \-lfl
                   2658: no longer provides one for you by default.
                   2659: .TP
                   2660: .B \-Sskeleton_file
                   2661: overrides the default skeleton file from which
                   2662: .I flex
                   2663: constructs its scanners.  You'll never need this option unless you are doing
                   2664: .I flex
                   2665: maintenance or development.
                   2666: .PP
                   2667: .I flex
                   2668: also provides a mechanism for controlling options within the
                   2669: scanner specification itself, rather than from the flex command-line.
                   2670: This is done by including
                   2671: .B %option
                   2672: directives in the first section of the scanner specification.
                   2673: You can specify multiple options with a single
                   2674: .B %option
                   2675: directive, and multiple directives in the first section of your flex input
                   2676: file.
                   2677: .PP
                   2678: Most options are given simply as names, optionally preceded by the
                   2679: word "no" (with no intervening whitespace) to negate their meaning.
                   2680: A number are equivalent to flex flags or their negation:
                   2681: .nf
                   2682:
                   2683:     7bit            -7 option
                   2684:     8bit            -8 option
                   2685:     align           -Ca option
                   2686:     backup          -b option
                   2687:     batch           -B option
                   2688:     c++             -+ option
                   2689:
                   2690:     caseful or
                   2691:     case-sensitive  opposite of -i (default)
                   2692:
                   2693:     case-insensitive or
                   2694:     caseless        -i option
                   2695:
                   2696:     debug           -d option
                   2697:     default         opposite of -s option
                   2698:     ecs             -Ce option
                   2699:     fast            -F option
                   2700:     full            -f option
                   2701:     interactive     -I option
                   2702:     lex-compat      -l option
                   2703:     meta-ecs        -Cm option
                   2704:     perf-report     -p option
                   2705:     read            -Cr option
                   2706:     stdout          -t option
                   2707:     verbose         -v option
                   2708:     warn            opposite of -w option
                   2709:                     (use "%option nowarn" for -w)
                   2710:
                   2711:     array           equivalent to "%array"
                   2712:     pointer         equivalent to "%pointer" (default)
                   2713:
                   2714: .fi
                   2715: Some
                   2716: .B %option's
                   2717: provide features otherwise not available:
                   2718: .TP
                   2719: .B always-interactive
                   2720: instructs flex to generate a scanner which always considers its input
                   2721: "interactive".  Normally, on each new input file the scanner calls
                   2722: .B isatty()
                   2723: in an attempt to determine whether
                   2724: the scanner's input source is interactive and thus should be read a
                   2725: character at a time.  When this option is used, however, then no
                   2726: such call is made.
                   2727: .TP
                   2728: .B main
                   2729: directs flex to provide a default
                   2730: .B main()
                   2731: program for the scanner, which simply calls
                   2732: .B yylex().
                   2733: This option implies
                   2734: .B noyywrap
                   2735: (see below).
                   2736: .TP
                   2737: .B never-interactive
                   2738: instructs flex to generate a scanner which never considers its input
                   2739: "interactive" (again, no call made to
                   2740: .B isatty()).
                   2741: This is the opposite of
                   2742: .B always-interactive.
                   2743: .TP
                   2744: .B stack
                   2745: enables the use of start condition stacks (see Start Conditions above).
                   2746: .TP
                   2747: .B stdinit
                   2748: if set (i.e.,
                   2749: .B %option stdinit)
                   2750: initializes
                   2751: .I yyin
                   2752: and
                   2753: .I yyout
                   2754: to
                   2755: .I stdin
                   2756: and
                   2757: .I stdout,
                   2758: instead of the default of
                   2759: .I nil.
                   2760: Some existing
                   2761: .I lex
                   2762: programs depend on this behavior, even though it is not compliant with
                   2763: ANSI C, which does not require
                   2764: .I stdin
                   2765: and
                   2766: .I stdout
                   2767: to be compile-time constant.
                   2768: .TP
                   2769: .B yylineno
                   2770: directs
                   2771: .I flex
                   2772: to generate a scanner that maintains the number of the current line
                   2773: read from its input in the global variable
                   2774: .B yylineno.
                   2775: This option is implied by
                   2776: .B %option lex-compat.
                   2777: .TP
                   2778: .B yywrap
                   2779: if unset (i.e.,
                   2780: .B %option noyywrap),
                   2781: makes the scanner not call
                   2782: .B yywrap()
                   2783: upon an end-of-file, but simply assume that there are no more
                   2784: files to scan (until the user points
                   2785: .I yyin
                   2786: at a new file and calls
                   2787: .B yylex()
                   2788: again).
                   2789: .PP
                   2790: .I flex
                   2791: scans your rule actions to determine whether you use the
                   2792: .B REJECT
                   2793: or
                   2794: .B yymore()
                   2795: features.  The
                   2796: .B reject
                   2797: and
                   2798: .B yymore
                   2799: options are available to override its decision as to whether you use the
                   2800: features, either by setting them (e.g.,
                   2801: .B %option reject)
                   2802: to indicate the feature is indeed used, or
                   2803: unsetting them to indicate it actually is not used
                   2804: (e.g.,
                   2805: .B %option noyymore).
                   2806: .PP
                   2807: Three options take string-delimited values, offset with '=':
                   2808: .nf
                   2809:
                   2810:     %option outfile="ABC"
                   2811:
                   2812: .fi
                   2813: is equivalent to
                   2814: .B \-oABC,
                   2815: and
                   2816: .nf
                   2817:
                   2818:     %option prefix="XYZ"
                   2819:
                   2820: .fi
                   2821: is equivalent to
                   2822: .B \-PXYZ.
                   2823: Finally,
                   2824: .nf
                   2825:
                   2826:     %option yyclass="foo"
                   2827:
                   2828: .fi
                   2829: only applies when generating a C++ scanner (
                   2830: .B \-+
                   2831: option).  It informs
                   2832: .I flex
                   2833: that you have derived
                   2834: .B foo
                   2835: as a subclass of
                   2836: .B yyFlexLexer,
                   2837: so
                   2838: .I flex
                   2839: will place your actions in the member function
                   2840: .B foo::yylex()
                   2841: instead of
                   2842: .B yyFlexLexer::yylex().
                   2843: It also generates a
                   2844: .B yyFlexLexer::yylex()
                   2845: member function that emits a run-time error (by invoking
                   2846: .B yyFlexLexer::LexerError())
                   2847: if called.
                   2848: See Generating C++ Scanners, below, for additional information.
                   2849: .PP
                   2850: A number of options are available for lint purists who want to suppress
                   2851: the appearance of unneeded routines in the generated scanner.  Each of the
                   2852: following, if unset
                   2853: (e.g.,
                   2854: .B %option nounput
                   2855: ), results in the corresponding routine not appearing in
                   2856: the generated scanner:
                   2857: .nf
                   2858:
                   2859:     input, unput
                   2860:     yy_push_state, yy_pop_state, yy_top_state
                   2861:     yy_scan_buffer, yy_scan_bytes, yy_scan_string
                   2862:
                   2863: .fi
                   2864: (though
                   2865: .B yy_push_state()
                   2866: and friends won't appear anyway unless you use
                   2867: .B %option stack).
                   2868: .SH PERFORMANCE CONSIDERATIONS
                   2869: The main design goal of
                   2870: .I flex
                   2871: is that it generate high-performance scanners.  It has been optimized
                   2872: for dealing well with large sets of rules.  Aside from the effects on
                   2873: scanner speed of the table compression
                   2874: .B \-C
                   2875: options outlined above,
                   2876: there are a number of options/actions which degrade performance.  These
                   2877: are, from most expensive to least:
                   2878: .nf
                   2879:
                   2880:     REJECT
                   2881:     %option yylineno
                   2882:     arbitrary trailing context
                   2883:
                   2884:     pattern sets that require backing up
                   2885:     %array
                   2886:     %option interactive
                   2887:     %option always-interactive
                   2888:
                   2889:     '^' beginning-of-line operator
                   2890:     yymore()
                   2891:
                   2892: .fi
                   2893: with the first three all being quite expensive and the last two
                   2894: being quite cheap.  Note also that
                   2895: .B unput()
                   2896: is implemented as a routine call that potentially does quite a bit of
                   2897: work, while
                   2898: .B yyless()
                   2899: is a quite-cheap macro; so if just putting back some excess text you
                   2900: scanned, use
                   2901: .B yyless().
                   2902: .PP
                   2903: .B REJECT
                   2904: should be avoided at all costs when performance is important.
                   2905: It is a particularly expensive option.
                   2906: .PP
                   2907: Getting rid of backing up is messy and often may be an enormous
                   2908: amount of work for a complicated scanner.  In principle, one begins
                   2909: by using the
1.7       aaron    2910: .B \-b
1.1       deraadt  2911: flag to generate a
                   2912: .I lex.backup
                   2913: file.  For example, on the input
                   2914: .nf
                   2915:
                   2916:     %%
                   2917:     foo        return TOK_KEYWORD;
                   2918:     foobar     return TOK_KEYWORD;
                   2919:
                   2920: .fi
                   2921: the file looks like:
                   2922: .nf
                   2923:
                   2924:     State #6 is non-accepting -
                   2925:      associated rule line numbers:
                   2926:            2       3
                   2927:      out-transitions: [ o ]
                   2928:      jam-transitions: EOF [ \\001-n  p-\\177 ]
                   2929:
                   2930:     State #8 is non-accepting -
                   2931:      associated rule line numbers:
                   2932:            3
                   2933:      out-transitions: [ a ]
                   2934:      jam-transitions: EOF [ \\001-`  b-\\177 ]
                   2935:
                   2936:     State #9 is non-accepting -
                   2937:      associated rule line numbers:
                   2938:            3
                   2939:      out-transitions: [ r ]
                   2940:      jam-transitions: EOF [ \\001-q  s-\\177 ]
                   2941:
                   2942:     Compressed tables always back up.
                   2943:
                   2944: .fi
                   2945: The first few lines tell us that there's a scanner state in
                   2946: which it can make a transition on an 'o' but not on any other
                   2947: character, and that in that state the currently scanned text does not match
                   2948: any rule.  The state occurs when trying to match the rules found
                   2949: at lines 2 and 3 in the input file.
                   2950: If the scanner is in that state and then reads
                   2951: something other than an 'o', it will have to back up to find
                   2952: a rule which is matched.  With
                   2953: a bit of headscratching one can see that this must be the
                   2954: state it's in when it has seen "fo".  When this has happened,
                   2955: if anything other than another 'o' is seen, the scanner will
                   2956: have to back up to simply match the 'f' (by the default rule).
                   2957: .PP
                   2958: The comment regarding State #8 indicates there's a problem
                   2959: when "foob" has been scanned.  Indeed, on any character other
                   2960: than an 'a', the scanner will have to back up to accept "foo".
                   2961: Similarly, the comment for State #9 concerns when "fooba" has
                   2962: been scanned and an 'r' does not follow.
                   2963: .PP
                   2964: The final comment reminds us that there's no point going to
                   2965: all the trouble of removing backing up from the rules unless
                   2966: we're using
                   2967: .B \-Cf
                   2968: or
                   2969: .B \-CF,
                   2970: since there's no performance gain doing so with compressed scanners.
                   2971: .PP
                   2972: The way to remove the backing up is to add "error" rules:
                   2973: .nf
                   2974:
                   2975:     %%
                   2976:     foo         return TOK_KEYWORD;
                   2977:     foobar      return TOK_KEYWORD;
                   2978:
                   2979:     fooba       |
                   2980:     foob        |
                   2981:     fo          {
                   2982:                 /* false alarm, not really a keyword */
                   2983:                 return TOK_ID;
                   2984:                 }
                   2985:
                   2986: .fi
                   2987: .PP
                   2988: Eliminating backing up among a list of keywords can also be
                   2989: done using a "catch-all" rule:
                   2990: .nf
                   2991:
                   2992:     %%
                   2993:     foo         return TOK_KEYWORD;
                   2994:     foobar      return TOK_KEYWORD;
                   2995:
                   2996:     [a-z]+      return TOK_ID;
                   2997:
                   2998: .fi
                   2999: This is usually the best solution when appropriate.
                   3000: .PP
                   3001: Backing up messages tend to cascade.
                   3002: With a complicated set of rules it's not uncommon to get hundreds
                   3003: of messages.  If one can decipher them, though, it often
                   3004: only takes a dozen or so rules to eliminate the backing up (though
                   3005: it's easy to make a mistake and have an error rule accidentally match
                   3006: a valid token).  A possible future
                   3007: .I flex
                   3008: feature will be to automatically add rules to eliminate backing up.
                   3009: .PP
                   3010: It's important to keep in mind that you gain the benefits of eliminating
                   3011: backing up only if you eliminate
                   3012: .I every
                   3013: instance of backing up.  Leaving just one means you gain nothing.
                   3014: .PP
                   3015: .I Variable
                   3016: trailing context (where both the leading and trailing parts do not have
                   3017: a fixed length) entails almost the same performance loss as
                   3018: .B REJECT
                   3019: (i.e., substantial).  So when possible a rule like:
                   3020: .nf
                   3021:
                   3022:     %%
                   3023:     mouse|rat/(cat|dog)   run();
                   3024:
                   3025: .fi
                   3026: is better written:
                   3027: .nf
                   3028:
                   3029:     %%
                   3030:     mouse/cat|dog         run();
                   3031:     rat/cat|dog           run();
                   3032:
                   3033: .fi
                   3034: or as
                   3035: .nf
                   3036:
                   3037:     %%
                   3038:     mouse|rat/cat         run();
                   3039:     mouse|rat/dog         run();
                   3040:
                   3041: .fi
                   3042: Note that here the special '|' action does
                   3043: .I not
                   3044: provide any savings, and can even make things worse (see
                   3045: Deficiencies / Bugs below).
                   3046: .LP
                   3047: Another area where the user can increase a scanner's performance
                   3048: (and one that's easier to implement) arises from the fact that
                   3049: the longer the tokens matched, the faster the scanner will run.
                   3050: This is because with long tokens the processing of most input
                   3051: characters takes place in the (short) inner scanning loop, and
                   3052: does not often have to go through the additional work of setting up
                   3053: the scanning environment (e.g.,
                   3054: .B yytext)
                   3055: for the action.  Recall the scanner for C comments:
                   3056: .nf
                   3057:
                   3058:     %x comment
                   3059:     %%
                   3060:             int line_num = 1;
                   3061:
                   3062:     "/*"         BEGIN(comment);
                   3063:
                   3064:     <comment>[^*\\n]*
                   3065:     <comment>"*"+[^*/\\n]*
                   3066:     <comment>\\n             ++line_num;
                   3067:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3068:
                   3069: .fi
                   3070: This could be sped up by writing it as:
                   3071: .nf
                   3072:
                   3073:     %x comment
                   3074:     %%
                   3075:             int line_num = 1;
                   3076:
                   3077:     "/*"         BEGIN(comment);
                   3078:
                   3079:     <comment>[^*\\n]*
                   3080:     <comment>[^*\\n]*\\n      ++line_num;
                   3081:     <comment>"*"+[^*/\\n]*
                   3082:     <comment>"*"+[^*/\\n]*\\n ++line_num;
                   3083:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3084:
                   3085: .fi
                   3086: Now instead of each newline requiring the processing of another
                   3087: action, recognizing the newlines is "distributed" over the other rules
                   3088: to keep the matched text as long as possible.  Note that
                   3089: .I adding
                   3090: rules does
                   3091: .I not
                   3092: slow down the scanner!  The speed of the scanner is independent
                   3093: of the number of rules or (modulo the considerations given at the
                   3094: beginning of this section) how complicated the rules are with
                   3095: regard to operators such as '*' and '|'.
                   3096: .PP
                   3097: A final example in speeding up a scanner: suppose you want to scan
                   3098: through a file containing identifiers and keywords, one per line
                   3099: and with no other extraneous characters, and recognize all the
                   3100: keywords.  A natural first approach is:
                   3101: .nf
                   3102:
                   3103:     %%
                   3104:     asm      |
                   3105:     auto     |
                   3106:     break    |
                   3107:     ... etc ...
                   3108:     volatile |
                   3109:     while    /* it's a keyword */
                   3110:
                   3111:     .|\\n     /* it's not a keyword */
                   3112:
                   3113: .fi
                   3114: To eliminate the backing up, introduce a catch-all rule:
                   3115: .nf
                   3116:
                   3117:     %%
                   3118:     asm      |
                   3119:     auto     |
                   3120:     break    |
                   3121:     ... etc ...
                   3122:     volatile |
                   3123:     while    /* it's a keyword */
                   3124:
                   3125:     [a-z]+   |
                   3126:     .|\\n     /* it's not a keyword */
                   3127:
                   3128: .fi
                   3129: Now, if it's guaranteed that there's exactly one word per line,
                   3130: then we can reduce the total number of matches by a half by
                   3131: merging in the recognition of newlines with that of the other
                   3132: tokens:
                   3133: .nf
                   3134:
                   3135:     %%
                   3136:     asm\\n    |
                   3137:     auto\\n   |
                   3138:     break\\n  |
                   3139:     ... etc ...
                   3140:     volatile\\n |
                   3141:     while\\n  /* it's a keyword */
                   3142:
                   3143:     [a-z]+\\n |
                   3144:     .|\\n     /* it's not a keyword */
                   3145:
                   3146: .fi
                   3147: One has to be careful here, as we have now reintroduced backing up
                   3148: into the scanner.  In particular, while
                   3149: .I we
                   3150: know that there will never be any characters in the input stream
                   3151: other than letters or newlines,
                   3152: .I flex
                   3153: can't figure this out, and it will plan for possibly needing to back up
                   3154: when it has scanned a token like "auto" and then the next character
                   3155: is something other than a newline or a letter.  Previously it would
                   3156: then just match the "auto" rule and be done, but now it has no "auto"
1.10      deraadt  3157: rule, only an "auto\\n" rule.  To eliminate the possibility of backing up,
1.1       deraadt  3158: we could either duplicate all rules but without final newlines, or,
                   3159: since we never expect to encounter such an input and therefore don't
                   3160: care how it's classified, we can introduce one more catch-all rule, this
                   3161: one which doesn't include a newline:
                   3162: .nf
                   3163:
                   3164:     %%
                   3165:     asm\\n    |
                   3166:     auto\\n   |
                   3167:     break\\n  |
                   3168:     ... etc ...
                   3169:     volatile\\n |
                   3170:     while\\n  /* it's a keyword */
                   3171:
                   3172:     [a-z]+\\n |
                   3173:     [a-z]+   |
                   3174:     .|\\n     /* it's not a keyword */
                   3175:
                   3176: .fi
                   3177: Compiled with
                   3178: .B \-Cf,
                   3179: this is about as fast as one can get a
1.7       aaron    3180: .I flex
1.1       deraadt  3181: scanner to go for this particular problem.
                   3182: .PP
                   3183: A final note:
                   3184: .I flex
                   3185: is slow when matching NUL's, particularly when a token contains
                   3186: multiple NUL's.
                   3187: It's best to write rules which match
                   3188: .I short
                   3189: amounts of text if it's anticipated that the text will often include NUL's.
                   3190: .PP
                   3191: Another final note regarding performance: as mentioned above in the section
                   3192: How the Input is Matched, dynamically resizing
                   3193: .B yytext
                   3194: to accommodate huge tokens is a slow process because it presently requires that
                   3195: the (huge) token be rescanned from the beginning.  Thus if performance is
                   3196: vital, you should attempt to match "large" quantities of text but not
                   3197: "huge" quantities, where the cutoff between the two is at about 8K
                   3198: characters/token.
                   3199: .SH GENERATING C++ SCANNERS
                   3200: .I flex
                   3201: provides two different ways to generate scanners for use with C++.  The
                   3202: first way is to simply compile a scanner generated by
                   3203: .I flex
                   3204: using a C++ compiler instead of a C compiler.  You should not encounter
1.10      deraadt  3205: any compilation errors (please report any you find to the email address
1.1       deraadt  3206: given in the Author section below).  You can then use C++ code in your
                   3207: rule actions instead of C code.  Note that the default input source for
                   3208: your scanner remains
                   3209: .I yyin,
                   3210: and default echoing is still done to
                   3211: .I yyout.
                   3212: Both of these remain
                   3213: .I FILE *
                   3214: variables and not C++
                   3215: .I streams.
                   3216: .PP
                   3217: You can also use
                   3218: .I flex
                   3219: to generate a C++ scanner class, using the
                   3220: .B \-+
                   3221: option (or, equivalently,
                   3222: .B %option c++),
                   3223: which is automatically specified if the name of the flex
                   3224: executable ends in a '+', such as
                   3225: .I flex++.
                   3226: When using this option, flex defaults to generating the scanner to the file
                   3227: .B lex.yy.cc
                   3228: instead of
                   3229: .B lex.yy.c.
                   3230: The generated scanner includes the header file
1.5       deraadt  3231: .I g++/FlexLexer.h,
1.1       deraadt  3232: which defines the interface to two C++ classes.
                   3233: .PP
                   3234: The first class,
                   3235: .B FlexLexer,
                   3236: provides an abstract base class defining the general scanner class
                   3237: interface.  It provides the following member functions:
                   3238: .TP
                   3239: .B const char* YYText()
                   3240: returns the text of the most recently matched token, the equivalent of
                   3241: .B yytext.
                   3242: .TP
                   3243: .B int YYLeng()
                   3244: returns the length of the most recently matched token, the equivalent of
                   3245: .B yyleng.
                   3246: .TP
                   3247: .B int lineno() const
                   3248: returns the current input line number
                   3249: (see
                   3250: .B %option yylineno),
                   3251: or
                   3252: .B 1
                   3253: if
                   3254: .B %option yylineno
                   3255: was not used.
                   3256: .TP
                   3257: .B void set_debug( int flag )
                   3258: sets the debugging flag for the scanner, equivalent to assigning to
                   3259: .B yy_flex_debug
                   3260: (see the Options section above).  Note that you must build the scanner
                   3261: using
                   3262: .B %option debug
                   3263: to include debugging information in it.
                   3264: .TP
                   3265: .B int debug() const
                   3266: returns the current setting of the debugging flag.
                   3267: .PP
                   3268: Also provided are member functions equivalent to
                   3269: .B yy_switch_to_buffer(),
                   3270: .B yy_create_buffer()
                   3271: (though the first argument is an
                   3272: .B istream*
                   3273: object pointer and not a
                   3274: .B FILE*),
                   3275: .B yy_flush_buffer(),
                   3276: .B yy_delete_buffer(),
                   3277: and
                   3278: .B yyrestart()
1.10      deraadt  3279: (again, the first argument is an
1.1       deraadt  3280: .B istream*
                   3281: object pointer).
                   3282: .PP
                   3283: The second class defined in
1.5       deraadt  3284: .I g++/FlexLexer.h
1.1       deraadt  3285: is
                   3286: .B yyFlexLexer,
                   3287: which is derived from
                   3288: .B FlexLexer.
                   3289: It defines the following additional member functions:
                   3290: .TP
                   3291: .B
                   3292: yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
                   3293: constructs a
                   3294: .B yyFlexLexer
                   3295: object using the given streams for input and output.  If not specified,
                   3296: the streams default to
                   3297: .B cin
                   3298: and
                   3299: .B cout,
                   3300: respectively.
                   3301: .TP
                   3302: .B virtual int yylex()
1.10      deraadt  3303: performs the same role as
1.1       deraadt  3304: .B yylex()
                   3305: does for ordinary flex scanners: it scans the input stream, consuming
                   3306: tokens, until a rule's action returns a value.  If you derive a subclass
                   3307: .B S
                   3308: from
                   3309: .B yyFlexLexer
                   3310: and want to access the member functions and variables of
                   3311: .B S
                   3312: inside
                   3313: .B yylex(),
                   3314: then you need to use
                   3315: .B %option yyclass="S"
                   3316: to inform
                   3317: .I flex
                   3318: that you will be using that subclass instead of
                   3319: .B yyFlexLexer.
                   3320: In this case, rather than generating
                   3321: .B yyFlexLexer::yylex(),
                   3322: .I flex
                   3323: generates
                   3324: .B S::yylex()
                   3325: (and also generates a dummy
                   3326: .B yyFlexLexer::yylex()
                   3327: that calls
                   3328: .B yyFlexLexer::LexerError()
                   3329: if called).
                   3330: .TP
                   3331: .B
                   3332: virtual void switch_streams(istream* new_in = 0,
                   3333: .B
                   3334: ostream* new_out = 0)
                   3335: reassigns
                   3336: .B yyin
                   3337: to
                   3338: .B new_in
                   3339: (if non-nil)
                   3340: and
                   3341: .B yyout
                   3342: to
                   3343: .B new_out
                   3344: (ditto), deleting the previous input buffer if
                   3345: .B yyin
                   3346: is reassigned.
                   3347: .TP
                   3348: .B
                   3349: int yylex( istream* new_in, ostream* new_out = 0 )
                   3350: first switches the input streams via
                   3351: .B switch_streams( new_in, new_out )
                   3352: and then returns the value of
                   3353: .B yylex().
                   3354: .PP
                   3355: In addition,
                   3356: .B yyFlexLexer
                   3357: defines the following protected virtual functions which you can redefine
                   3358: in derived classes to tailor the scanner:
                   3359: .TP
                   3360: .B
                   3361: virtual int LexerInput( char* buf, int max_size )
                   3362: reads up to
                   3363: .B max_size
                   3364: characters into
                   3365: .B buf
                   3366: and returns the number of characters read.  To indicate end-of-input,
                   3367: return 0 characters.  Note that "interactive" scanners (see the
                   3368: .B \-B
                   3369: and
                   3370: .B \-I
                   3371: flags) define the macro
                   3372: .B YY_INTERACTIVE.
                   3373: If you redefine
                   3374: .B LexerInput()
                   3375: and need to take different actions depending on whether or not
                   3376: the scanner might be scanning an interactive input source, you can
                   3377: test for the presence of this name via
                   3378: .B #ifdef.
                   3379: .TP
                   3380: .B
                   3381: virtual void LexerOutput( const char* buf, int size )
                   3382: writes out
                   3383: .B size
                   3384: characters from the buffer
                   3385: .B buf,
                   3386: which, while NUL-terminated, may also contain "internal" NUL's if
                   3387: the scanner's rules can match text with NUL's in them.
                   3388: .TP
                   3389: .B
                   3390: virtual void LexerError( const char* msg )
                   3391: reports a fatal error message.  The default version of this function
                   3392: writes the message to the stream
                   3393: .B cerr
                   3394: and exits.
                   3395: .PP
                   3396: Note that a
                   3397: .B yyFlexLexer
                   3398: object contains its
                   3399: .I entire
                   3400: scanning state.  Thus you can use such objects to create reentrant
                   3401: scanners.  You can instantiate multiple instances of the same
                   3402: .B yyFlexLexer
                   3403: class, and you can also combine multiple C++ scanner classes together
                   3404: in the same program using the
                   3405: .B \-P
                   3406: option discussed above.
                   3407: .PP
                   3408: Finally, note that the
                   3409: .B %array
                   3410: feature is not available to C++ scanner classes; you must use
                   3411: .B %pointer
                   3412: (the default).
                   3413: .PP
                   3414: Here is an example of a simple C++ scanner:
                   3415: .nf
                   3416:
                   3417:         // An example of using the flex C++ scanner class.
                   3418:
                   3419:     %{
                   3420:     int mylineno = 0;
                   3421:     %}
                   3422:
                   3423:     string  \\"[^\\n"]+\\"
                   3424:
                   3425:     ws      [ \\t]+
                   3426:
                   3427:     alpha   [A-Za-z]
                   3428:     dig     [0-9]
                   3429:     name    ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
                   3430:     num1    [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
                   3431:     num2    [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
                   3432:     number  {num1}|{num2}
                   3433:
                   3434:     %%
                   3435:
                   3436:     {ws}    /* skip blanks and tabs */
                   3437:
                   3438:     "/*"    {
                   3439:             int c;
                   3440:
                   3441:             while((c = yyinput()) != 0)
                   3442:                 {
                   3443:                 if(c == '\\n')
                   3444:                     ++mylineno;
                   3445:
                   3446:                 else if(c == '*')
                   3447:                     {
                   3448:                     if((c = yyinput()) == '/')
                   3449:                         break;
                   3450:                     else
                   3451:                         unput(c);
                   3452:                     }
                   3453:                 }
                   3454:             }
                   3455:
                   3456:     {number}  cout << "number " << YYText() << '\\n';
                   3457:
                   3458:     \\n        mylineno++;
                   3459:
                   3460:     {name}    cout << "name " << YYText() << '\\n';
                   3461:
                   3462:     {string}  cout << "string " << YYText() << '\\n';
                   3463:
                   3464:     %%
                   3465:
                   3466:     int main( int /* argc */, char** /* argv */ )
                   3467:         {
                   3468:         FlexLexer* lexer = new yyFlexLexer;
                   3469:         while(lexer->yylex() != 0)
                   3470:             ;
                   3471:         return 0;
                   3472:         }
                   3473: .fi
                   3474: If you want to create multiple (different) lexer classes, you use the
                   3475: .B \-P
                   3476: flag (or the
                   3477: .B prefix=
                   3478: option) to rename each
                   3479: .B yyFlexLexer
                   3480: to some other
                   3481: .B xxFlexLexer.
                   3482: You then can include
1.5       deraadt  3483: .B <g++/FlexLexer.h>
1.1       deraadt  3484: in your other sources once per lexer class, first renaming
                   3485: .B yyFlexLexer
                   3486: as follows:
                   3487: .nf
                   3488:
                   3489:     #undef yyFlexLexer
                   3490:     #define yyFlexLexer xxFlexLexer
1.5       deraadt  3491:     #include <g++/FlexLexer.h>
1.1       deraadt  3492:
                   3493:     #undef yyFlexLexer
                   3494:     #define yyFlexLexer zzFlexLexer
1.5       deraadt  3495:     #include <g++/FlexLexer.h>
1.1       deraadt  3496:
                   3497: .fi
                   3498: if, for example, you used
                   3499: .B %option prefix="xx"
                   3500: for one of your scanners and
                   3501: .B %option prefix="zz"
                   3502: for the other.
                   3503: .PP
                   3504: IMPORTANT: the present form of the scanning class is
                   3505: .I experimental
1.7       aaron    3506: and may change considerably between major releases.
1.1       deraadt  3507: .SH INCOMPATIBILITIES WITH LEX AND POSIX
                   3508: .I flex
                   3509: is a rewrite of the AT&T Unix
                   3510: .I lex
                   3511: tool (the two implementations do not share any code, though),
                   3512: with some extensions and incompatibilities, both of which
                   3513: are of concern to those who wish to write scanners acceptable
                   3514: to either implementation.  Flex is fully compliant with the POSIX
                   3515: .I lex
                   3516: specification, except that when using
                   3517: .B %pointer
                   3518: (the default), a call to
                   3519: .B unput()
                   3520: destroys the contents of
                   3521: .B yytext,
                   3522: which is counter to the POSIX specification.
                   3523: .PP
                   3524: In this section we discuss all of the known areas of incompatibility
                   3525: between flex, AT&T lex, and the POSIX specification.
                   3526: .PP
                   3527: .I flex's
                   3528: .B \-l
                   3529: option turns on maximum compatibility with the original AT&T
                   3530: .I lex
                   3531: implementation, at the cost of a major loss in the generated scanner's
                   3532: performance.  We note below which incompatibilities can be overcome
                   3533: using the
                   3534: .B \-l
                   3535: option.
                   3536: .PP
                   3537: .I flex
                   3538: is fully compatible with
                   3539: .I lex
                   3540: with the following exceptions:
                   3541: .IP -
                   3542: The undocumented
                   3543: .I lex
                   3544: scanner internal variable
                   3545: .B yylineno
                   3546: is not supported unless
                   3547: .B \-l
                   3548: or
                   3549: .B %option yylineno
                   3550: is used.
                   3551: .IP
                   3552: .B yylineno
                   3553: should be maintained on a per-buffer basis, rather than a per-scanner
                   3554: (single global variable) basis.
                   3555: .IP
                   3556: .B yylineno
                   3557: is not part of the POSIX specification.
                   3558: .IP -
                   3559: The
                   3560: .B input()
                   3561: routine is not redefinable, though it may be called to read characters
                   3562: following whatever has been matched by a rule.  If
                   3563: .B input()
                   3564: encounters an end-of-file the normal
                   3565: .B yywrap()
                   3566: processing is done.  A ``real'' end-of-file is returned by
                   3567: .B input()
                   3568: as
                   3569: .I EOF.
                   3570: .IP
                   3571: Input is instead controlled by defining the
                   3572: .B YY_INPUT
                   3573: macro.
                   3574: .IP
                   3575: The
                   3576: .I flex
                   3577: restriction that
                   3578: .B input()
                   3579: cannot be redefined is in accordance with the POSIX specification,
                   3580: which simply does not specify any way of controlling the
                   3581: scanner's input other than by making an initial assignment to
                   3582: .I yyin.
                   3583: .IP -
                   3584: The
                   3585: .B unput()
                   3586: routine is not redefinable.  This restriction is in accordance with POSIX.
                   3587: .IP -
                   3588: .I flex
                   3589: scanners are not as reentrant as
                   3590: .I lex
                   3591: scanners.  In particular, if you have an interactive scanner and
                   3592: an interrupt handler which long-jumps out of the scanner, and
                   3593: the scanner is subsequently called again, you may get the following
                   3594: message:
                   3595: .nf
                   3596:
                   3597:     fatal flex scanner internal error--end of buffer missed
                   3598:
                   3599: .fi
                   3600: To reenter the scanner, first use
                   3601: .nf
                   3602:
                   3603:     yyrestart( yyin );
                   3604:
                   3605: .fi
                   3606: Note that this call will throw away any buffered input; usually this
                   3607: isn't a problem with an interactive scanner.
                   3608: .IP
                   3609: Also note that flex C++ scanner classes
                   3610: .I are
                   3611: reentrant, so if using C++ is an option for you, you should use
                   3612: them instead.  See "Generating C++ Scanners" above for details.
                   3613: .IP -
                   3614: .B output()
                   3615: is not supported.
                   3616: Output from the
                   3617: .B ECHO
                   3618: macro is done to the file-pointer
                   3619: .I yyout
                   3620: (default
                   3621: .I stdout).
                   3622: .IP
                   3623: .B output()
                   3624: is not part of the POSIX specification.
                   3625: .IP -
                   3626: .I lex
                   3627: does not support exclusive start conditions (%x), though they
                   3628: are in the POSIX specification.
                   3629: .IP -
                   3630: When definitions are expanded,
                   3631: .I flex
                   3632: encloses them in parentheses.
                   3633: With lex, the following:
                   3634: .nf
                   3635:
                   3636:     NAME    [A-Z][A-Z0-9]*
                   3637:     %%
                   3638:     foo{NAME}?      printf( "Found it\\n" );
                   3639:     %%
                   3640:
                   3641: .fi
                   3642: will not match the string "foo" because when the macro
                   3643: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
                   3644: and the precedence is such that the '?' is associated with
                   3645: "[A-Z0-9]*".  With
                   3646: .I flex,
                   3647: the rule will be expanded to
                   3648: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
                   3649: .IP
                   3650: Note that if the definition begins with
                   3651: .B ^
                   3652: or ends with
                   3653: .B $
                   3654: then it is
                   3655: .I not
                   3656: expanded with parentheses, to allow these operators to appear in
                   3657: definitions without losing their special meanings.  But the
                   3658: .B <s>, /,
                   3659: and
                   3660: .B <<EOF>>
                   3661: operators cannot be used in a
                   3662: .I flex
                   3663: definition.
                   3664: .IP
                   3665: Using
                   3666: .B \-l
                   3667: results in the
                   3668: .I lex
                   3669: behavior of no parentheses around the definition.
                   3670: .IP
                   3671: The POSIX specification is that the definition be enclosed in parentheses.
                   3672: .IP -
                   3673: Some implementations of
                   3674: .I lex
                   3675: allow a rule's action to begin on a separate line, if the rule's pattern
                   3676: has trailing whitespace:
                   3677: .nf
                   3678:
                   3679:     %%
                   3680:     foo|bar<space here>
                   3681:       { foobar_action(); }
                   3682:
                   3683: .fi
                   3684: .I flex
                   3685: does not support this feature.
                   3686: .IP -
                   3687: The
                   3688: .I lex
                   3689: .B %r
                   3690: (generate a Ratfor scanner) option is not supported.  It is not part
                   3691: of the POSIX specification.
                   3692: .IP -
                   3693: After a call to
                   3694: .B unput(),
                   3695: .I yytext
                   3696: is undefined until the next token is matched, unless the scanner
                   3697: was built using
                   3698: .B %array.
                   3699: This is not the case with
                   3700: .I lex
                   3701: or the POSIX specification.  The
                   3702: .B \-l
                   3703: option does away with this incompatibility.
                   3704: .IP -
                   3705: The precedence of the
                   3706: .B {}
                   3707: (numeric range) operator is different.
                   3708: .I lex
                   3709: interprets "abc{1,3}" as "match one, two, or
                   3710: three occurrences of 'abc'", whereas
                   3711: .I flex
                   3712: interprets it as "match 'ab'
                   3713: followed by one, two, or three occurrences of 'c'".  The latter is
                   3714: in agreement with the POSIX specification.
                   3715: .IP -
                   3716: The precedence of the
                   3717: .B ^
                   3718: operator is different.
                   3719: .I lex
                   3720: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
                   3721: or 'bar' anywhere", whereas
                   3722: .I flex
                   3723: interprets it as "match either 'foo' or 'bar' if they come at the beginning
                   3724: of a line".  The latter is in agreement with the POSIX specification.
                   3725: .IP -
                   3726: The special table-size declarations such as
                   3727: .B %a
                   3728: supported by
                   3729: .I lex
                   3730: are not required by
                   3731: .I flex
                   3732: scanners;
                   3733: .I flex
                   3734: ignores them.
                   3735: .IP -
                   3736: The name
                   3737: .B
                   3738: FLEX_SCANNER
                   3739: is #define'd so scanners may be written for use with either
                   3740: .I flex
                   3741: or
                   3742: .I lex.
                   3743: Scanners also include
                   3744: .B YY_FLEX_MAJOR_VERSION
                   3745: and
                   3746: .B YY_FLEX_MINOR_VERSION
                   3747: indicating which version of
                   3748: .I flex
                   3749: generated the scanner
                   3750: (for example, for the 2.5 release, these defines would be 2 and 5
                   3751: respectively).
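.PP
The grouping differences described in the list above (definition
expansion and the numeric range operator) can be illustrated outside of
any scanner generator.
The following is a minimal sketch, not flex or lex code: it assumes the
C++ standard library's std::regex in its default syntax, and it spells
each interpretation with explicit parentheses so that the grouping is
unambiguous.
The names whole_match and the four pattern constants are hypothetical
and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern` (std::regex default syntax).
bool whole_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical constants: each reading is written with explicit grouping.

// "foo{NAME}?" as described for lex: the '?' applies only to "[A-Z0-9]*".
const std::string kLexStyleOptional = "foo[A-Z]([A-Z0-9]*)?";

// The same rule as described for flex: the definition is parenthesized
// first, so the '?' applies to the whole of NAME.
const std::string kFlexStyleOptional = "foo([A-Z][A-Z0-9]*)?";

// "abc{1,3}" as described for lex: one to three occurrences of "abc".
const std::string kLexStyleRange = "(abc){1,3}";

// "abc{1,3}" as described for flex: "ab", then one to three 'c's.
const std::string kFlexStyleRange = "ab(c){1,3}";
```

Under these assumptions, whole_match(kFlexStyleOptional, "foo") is true
while whole_match(kLexStyleOptional, "foo") is false, and "abcabc" is
accepted only by kLexStyleRange while "abccc" is accepted only by
kFlexStyleRange.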
                   3752: .PP
                   3753: The following
                   3754: .I flex
                   3755: features are not included in
                   3756: .I lex
                   3757: or the POSIX specification:
                   3758: .nf
                   3759:
                   3760:     C++ scanners
                   3761:     %option
                   3762:     start condition scopes
                   3763:     start condition stacks
                   3764:     interactive/non-interactive scanners
                   3765:     yy_scan_string() and friends
                   3766:     yyterminate()
                   3767:     yy_set_interactive()
                   3768:     yy_set_bol()
                   3769:     YY_AT_BOL()
                   3770:     <<EOF>>
                   3771:     <*>
                   3772:     YY_DECL
                   3773:     YY_START
                   3774:     YY_USER_ACTION
                   3775:     YY_USER_INIT
                   3776:     #line directives
                   3777:     %{}'s around actions
                   3778:     multiple actions on a line
                   3779:
                   3780: .fi
                   3781: plus almost all of the flex flags.
                   3782: The last feature in the list refers to the fact that with
                   3783: .I flex
                   3784: you can put multiple actions on the same line, separated with
                   3785: semi-colons, while with
                   3786: .I lex,
                   3787: the following
                   3788: .nf
                   3789:
                   3790:     foo    handle_foo(); ++num_foos_seen;
                   3791:
                   3792: .fi
                   3793: is (rather surprisingly) truncated to
                   3794: .nf
                   3795:
                   3796:     foo    handle_foo();
                   3797:
                   3798: .fi
                   3799: .I flex
                   3800: does not truncate the action.  Actions that are not enclosed in
                   3801: braces are simply terminated at the end of the line.
                   3802: .SH DIAGNOSTICS
                   3803: .PP
                   3804: .I warning, rule cannot be matched
                   3805: indicates that the given rule
                   3806: cannot be matched because it follows other rules that will
                   3807: always match the same text as it.  For
                   3808: example, in the following "foo" cannot be matched because it comes after
                   3809: an identifier "catch-all" rule:
                   3810: .nf
                   3811:
                   3812:     [a-z]+    got_identifier();
                   3813:     foo       got_foo();
                   3814:
                   3815: .fi
                   3816: Using
                   3817: .B REJECT
                   3818: in a scanner suppresses this warning.
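.PP
The warning follows from how a match is chosen: the longest match wins, and
when two rules match the same amount of text the rule listed first is used.
The C sketch below models only that selection step and is not how
.I flex
itself is implemented.
The helper names (match_identifier, match_foo, pick_rule) are hypothetical,
and the range test assumes an ASCII-style character set where 'a' through
'z' are contiguous.

```c
#include <stddef.h>
#include <string.h>

/* Length of the prefix of s matched by [a-z]+ (0 if none). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Length of the prefix of s matched by the literal "foo" (0 if none). */
static size_t match_foo(const char *s)
{
    return strncmp(s, "foo", 3) == 0 ? 3 : 0;
}

typedef size_t (*matcher)(const char *);

/*
 * Return the index of the winning rule, or -1 if no rule matches.
 * Only a strictly longer match replaces the current best, so on a
 * tie the earlier rule is kept.
 */
static int pick_rule(const matcher *rules, int nrules, const char *input)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < nrules; i++) {
        size_t len = rules[i](input);

        if (len > best_len) {
            best = i;
            best_len = len;
        }
    }
    return best;
}
```

With the rules ordered as in the example above (identifier first), the input
"foo" is always claimed by the identifier rule.
Listing the "foo" rule first lets it win the tie, while a longer word such as
"food" still goes to the identifier rule.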
                   3819: .PP
                   3820: .I warning,
                   3821: .B \-s
                   3822: .I
                   3823: option given but default rule can be matched
                   3824: means that it is possible (perhaps only in a particular start condition)
                   3825: that the default rule (match any single character) is the only one
                   3826: that will match a particular input.  Since
                   3827: .B \-s
                   3828: was given, presumably this is not intended.
                   3829: .PP
                   3830: .I reject_used_but_not_detected undefined
                   3831: or
                   3832: .I yymore_used_but_not_detected undefined -
                   3833: These errors can occur at compile time.  They indicate that the
                   3834: scanner uses
                   3835: .B REJECT
                   3836: or
                   3837: .B yymore()
                   3838: but that
                   3839: .I flex
                   3840: failed to notice the fact, meaning that
                   3841: .I flex
                   3842: scanned the first two sections looking for occurrences of these actions
1.10      deraadt  3843: and failed to find any, but somehow you snuck some in (via an #include
1.1       deraadt  3844: file, for example).  Use
                   3845: .B %option reject
                   3846: or
                   3847: .B %option yymore
                   3848: to indicate to flex that you really do use these features.
                   3849: .PP
                   3850: .I flex scanner jammed -
                   3851: a scanner compiled with
                   3852: .B \-s
                   3853: has encountered an input string which wasn't matched by
                   3854: any of its rules.  This error can also occur due to internal problems.
                   3855: .PP
                   3856: .I token too large, exceeds YYLMAX -
                   3857: your scanner uses
                   3858: .B %array
                   3859: and one of its rules matched a string longer than the
                   3860: .B YYLMAX
                   3861: constant (8K bytes by default).  You can increase the value by
                   3862: #define'ing
                   3863: .B YYLMAX
                   3864: in the definitions section of your
                   3865: .I flex
                   3866: input.
                   3867: .PP
                   3868: .I scanner requires \-8 flag to
                   3869: .I use the character 'x' -
                   3870: Your scanner specification includes recognizing the 8-bit character
                   3871: .I 'x'
                   3872: and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
                   3873: because you used the
                   3874: .B \-Cf
                   3875: or
                   3876: .B \-CF
                   3877: table compression options.  See the discussion of the
                   3878: .B \-7
                   3879: flag for details.
                   3880: .PP
                   3881: .I flex scanner push-back overflow -
                   3882: you used
                   3883: .B unput()
                   3884: to push back so much text that the scanner's buffer could not hold
                   3885: both the pushed-back text and the current token in
                   3886: .B yytext.
                   3887: Ideally the scanner should dynamically resize the buffer in this case, but at
                   3888: present it does not.
                   3889: .PP
                   3890: .I
                   3891: input buffer overflow, can't enlarge buffer because scanner uses REJECT -
                   3892: the scanner was working on matching an extremely large token and needed
                   3893: to expand the input buffer.  This doesn't work with scanners that use
                   3894: .B
                   3895: REJECT.
                   3896: .PP
                   3897: .I
                   3898: fatal flex scanner internal error--end of buffer missed -
                   3899: This can occur in a scanner which is reentered after a long-jump
                   3900: has jumped out (or over) the scanner's activation frame.  Before
                   3901: reentering the scanner, use:
                   3902: .nf
                   3903:
                   3904:     yyrestart( yyin );
                   3905:
                   3906: .fi
                   3907: or, as noted above, switch to using the C++ scanner class.
                   3908: .PP
                   3909: .I too many start conditions in <> construct! -
                   3910: you listed more start conditions in a <> construct than exist (so
                   3911: you must have listed at least one of them twice).
                   3912: .SH FILES
                   3913: .TP
                   3914: .B \-lfl
                   3915: library with which scanners must be linked.
                   3916: .TP
                   3917: .I lex.yy.c
                   3918: generated scanner (called
                   3919: .I lexyy.c
                   3920: on some systems).
                   3921: .TP
                   3922: .I lex.yy.cc
                   3923: generated C++ scanner class, when using
                   3924: .B -+.
                   3925: .TP
1.5       deraadt  3926: .I <g++/FlexLexer.h>
1.1       deraadt  3927: header file defining the C++ scanner base class,
                   3928: .B FlexLexer,
                   3929: and its derived class,
                   3930: .B yyFlexLexer.
                   3931: .TP
                   3932: .I flex.skl
                   3933: skeleton scanner.  This file is only used when building flex, not when
                   3934: flex executes.
                   3935: .TP
                   3936: .I lex.backup
                   3937: backing-up information for
                   3938: .B \-b
                   3939: flag (called
                   3940: .I lex.bck
                   3941: on some systems).
                   3942: .SH DEFICIENCIES / BUGS
                   3943: .PP
                   3944: Some trailing context
                   3945: patterns cannot be properly matched and generate
                   3946: warning messages ("dangerous trailing context").  These are
                   3947: patterns where the ending of the
                   3948: first part of the rule matches the beginning of the second
                   3949: part, such as "zx*/xy*", where the 'x*' matches the 'x' at
                   3950: the beginning of the trailing context.  (Note that the POSIX draft
                   3951: states that the text matched by such patterns is undefined.)
                   3952: .PP
                   3953: For some trailing context rules, parts which are actually fixed-length are
1.3       deraadt  3954: not recognized as such, leading to the above-mentioned performance loss.
1.1       deraadt  3955: In particular, parts using '|' or {n} (such as "foo{3}") are always
                   3956: considered variable-length.
                   3957: .PP
                   3958: Combining trailing context with the special '|' action can result in
                   3959: .I fixed
                   3960: trailing context being turned into the more expensive
                   3961: .I variable
                   3962: trailing context.  This happens, for example, in the following:
                   3963: .nf
                   3964:
                   3965:     %%
                   3966:     abc      |
                   3967:     xyz/def
                   3968:
                   3969: .fi
                   3970: .PP
                   3971: Use of
                   3972: .B unput()
                   3973: invalidates yytext and yyleng, unless the
                   3974: .B %array
                   3975: directive
                   3976: or the
                   3977: .B \-l
                   3978: option has been used.
                   3979: .PP
                   3980: Pattern-matching of NUL's is substantially slower than matching other
                   3981: characters.
                   3982: .PP
                   3983: Dynamic resizing of the input buffer is slow, as it entails rescanning
                   3984: all the text matched so far by the current (generally huge) token.
                   3985: .PP
                   3986: Due to both buffering of input and read-ahead, you cannot intermix
                   3987: calls to <stdio.h> routines, such as
                   3988: .B getchar(),
                   3989: with
                   3990: .I flex
                   3991: rules and expect it to work.  Call
                   3992: .B input()
                   3993: instead.
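.PP
The effect of read-ahead can be seen without a scanner at all.
The C sketch below is only an illustration under stated assumptions:
scanner_fill() is a hypothetical stand-in for a scanner refilling its private
buffer, and char_after_readahead() is a hypothetical helper that uses a
temporary file in place of the real input.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for a scanner's buffer refill: pull up to
 * cap bytes from in into the scanner's private buffer in one call.
 */
static size_t scanner_fill(FILE *in, char *buf, size_t cap)
{
    return fread(buf, 1, cap, in);
}

/*
 * Write text to a temporary file, let the stand-in scanner read ahead
 * readahead bytes, then ask stdio for the next character.
 * Returns that character (or EOF), or -2 if the file cannot be set up.
 */
static int char_after_readahead(const char *text, size_t readahead)
{
    char buf[64];
    FILE *in = tmpfile();
    int c;

    if (in == NULL)
        return -2;
    if (readahead > sizeof buf)
        readahead = sizeof buf;
    if (fputs(text, in) == EOF) {
        fclose(in);
        return -2;
    }
    rewind(in);

    scanner_fill(in, buf, readahead);  /* the scanner buffers ahead */
    c = getc(in);                      /* stdio resumes after the buffer */
    fclose(in);
    return c;
}
```

For the input "abcdef" and a read-ahead of 4 bytes, the direct stdio call
returns 'e', even if the scanner has so far matched only "a"; the bytes in
between sit in the scanner's buffer and are reachable only through the scanner.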
                   3994: .PP
                   3995: The total number of table entries listed by the
                   3996: .B \-v
                   3997: flag excludes the number of table entries needed to determine
                   3998: what rule has been matched.  The number of entries is equal
                   3999: to the number of DFA states if the scanner does not use
                   4000: .B REJECT,
                   4001: and somewhat greater than the number of states if it does.
                   4002: .PP
                   4003: .B REJECT
                   4004: cannot be used with the
                   4005: .B \-f
                   4006: or
                   4007: .B \-F
                   4008: options.
                   4009: .PP
                   4010: The
                   4011: .I flex
                   4012: internal algorithms need documentation.
                   4013: .SH SEE ALSO
                   4014: .PP
                   4015: lex(1), yacc(1), sed(1), awk(1).
                   4016: .PP
                   4017: John Levine, Tony Mason, and Doug Brown,
                   4018: .I Lex & Yacc,
                   4019: O'Reilly and Associates.  Be sure to get the 2nd edition.
                   4020: .PP
                   4021: M. E. Lesk and E. Schmidt,
                   4022: .I LEX \- Lexical Analyzer Generator
                   4023: .PP
                   4024: Alfred Aho, Ravi Sethi and Jeffrey Ullman,
                   4025: .I Compilers: Principles, Techniques and Tools,
                   4026: Addison-Wesley (1986).  Describes the pattern-matching techniques used by
                   4027: .I flex
                   4028: (deterministic finite automata).
                   4029: .SH AUTHOR
                   4030: Vern Paxson, with the help of many ideas and much inspiration from
                   4031: Van Jacobson.  Original version by Jef Poskanzer.  The fast table
                   4032: representation is a partial implementation of a design done by Van
                   4033: Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
                   4034: .PP
                   4035: Thanks to the many
                   4036: .I flex
                   4037: beta-testers, feedbackers, and contributors, especially Francois Pinard,
                   4038: Casey Leedom,
                   4039: Robert Abramovitz,
                   4040: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
                   4041: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
                   4042: Karl Berry, Peter A. Bigot, Simon Blanchard,
                   4043: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
                   4044: Brian Clapper, J.T. Conklin,
                   4045: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11      deraadt  4046: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1       deraadt  4047: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
                   4048: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
                   4049: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
                   4050: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
                   4051: Jan Hajic, Charles Hemphill, NORO Hideo,
                   4052: Jarkko Hietaniemi, Scott Hofmann,
                   4053: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
                   4054: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
                   4055: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
                   4056: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
                   4057: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
                   4058: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
                   4059: David Loffredo, Mike Long,
                   4060: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
                   4061: Bengt Martensson, Chris Metcalf,
                   4062: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
                   4063: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
                   4064: Richard Ohnemus, Karsten Pahnke,
                   4065: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
                   4066: Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
                   4067: Frederic Raimbault, Pat Rankin, Rick Richardson,
                   4068: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
                   4069: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
                   4070: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
                   4071: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strömquist,
                   4072: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
                   4073: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
                   4074: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
                   4075: Yap, Ron Zellar, Nathan Zelle, David Zuhn,
                   4076: and those whose names have slipped my marginal
                   4077: mail-archiving skills but whose contributions are appreciated all the
                   4078: same.
                   4079: .PP
                   4080: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
                   4081: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
                   4082: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
                   4083: distribution headaches.
                   4084: .PP
                   4085: Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
                   4086: Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
                   4087: Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
                   4088: Eric Hughes for support of multiple buffers.
                   4089: .PP
                   4090: This work was primarily done when I was with the Real Time Systems Group
                   4091: at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks to all there
                   4092: for the support I received.
                   4093: .PP
                   4094: Send comments to vern@ee.lbl.gov.