src/usr.bin/lex/flex.1 - annotate

Annotation of src/usr.bin/lex/flex.1, Revision 1.12

1.12    ! jmc         1: .\"    $OpenBSD: flex.1,v 1.11 2003/01/04 22:36:13 deraadt Exp $
        !             2: .\"
        !             3: .\" Copyright (c) 1990 The Regents of the University of California.
        !             4: .\" All rights reserved.
1.2       deraadt     5: .\"
1.12    ! jmc         6: .\" This code is derived from software contributed to Berkeley by
        !             7: .\" Vern Paxson.
        !             8: .\"
        !             9: .\" The United States Government has rights in this work pursuant
        !            10: .\" to contract no. DE-AC03-76SF00098 between the United States
        !            11: .\" Department of Energy and the University of California.
        !            12: .\"
        !            13: .\" Redistribution and use in source and binary forms, with or without
        !            14: .\" modification, are permitted provided that: (1) source distributions
        !            15: .\" retain this entire copyright notice and comment, and (2) distributions
        !            16: .\" including binaries display the following acknowledgement:  ``This product
        !            17: .\" includes software developed by the University of California, Berkeley
        !            18: .\" and its contributors'' in the documentation or other materials provided
        !            19: .\" with the distribution and in all advertising materials mentioning
        !            20: .\" features or use of this software. Neither the name of the University nor
        !            21: .\" the names of its contributors may be used to endorse or promote products
        !            22: .\" derived from this software without specific prior written permission.
        !            23: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
        !            24: .\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
        !            25: .\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
        !            26: .\"
1.1       deraadt    27: .TH FLEX 1 "April 1995" "Version 2.5"
                     28: .SH NAME
                     29: flex \- fast lexical analyzer generator
                     30: .SH SYNOPSIS
                     31: .B flex
                     32: .B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
                     33: .B [\-\-help \-\-version]
                     34: .I [filename ...]
                     35: .SH OVERVIEW
                     36: This manual describes
                     37: .I flex,
                     38: a tool for generating programs that perform pattern-matching on text.  The
                     39: manual includes both tutorial and reference sections:
                     40: .nf
                     41:
                     42:     Description
                     43:         a brief overview of the tool
                     44:
                     45:     Some Simple Examples
                     46:
                     47:     Format Of The Input File
                     48:
                     49:     Patterns
                     50:         the extended regular expressions used by flex
                     51:
                     52:     How The Input Is Matched
                     53:         the rules for determining what has been matched
                     54:
                     55:     Actions
                     56:         how to specify what to do when a pattern is matched
                     57:
                     58:     The Generated Scanner
                     59:         details regarding the scanner that flex produces;
                     60:         how to control the input source
                     61:
                     62:     Start Conditions
                     63:         introducing context into your scanners, and
                     64:         managing "mini-scanners"
                     65:
                     66:     Multiple Input Buffers
                     67:         how to manipulate multiple input sources; how to
                     68:         scan from strings instead of files
                     69:
                     70:     End-of-file Rules
                     71:         special rules for matching the end of the input
                     72:
                     73:     Miscellaneous Macros
                     74:         a summary of macros available to the actions
                     75:
                     76:     Values Available To The User
                     77:         a summary of values available to the actions
                     78:
                     79:     Interfacing With Yacc
                     80:         connecting flex scanners together with yacc parsers
                     81:
                     82:     Options
                     83:         flex command-line options, and the "%option"
                     84:         directive
                     85:
                     86:     Performance Considerations
                     87:         how to make your scanner go as fast as possible
                     88:
                     89:     Generating C++ Scanners
                     90:         the (experimental) facility for generating C++
                     91:         scanner classes
                     92:
                     93:     Incompatibilities With Lex And POSIX
                     94:         how flex differs from AT&T lex and the POSIX lex
                     95:         standard
                     96:
                     97:     Diagnostics
                     98:         those error messages produced by flex (or scanners
                     99:         it generates) whose meanings might not be apparent
                    100:
                    101:     Files
                    102:         files used by flex
                    103:
                    104:     Deficiencies / Bugs
                    105:         known problems with flex
                    106:
                    107:     See Also
                    108:         other documentation, related tools
                    109:
                    110:     Author
                    111:         includes contact information
                    112:
                    113: .fi
                    114: .SH DESCRIPTION
                    115: .I flex
                    116: is a tool for generating
                    117: .I scanners:
1.9       millert   118: programs which recognize lexical patterns in text.
1.1       deraadt   119: .I flex
                    120: reads
                    121: the given input files, or its standard input if no file names are given,
                    122: for a description of a scanner to generate.  The description is in
                    123: the form of pairs
                    124: of regular expressions and C code, called
                    125: .I rules.  flex
                    126: generates as output a C source file,
                    127: .B lex.yy.c,
                    128: which defines a routine
                    129: .B yylex().
                    130: This file is compiled and linked with the
                    131: .B \-lfl
                    132: library to produce an executable.  When the executable is run,
                    133: it analyzes its input for occurrences
                    134: of the regular expressions.  Whenever it finds one, it executes
                    135: the corresponding C code.
                    136: .SH SOME SIMPLE EXAMPLES
                    137: .PP
                    138: First some simple examples to get the flavor of how one uses
                    139: .I flex.
                    140: The following
                    141: .I flex
                    142: input specifies a scanner which, whenever it encounters the string
                    143: "username", replaces it with the user's login name:
                    144: .nf
                    145:
                    146:     %%
                    147:     username    printf( "%s", getlogin() );
                    148:
                    149: .fi
                    150: By default, any text not matched by a
                    151: .I flex
                    152: scanner
                    153: is copied to the output, so the net effect of this scanner is
                    154: to copy its input file to its output with each occurrence
                    155: of "username" expanded.
                    156: In this input, there is just one rule.  "username" is the
                    157: .I pattern
                    158: and the "printf" is the
                    159: .I action.
                    160: The "%%" marks the beginning of the rules.
                    161: .PP
                    162: Here's another simple example:
                    163: .nf
                    164:
                    165:             int num_lines = 0, num_chars = 0;
                    166:
                    167:     %%
                    168:     \\n      ++num_lines; ++num_chars;
                    169:     .       ++num_chars;
                    170:
                    171:     %%
                    172:     int main()
                    173:             {
                    174:             yylex();
                    175:             printf( "# of lines = %d, # of chars = %d\\n",
                    176:                     num_lines, num_chars );
                    177:             }
                    178:
                    179: .fi
                    180: This scanner counts the number of characters and the number
                    181: of lines in its input (it produces no output other than the
                    182: final report on the counts).  The first line
                    183: declares two globals, "num_lines" and "num_chars", which are accessible
                    184: both inside
                    185: .B yylex()
                    186: and in the
                    187: .B main()
                    188: routine declared after the second "%%".  There are two rules, one
                    189: which matches a newline ("\\n") and increments both the line count and
                    190: the character count, and one which matches any character other than
                    191: a newline (indicated by the "." regular expression).
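The effect of the two rules can be sketched in plain C without flex. This is only an illustration of what the actions compute, not the code flex generates; the struct counts type and the count_text() function are hypothetical names introduced here.

```c
#include <stddef.h>

/* Totals kept by the two rules; field names mirror the flex example. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Hand-written stand-in for the rules "\n" and ".": every byte is matched
 * by exactly one of them, so every byte bumps num_chars, and a newline
 * additionally bumps num_lines. */
struct counts count_text(const char *text, size_t len)
{
    struct counts c = { 0, 0 };

    for (size_t i = 0; i < len; i++) {
        if (text[i] == '\n')
            ++c.num_lines;
        ++c.num_chars;
    }
    return c;
}
```

For the six-byte input "ab\ncd\n", count_text() reports 2 lines and 6 characters, which is what the scanner above would print for the same input.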
                    192: .PP
                    193: A somewhat more complicated example:
                    194: .nf
                    195:
                    196:     /* scanner for a toy Pascal-like language */
                    197:
                    198:     %{
                    199:     /* need this for the calls to atoi() and atof() below */
                    200:     #include <stdlib.h>
                    201:     %}
                    202:
                    203:     DIGIT    [0-9]
                    204:     ID       [a-z][a-z0-9]*
                    205:
                    206:     %%
                    207:
                    208:     {DIGIT}+    {
                    209:                 printf( "An integer: %s (%d)\\n", yytext,
                    210:                         atoi( yytext ) );
                    211:                 }
                    212:
                    213:     {DIGIT}+"."{DIGIT}*        {
                    214:                 printf( "A float: %s (%g)\\n", yytext,
                    215:                         atof( yytext ) );
                    216:                 }
                    217:
                    218:     if|then|begin|end|procedure|function        {
                    219:                 printf( "A keyword: %s\\n", yytext );
                    220:                 }
                    221:
                    222:     {ID}        printf( "An identifier: %s\\n", yytext );
                    223:
                    224:     "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
                    225:
                    226:     "{"[^}\\n]*"}"     /* eat up one-line comments */
                    227:
                    228:     [ \\t\\n]+          /* eat up whitespace */
                    229:
                    230:     .           printf( "Unrecognized character: %s\\n", yytext );
                    231:
                    232:     %%
                    233:
                    234:     int main( argc, argv )
                    235:     int argc;
                    236:     char **argv;
                    237:         {
                    238:         ++argv, --argc;  /* skip over program name */
                    239:         if ( argc > 0 )
                    240:                 yyin = fopen( argv[0], "r" );
                    241:         else
                    242:                 yyin = stdin;
1.7       aaron     243:
1.1       deraadt   244:         yylex();
                    245:         }
                    246:
                    247: .fi
                    248: This is the beginning of a simple scanner for a language like
                    249: Pascal.  It identifies different types of
                    250: .I tokens
                    251: and reports on what it has seen.
                    252: .PP
                    253: The details of this example will be explained in the following
                    254: sections.
                    255: .SH FORMAT OF THE INPUT FILE
                    256: The
                    257: .I flex
                    258: input file consists of three sections, separated by a line with just
                    259: .B %%
                    260: in it:
                    261: .nf
                    262:
                    263:     definitions
                    264:     %%
                    265:     rules
                    266:     %%
                    267:     user code
                    268:
                    269: .fi
                    270: The
                    271: .I definitions
                    272: section contains declarations of simple
                    273: .I name
                    274: definitions to simplify the scanner specification, and declarations of
                    275: .I start conditions,
                    276: which are explained in a later section.
                    277: .PP
                    278: Name definitions have the form:
                    279: .nf
                    280:
                    281:     name definition
                    282:
                    283: .fi
                    284: The "name" is a word beginning with a letter or an underscore ('_')
                    285: followed by zero or more letters, digits, '_', or '-' (dash).
1.8       aaron     286: The definition is taken to begin at the first non-whitespace character
1.1       deraadt   287: following the name and to continue to the end of the line.
                    288: The definition can subsequently be referred to using "{name}", which
                    289: will expand to "(definition)".  For example,
                    290: .nf
                    291:
                    292:     DIGIT    [0-9]
                    293:     ID       [a-z][a-z0-9]*
                    294:
                    295: .fi
                    296: defines "DIGIT" to be a regular expression which matches a
                    297: single digit, and
                    298: "ID" to be a regular expression which matches a letter
                    299: followed by zero-or-more letters-or-digits.
                    300: A subsequent reference to
                    301: .nf
                    302:
                    303:     {DIGIT}+"."{DIGIT}*
                    304:
                    305: .fi
                    306: is identical to
                    307: .nf
                    308:
                    309:     ([0-9])+"."([0-9])*
                    310:
                    311: .fi
                    312: and matches one-or-more digits followed by a '.' followed
                    313: by zero-or-more digits.
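The textual expansion of "{name}" into "(definition)" can be sketched as a small string substitution in C. This is a simplified illustration, not flex's own implementation: the expand_name() function is a hypothetical helper, it handles a single definition, and unlike flex it does not skip references that appear inside quoted strings or character classes.

```c
#include <stdio.h>
#include <string.h>

/* Replace every "{name}" in `pattern` with "(definition)", writing the
 * result to `out`.  Returns 0 on success, -1 if `out` is too small. */
int expand_name(const char *pattern, const char *name,
                const char *definition, char *out, size_t out_size)
{
    size_t name_len = strlen(name);
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*pattern != '\0') {
        int n;

        /* A reference has the shape '{' name '}'. */
        if (pattern[0] == '{' &&
            strncmp(pattern + 1, name, name_len) == 0 &&
            pattern[1 + name_len] == '}') {
            n = snprintf(out + used, out_size - used, "(%s)", definition);
            pattern += name_len + 2;
        } else {
            n = snprintf(out + used, out_size - used, "%c", *pattern);
            pattern += 1;
        }
        /* snprintf reports the length it wanted; reject truncation. */
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

Expanding the pattern {DIGIT}+"."{DIGIT}* with DIGIT defined as [0-9] yields ([0-9])+"."([0-9])*, the same text shown above. The parentheses are what keep the '+' and '*' operators applied to the whole definition.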
                    314: .PP
                    315: The
                    316: .I rules
                    317: section of the
                    318: .I flex
                    319: input contains a series of rules of the form:
                    320: .nf
                    321:
                    322:     pattern   action
                    323:
                    324: .fi
                    325: where the pattern must be unindented and the action must begin
                    326: on the same line.
                    327: .PP
                    328: See below for a further description of patterns and actions.
                    329: .PP
                    330: Finally, the user code section is simply copied to
                    331: .B lex.yy.c
                    332: verbatim.
                    333: It is used for companion routines which call or are called
                    334: by the scanner.  The presence of this section is optional;
                    335: if it is missing, the second
                    336: .B %%
                    337: in the input file may be skipped, too.
                    338: .PP
                    339: In the definitions and rules sections, any
                    340: .I indented
                    341: text or text enclosed in
                    342: .B %{
                    343: and
                    344: .B %}
                    345: is copied verbatim to the output (with the %{}'s removed).
                    346: The %{}'s must appear unindented on lines by themselves.
                    347: .PP
                    348: In the rules section,
                    349: any indented or %{} text appearing before the
                    350: first rule may be used to declare variables
                    351: which are local to the scanning routine and (after the declarations)
                    352: code which is to be executed whenever the scanning routine is entered.
                    353: Other indented or %{} text in the rule section is still copied to the output,
                    354: but its meaning is not well-defined and it may well cause compile-time
                    355: errors (this feature is present for
                    356: .I POSIX
                    357: compliance; see below for other such features).
                    358: .PP
                    359: In the definitions section (but not in the rules section),
                    360: an unindented comment (i.e., a line
                    361: beginning with "/*") is also copied verbatim to the output up
                    362: to the next "*/".
                    363: .SH PATTERNS
                    364: The patterns in the input are written using an extended set of regular
                    365: expressions.  These are:
                    366: .nf
                    367:
                    368:     x          match the character 'x'
                    369:     .          any character (byte) except newline
                    370:     [xyz]      a "character class"; in this case, the pattern
                    371:                  matches either an 'x', a 'y', or a 'z'
                    372:     [abj-oZ]   a "character class" with a range in it; matches
                    373:                  an 'a', a 'b', any letter from 'j' through 'o',
                    374:                  or a 'Z'
                    375:     [^A-Z]     a "negated character class", i.e., any character
                    376:                  but those in the class.  In this case, any
                    377:                  character EXCEPT an uppercase letter.
                    378:     [^A-Z\\n]   any character EXCEPT an uppercase letter or
                    379:                  a newline
                    380:     r*         zero or more r's, where r is any regular expression
                    381:     r+         one or more r's
                    382:     r?         zero or one r's (that is, "an optional r")
                    383:     r{2,5}     anywhere from two to five r's
                    384:     r{2,}      two or more r's
                    385:     r{4}       exactly 4 r's
                    386:     {name}     the expansion of the "name" definition
                    387:                (see above)
                    388:     "[xyz]\\"foo"
                    389:                the literal string: [xyz]"foo
                    390:     \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                    391:                  then the ANSI-C interpretation of \\x.
                    392:                  Otherwise, a literal 'X' (used to escape
                    393:                  operators such as '*')
                    394:     \\0         a NUL character (ASCII code 0)
                    395:     \\123       the character with octal value 123
                    396:     \\x2a       the character with hexadecimal value 2a
                    397:     (r)        match an r; parentheses are used to override
                    398:                  precedence (see below)
                    399:
                    400:
                    401:     rs         the regular expression r followed by the
                    402:                  regular expression s; called "concatenation"
                    403:
                    404:
                    405:     r|s        either an r or an s
                    406:
                    407:
                    408:     r/s        an r but only if it is followed by an s.  The
                    409:                  text matched by s is included when determining
                    410:                  whether this rule is the "longest match",
                    411:                  but is then returned to the input before
                    412:                  the action is executed.  So the action only
                    413:                  sees the text matched by r.  This type
                    414:                  of pattern is called "trailing context".
                    415:                  (There are some combinations of r/s that flex
                    416:                  cannot match correctly; see notes in the
                    417:                  Deficiencies / Bugs section below regarding
                    418:                  "dangerous trailing context".)
                    419:     ^r         an r, but only at the beginning of a line (i.e.,
1.10      deraadt   420:                  just starting to scan, or right after a
1.1       deraadt   421:                  newline has been scanned).
                    422:     r$         an r, but only at the end of a line (i.e., just
                    423:                  before a newline).  Equivalent to "r/\\n".
                    424:
                    425:                Note that flex's notion of "newline" is exactly
                    426:                whatever the C compiler used to compile flex
                    427:                interprets '\\n' as; in particular, on some DOS
                    428:                systems you must either filter out \\r's in the
                    429:                input yourself, or explicitly use r/\\r\\n for "r$".
                    430:
                    431:
                    432:     <s>r       an r, but only in start condition s (see
                    433:                  below for discussion of start conditions)
                    434:     <s1,s2,s3>r
                    435:                same, but in any of start conditions s1,
                    436:                  s2, or s3
                    437:     <*>r       an r in any start condition, even an exclusive one.
                    438:
                    439:
                    440:     <<EOF>>    an end-of-file
                    441:     <s1,s2><<EOF>>
                    442:                an end-of-file when in start condition s1 or s2
                    443:
                    444: .fi
                    445: Note that inside of a character class, all regular expression operators
                    446: lose their special meaning except escape ('\\') and the character class
                    447: operators, '-', ']', and, at the beginning of the class, '^'.
                    448: .PP
                    449: The regular expressions listed above are grouped according to
                    450: precedence, from highest precedence at the top to lowest at the bottom.
                    451: Those grouped together have equal precedence.  For example,
                    452: .nf
                    453:
                    454:     foo|bar*
                    455:
                    456: .fi
                    457: is the same as
                    458: .nf
                    459:
                    460:     (foo)|(ba(r*))
                    461:
                    462: .fi
                    463: since the '*' operator has higher precedence than concatenation,
                    464: and concatenation higher than alternation ('|').  This pattern
                    465: therefore matches
                    466: .I either
                    467: the string "foo"
                    468: .I or
                    469: the string "ba" followed by zero-or-more r's.
                    470: To match "foo" or zero-or-more "bar"'s, use:
                    471: .nf
                    472:
                    473:     foo|(bar)*
                    474:
                    475: .fi
                    476: and to match zero-or-more "foo"'s-or-"bar"'s:
                    477: .nf
                    478:
                    479:     (foo|bar)*
                    480:
                    481: .fi
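The three precedence examples can be checked with the POSIX regcomp() and regexec() functions, whose extended regular expressions give repetition, concatenation, and alternation the same relative precedence. This is a sketch under that assumption: it exercises the C library's regex engine rather than a flex-generated scanner, and matches_whole() is a hypothetical helper name.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the POSIX extended regex `pattern` matches all of `text`,
 * 0 if it does not, and -1 if the pattern is too long or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, "foo|bar*" accepts "barrr" but rejects "barbar", "foo|(bar)*" accepts "barbar" but rejects "barrr", and only "(foo|bar)*" accepts "foobarfoo".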
                    482: .PP
                    483: In addition to characters and ranges of characters, character classes
                    484: can also contain character class
                    485: .I expressions.
                    486: These are expressions enclosed inside
                    487: .B [:
                    488: and
                    489: .B :]
                    490: delimiters (which themselves must appear between the '[' and ']' of the
                    491: character class; other elements may occur inside the character class, too).
                    492: The valid expressions are:
                    493: .nf
                    494:
                    495:     [:alnum:] [:alpha:] [:blank:]
                    496:     [:cntrl:] [:digit:] [:graph:]
                    497:     [:lower:] [:print:] [:punct:]
                    498:     [:space:] [:upper:] [:xdigit:]
                    499:
                    500: .fi
                    501: These expressions all designate a set of characters equivalent to
                    502: the corresponding standard C
                    503: .B isXXX
                    504: function.  For example,
                    505: .B [:alnum:]
                    506: designates those characters for which
                    507: .B isalnum()
                    508: returns true - i.e., any alphabetic or numeric character.
                    509: Some systems don't provide
                    510: .B isblank(),
                    511: so flex defines
                    512: .B [:blank:]
                    513: as a blank or a tab.
                    514: .PP
                    515: For example, the following character classes are all equivalent:
                    516: .nf
                    517:
                    518:     [[:alnum:]]
1.4       deraadt   519:     [[:alpha:][:digit:]]
1.1       deraadt   520:     [[:alpha:]0-9]
                    521:     [a-zA-Z0-9]
                    522:
                    523: .fi
                    524: If your scanner is case-insensitive (the
                    525: .B \-i
                    526: flag), then
                    527: .B [:upper:]
                    528: and
                    529: .B [:lower:]
                    530: are equivalent to
                    531: .B [:alpha:].
                    532: .PP
                    533: Some notes on patterns:
                    534: .IP -
                    535: A negated character class such as the example "[^A-Z]"
                    536: above
                    537: .I will match a newline
                    538: unless "\\n" (or an equivalent escape sequence) is one of the
                    539: characters explicitly present in the negated character class
                    540: (e.g., "[^A-Z\\n]").  This is unlike how many other regular
                    541: expression tools treat negated character classes, but unfortunately
                    542: the inconsistency is historically entrenched.
                    543: Matching newlines means that a pattern like [^"]* can match the entire
                    544: input unless there's another quote in the input.
                    545: .IP -
                    546: A rule can have at most one instance of trailing context (the '/' operator
                    547: or the '$' operator).  The start condition, '^', and "<<EOF>>" patterns
                    548: can only occur at the beginning of a pattern and, like '/' and '$',
                    549: cannot be grouped inside parentheses.  A '^' which does not occur at
                    550: the beginning of a rule or a '$' which does not occur at the end of
                    551: a rule loses its special properties and is treated as a normal character.
                    552: .IP
                    553: The following are illegal:
                    554: .nf
                    555:
                    556:     foo/bar$
                    557:     <sc1>foo<sc2>bar
                    558:
                    559: .fi
                    560: Note that the first of these can be written "foo/bar\\n".
                    561: .IP
                    562: The following will result in '$' or '^' being treated as a normal character:
                    563: .nf
                    564:
                    565:     foo|(bar$)
                    566:     foo|^bar
                    567:
                    568: .fi
                    569: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
                    570: could be used (the special '|' action is explained below):
                    571: .nf
                    572:
                    573:     foo      |
                    574:     bar$     /* action goes here */
                    575:
                    576: .fi
                    577: A similar trick will work for matching a foo or a
                    578: bar-at-the-beginning-of-a-line.
                    579: .SH HOW THE INPUT IS MATCHED
                    580: When the generated scanner is run, it analyzes its input looking
                    581: for strings which match any of its patterns.  If it finds more than
                    582: one match, it takes the one matching the most text (for trailing
                    583: context rules, this includes the length of the trailing part, even
                    584: though it will then be returned to the input).  If it finds two
                    585: or more matches of the same length, the
                    586: rule listed first in the
                    587: .I flex
                    588: input file is chosen.
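The selection rule described above (longest match wins, ties go to the rule listed first) can be sketched in C as follows. This is only a model of the decision, not how the generated scanner is implemented: flex compiles all patterns into a single automaton rather than trying rules one by one. The pick_rule() function and its convention that a length of 0 means "no match" are assumptions made for this illustration.

```c
#include <stddef.h>

/* match_len[i] is the number of characters rule i could match at the
 * current input position, with 0 meaning "no match".  Rules are indexed
 * in the order they appear in the input file.  Returns the index of the
 * winning rule, or -1 if no rule matches (the default rule applies). */
int pick_rule(const size_t match_len[], int num_rules)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < num_rules; i++) {
        /* Strictly greater: a later rule of equal length never displaces
         * an earlier one, so the first rule listed wins ties. */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

In the toy Pascal scanner, the input "if" is matched with length 2 by both the keyword rule and the {ID} rule, so the keyword rule (listed first) is chosen. For the input "ifdef", the {ID} rule matches 5 characters against the keyword rule's 2 and wins.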
                    589: .PP
                    590: Once the match is determined, the text corresponding to the match
                    591: (called the
                    592: .I token)
                    593: is made available in the global character pointer
                    594: .B yytext,
                    595: and its length in the global integer
                    596: .B yyleng.
                    597: The
                    598: .I action
                    599: corresponding to the matched pattern is then executed (a more
                    600: detailed description of actions follows), and then the remaining
                    601: input is scanned for another match.
                    602: .PP
                    603: If no match is found, then the
                    604: .I default rule
                    605: is executed: the next character in the input is considered matched and
                    606: copied to the standard output.  Thus, the simplest legal
                    607: .I flex
                    608: input is:
                    609: .nf
                    610:
                    611:     %%
                    612:
                    613: .fi
                    614: which generates a scanner that simply copies its input (one character
                    615: at a time) to its output.
                    616: .PP
                    617: Note that
                    618: .B yytext
                    619: can be defined in two different ways: either as a character
                    620: .I pointer
                    621: or as a character
                    622: .I array.
                    623: You can control which definition
                    624: .I flex
                    625: uses by including one of the special directives
                    626: .B %pointer
                    627: or
                    628: .B %array
                    629: in the first (definitions) section of your flex input.  The default is
                    630: .B %pointer,
                    631: unless you use the
                    632: .B -l
                    633: lex compatibility option, in which case
                    634: .B yytext
                    635: will be an array.
                    636: The advantage of using
                    637: .B %pointer
                    638: is substantially faster scanning and no buffer overflow when matching
                    639: very large tokens (unless you run out of dynamic memory).  The disadvantage
                    640: is that you are restricted in how your actions can modify
                    641: .B yytext
                    642: (see the next section), and calls to the
                    643: .B unput()
1.10      deraadt   644: function destroy the present contents of
1.1       deraadt   645: .B yytext,
                    646: which can be a considerable porting headache when moving between different
                    647: .I lex
                    648: versions.
                    649: .PP
                    650: The advantage of
                    651: .B %array
                    652: is that you can then modify
                    653: .B yytext
                    654: to your heart's content, and calls to
                    655: .B unput()
                    656: do not destroy
                    657: .B yytext
                    658: (see below).  Furthermore, existing
                    659: .I lex
                    660: programs sometimes access
                    661: .B yytext
                    662: externally using declarations of the form:
                    663: .nf
                    664:     extern char yytext[];
                    665: .fi
                    666: This definition is erroneous when used with
                    667: .B %pointer,
                    668: but correct for
                    669: .B %array.
                    670: .PP
                    671: .B %array
                    672: defines
                    673: .B yytext
                    674: to be an array of
                    675: .B YYLMAX
                    676: characters, which defaults to a fairly large value.  You can change
                    677: the size by simply #define'ing
                    678: .B YYLMAX
                    679: to a different value in the first section of your
                    680: .I flex
                    681: input.  As mentioned above, with
                    682: .B %pointer
                    683: yytext grows dynamically to accommodate large tokens.  While this means your
                    684: .B %pointer
                    685: scanner can accommodate very large tokens (such as matching entire blocks
                    686: of comments), bear in mind that each time the scanner must resize
                    687: .B yytext
                    688: it also must rescan the entire token from the beginning, so matching such
                    689: tokens can prove slow.
                    690: .B yytext
                    691: presently does
                    692: .I not
                    693: dynamically grow if a call to
                    694: .B unput()
                    695: results in too much text being pushed back; instead, a run-time error results.
                    696: .PP
                    697: Also note that you cannot use
                    698: .B %array
                    699: with C++ scanner classes
                    700: (the
                    701: .B c++
                    702: option; see below).
                    703: .SH ACTIONS
                    704: Each pattern in a rule has a corresponding action, which can be an
                    705: arbitrary C statement.  The pattern ends at the first non-escaped
                    706: whitespace character; the remainder of the line is its action.  If the
                    707: action is empty, then when the pattern is matched the input token
                    708: is simply discarded.  For example, here is the specification for a program
                    709: which deletes all occurrences of "zap me" from its input:
                    710: .nf
                    711:
                    712:     %%
                    713:     "zap me"
                    714:
                    715: .fi
                    716: (It will copy all other characters in the input to the output since
                    717: they will be matched by the default rule.)
                    718: .PP
                    719: Here is a program which compresses multiple blanks and tabs down to
                    720: a single blank, and throws away whitespace found at the end of a line:
                    721: .nf
                    722:
                    723:     %%
                    724:     [ \\t]+        putchar( ' ' );
                    725:     [ \\t]+$       /* ignore this token */
                    726:
                    727: .fi
                    728: .PP
                    729: If the action contains a '{', then the action spans till the balancing '}'
                    730: is found, and the action may cross multiple lines.
1.7       aaron     731: .I flex
1.1       deraadt   732: knows about C strings and comments and won't be fooled by braces found
                    733: within them, but also allows actions to begin with
                    734: .B %{
                    735: and will consider the action to be all the text up to the next
                    736: .B %}
                    737: (regardless of ordinary braces inside the action).
                    738: .PP
                    739: An action consisting solely of a vertical bar ('|') means "same as
                    740: the action for the next rule."  See below for an illustration.
                    741: .PP
                    742: Actions can include arbitrary C code, including
                    743: .B return
                    744: statements to return a value to whatever routine called
                    745: .B yylex().
                    746: Each time
                    747: .B yylex()
                    748: is called it continues processing tokens from where it last left
                    749: off until it either reaches
                    750: the end of the file or executes a return.
                    751: .PP
                    752: Actions are free to modify
                    753: .B yytext
                    754: except for lengthening it (adding
                    755: characters to its end--these will overwrite later characters in the
                    756: input stream).  This however does not apply when using
                    757: .B %array
                    758: (see above); in that case,
                    759: .B yytext
                    760: may be freely modified in any way.
                    761: .PP
                    762: Actions are free to modify
                    763: .B yyleng
                    764: except they should not do so if the action also includes use of
                    765: .B yymore()
                    766: (see below).
                    767: .PP
                    768: There are a number of special directives which can be included within
                    769: an action:
                    770: .IP -
                    771: .B ECHO
                    772: copies yytext to the scanner's output.
                    773: .IP -
                    774: .B BEGIN
                    775: followed by the name of a start condition places the scanner in the
                    776: corresponding start condition (see below).
                    777: .IP -
                    778: .B REJECT
                    779: directs the scanner to proceed on to the "second best" rule which matched the
                    780: input (or a prefix of the input).  The rule is chosen as described
                    781: above in "How the Input is Matched", and
                    782: .B yytext
                    783: and
                    784: .B yyleng
                    785: are set up appropriately.
                    786: It may either be one which matched as much text
                    787: as the originally chosen rule but came later in the
                    788: .I flex
                    789: input file, or one which matched less text.
                    790: For example, the following will both count the
                    791: words in the input and call the routine special() whenever "frob" is seen:
                    792: .nf
                    793:
                    794:             int word_count = 0;
                    795:     %%
                    796:
                    797:     frob        special(); REJECT;
                    798:     [^ \\t\\n]+   ++word_count;
                    799:
                    800: .fi
                    801: Without the
                    802: .B REJECT,
                    803: any occurrences of "frob" in the input would not be counted as words, since the
                    804: scanner normally executes only one action per token.
                    805: Multiple
                    806: .B REJECT
                    807: actions are allowed, each one finding the next best choice to the currently
                    808: active rule.  For example, when the following scanner scans the token
                    809: "abcd", it will write "abcdabcaba" to the output:
                    810: .nf
                    811:
                    812:     %%
                    813:     a        |
                    814:     ab       |
                    815:     abc      |
                    816:     abcd     ECHO; REJECT;
                    817:     .|\\n     /* eat up any unmatched character */
                    818:
                    819: .fi
                    820: (The first three rules share the fourth's action since they use
                    821: the special '|' action.)
                    822: .B REJECT
                    823: is a particularly expensive feature in terms of scanner performance;
                    824: if it is used in
                    825: .I any
                    826: of the scanner's actions it will slow down
                    827: .I all
                    828: of the scanner's matching.  Furthermore,
                    829: .B REJECT
                    830: cannot be used with the
                    831: .I -Cf
                    832: or
                    833: .I -CF
                    834: options (see below).
                    835: .IP
                    836: Note also that unlike the other special actions,
                    837: .B REJECT
                    838: is a
                    839: .I branch;
                    840: code immediately following it in the action will
                    841: .I not
                    842: be executed.
                    843: .IP -
                    844: .B yymore()
                    845: tells the scanner that the next time it matches a rule, the corresponding
                    846: token should be
                    847: .I appended
                    848: onto the current value of
                    849: .B yytext
                    850: rather than replacing it.  For example, given the input "mega-kludge"
                    851: the following will write "mega-mega-kludge" to the output:
                    852: .nf
                    853:
                    854:     %%
                    855:     mega-    ECHO; yymore();
                    856:     kludge   ECHO;
                    857:
                    858: .fi
                    859: First "mega-" is matched and echoed to the output.  Then "kludge"
                    860: is matched, but the previous "mega-" is still hanging around at the
                    861: beginning of
                    862: .B yytext
                    863: so the
                    864: .B ECHO
                    865: for the "kludge" rule will actually write "mega-kludge".
                    866: .PP
                    867: Two notes regarding use of
                    868: .B yymore().
                    869: First,
                    870: .B yymore()
                    871: depends on the value of
                    872: .I yyleng
                    873: correctly reflecting the size of the current token, so you must not
                    874: modify
                    875: .I yyleng
                    876: if you are using
                    877: .B yymore().
                    878: Second, the presence of
                    879: .B yymore()
                    880: in the scanner's action entails a minor performance penalty in the
                    881: scanner's matching speed.
                    882: .IP -
                    883: .B yyless(n)
                    884: returns all but the first
                    885: .I n
                    886: characters of the current token back to the input stream, where they
                    887: will be rescanned when the scanner looks for the next match.
                    888: .B yytext
                    889: and
                    890: .B yyleng
                    891: are adjusted appropriately (e.g.,
                    892: .B yyleng
                    893: will now be equal to
                    894: .I n
                    895: ).  For example, on the input "foobar" the following will write out
                    896: "foobarbar":
                    897: .nf
                    898:
                    899:     %%
                    900:     foobar    ECHO; yyless(3);
                    901:     [a-z]+    ECHO;
                    902:
                    903: .fi
                    904: An argument of 0 to
                    905: .B yyless
                    906: will cause the entire current input string to be scanned again.  Unless you've
                    907: changed how the scanner will subsequently process its input (using
                    908: .B BEGIN,
                    909: for example), this will result in an endless loop.
                    910: .PP
                    911: Note that
                    912: .B yyless
                    913: is a macro and can only be used in the flex input file, not from
                    914: other source files.
                    915: .IP -
                    916: .B unput(c)
                    917: puts the character
                    918: .I c
                    919: back onto the input stream.  It will be the next character scanned.
                    920: The following action will take the current token and cause it
                    921: to be rescanned enclosed in parentheses.
                    922: .nf
                    923:
                    924:     {
                    925:     int i;
                    926:     /* Copy yytext because unput() trashes yytext */
                    927:     char *yycopy = strdup( yytext );
                    928:     unput( ')' );
                    929:     for ( i = yyleng - 1; i >= 0; --i )
                    930:         unput( yycopy[i] );
                    931:     unput( '(' );
                    932:     free( yycopy );
                    933:     }
                    934:
                    935: .fi
                    936: Note that since each
                    937: .B unput()
                    938: puts the given character back at the
                    939: .I beginning
                    940: of the input stream, pushing back strings must be done back-to-front.
                    941: .PP
                    942: An important potential problem when using
                    943: .B unput()
                    944: is that if you are using
                    945: .B %pointer
                    946: (the default), a call to
                    947: .B unput()
                    948: .I destroys
                    949: the contents of
                    950: .I yytext,
                    951: starting with its rightmost character and devouring one character to
                    952: the left with each call.  If you need the value of yytext preserved
                    953: after a call to
                    954: .B unput()
                    955: (as in the above example),
                    956: you must either first copy it elsewhere, or build your scanner using
                    957: .B %array
                    958: instead (see How The Input Is Matched).
                    959: .PP
                    960: Finally, note that you cannot put back
                    961: .B EOF
                    962: to attempt to mark the input stream with an end-of-file.
                    963: .IP -
                    964: .B input()
                    965: reads the next character from the input stream.  For example,
                    966: the following is one way to eat up C comments:
                    967: .nf
                    968:
                    969:     %%
                    970:     "/*"        {
                    971:                 register int c;
                    972:
                    973:                 for ( ; ; )
                    974:                     {
                    975:                     while ( (c = input()) != '*' &&
                    976:                             c != EOF )
                    977:                         ;    /* eat up text of comment */
                    978:
                    979:                     if ( c == '*' )
                    980:                         {
                    981:                         while ( (c = input()) == '*' )
                    982:                             ;
                    983:                         if ( c == '/' )
                    984:                             break;    /* found the end */
                    985:                         }
                    986:
                    987:                     if ( c == EOF )
                    988:                         {
                    989:                         error( "EOF in comment" );
                    990:                         break;
                    991:                         }
                    992:                     }
                    993:                 }
                    994:
                    995: .fi
                    996: (Note that if the scanner is compiled using
                    997: .B C++,
                    998: then
                    999: .B input()
                   1000: is instead referred to as
                   1001: .B yyinput(),
                   1002: in order to avoid a name clash with the
                   1003: .B C++
                   1004: stream by the name of
                   1005: .I input.)
                   1006: .IP -
                   1007: .B YY_FLUSH_BUFFER
                   1008: flushes the scanner's internal buffer
                   1009: so that the next time the scanner attempts to match a token, it will
                   1010: first refill the buffer using
                   1011: .B YY_INPUT
                   1012: (see The Generated Scanner, below).  This action is a special case
                   1013: of the more general
                   1014: .B yy_flush_buffer()
                   1015: function, described below in the section Multiple Input Buffers.
                   1016: .IP -
                   1017: .B yyterminate()
                   1018: can be used in lieu of a return statement in an action.  It terminates
                   1019: the scanner and returns a 0 to the scanner's caller, indicating "all done".
                   1020: By default,
                   1021: .B yyterminate()
                   1022: is also called when an end-of-file is encountered.  It is a macro and
                   1023: may be redefined.
                   1024: .SH THE GENERATED SCANNER
                   1025: The output of
                   1026: .I flex
                   1027: is the file
                   1028: .B lex.yy.c,
                   1029: which contains the scanning routine
                   1030: .B yylex(),
                   1031: a number of tables used by it for matching tokens, and a number
                   1032: of auxiliary routines and macros.  By default,
                   1033: .B yylex()
                   1034: is declared as follows:
                   1035: .nf
                   1036:
                   1037:     int yylex()
                   1038:         {
                   1039:         ... various definitions and the actions in here ...
                   1040:         }
                   1041:
                   1042: .fi
                   1043: (If your environment supports function prototypes, then it will
                   1044: be "int yylex( void )".)  This definition may be changed by defining
                   1045: the "YY_DECL" macro.  For example, you could use:
                   1046: .nf
                   1047:
                   1048:     #define YY_DECL float lexscan( a, b ) float a, b;
                   1049:
                   1050: .fi
                   1051: to give the scanning routine the name
                   1052: .I lexscan,
                   1053: returning a float, and taking two floats as arguments.  Note that
                   1054: if you give arguments to the scanning routine using a
                   1055: K&R-style/non-prototyped function declaration, you must terminate
                   1056: the definition with a semicolon (;).
                   1057: .PP
                   1058: Whenever
                   1059: .B yylex()
                   1060: is called, it scans tokens from the global input file
                   1061: .I yyin
                   1062: (which defaults to stdin).  It continues until it either reaches
                   1063: an end-of-file (at which point it returns the value 0) or
                   1064: one of its actions executes a
                   1065: .I return
                   1066: statement.
                   1067: .PP
                   1068: If the scanner reaches an end-of-file, subsequent calls are undefined
                   1069: unless either
                   1070: .I yyin
                   1071: is pointed at a new input file (in which case scanning continues from
                   1072: that file), or
                   1073: .B yyrestart()
                   1074: is called.
                   1075: .B yyrestart()
                   1076: takes one argument, a
                   1077: .B FILE *
                   1078: pointer (which can be NULL, if you've set up
                   1079: .B YY_INPUT
                   1080: to scan from a source other than
                   1081: .I yyin),
                   1082: and initializes
                   1083: .I yyin
                   1084: for scanning from that file.  Essentially there is no difference between
                   1085: just assigning
                   1086: .I yyin
                   1087: to a new input file or using
                   1088: .B yyrestart()
                   1089: to do so; the latter is available for compatibility with previous versions
                   1090: of
                   1091: .I flex,
                   1092: and because it can be used to switch input files in the middle of scanning.
                   1093: It can also be used to throw away the current input buffer, by calling
                   1094: it with an argument of
                   1095: .I yyin;
                   1096: but it is better to use
                   1097: .B YY_FLUSH_BUFFER
                   1098: (see above).
                   1099: Note that
                   1100: .B yyrestart()
                   1101: does
                   1102: .I not
                   1103: reset the start condition to
                   1104: .B INITIAL
                   1105: (see Start Conditions, below).
                   1106: .PP
                   1107: If
                   1108: .B yylex()
                   1109: stops scanning due to executing a
                   1110: .I return
                   1111: statement in one of the actions, the scanner may then be called again and it
                   1112: will resume scanning where it left off.
                   1113: .PP
                   1114: By default (and for purposes of efficiency), the scanner uses
                   1115: block-reads rather than simple
                   1116: .I getc()
                   1117: calls to read characters from
                   1118: .I yyin.
                   1119: The nature of how it gets its input can be controlled by defining the
                   1120: .B YY_INPUT
                   1121: macro.
                   1122: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".  Its
                   1123: action is to place up to
                   1124: .I max_size
                   1125: characters in the character array
                   1126: .I buf
                   1127: and return in the integer variable
                   1128: .I result
                   1129: either the
                   1130: number of characters read or the constant YY_NULL (0 on Unix systems)
                   1131: to indicate EOF.  The default YY_INPUT reads from the
                   1132: global file-pointer "yyin".
                   1133: .PP
                   1134: A sample definition of YY_INPUT (in the definitions
                   1135: section of the input file):
                   1136: .nf
                   1137:
                   1138:     %{
                   1139:     #define YY_INPUT(buf,result,max_size) \\
                   1140:         { \\
                   1141:         int c = getchar(); \\
                   1142:         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
                   1143:         }
                   1144:     %}
                   1145:
                   1146: .fi
                   1147: This definition will change the input processing to occur
                   1148: one character at a time.
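.PP
As a further sketch of the same calling sequence, the definition below
hands the scanner text from an in-memory string in blocks of up to
max_size characters instead of one character at a time.
The names scan_from, read_pending, pending_text and pending_left are
hypothetical and exist only for this illustration, and YY_NULL is assumed
to be 0 when the scanner has not already defined it.
In a real scanner the code would go inside %{ and %} in the definitions
section.
.nf

```c
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0   /* flex normally defines this; 0 is assumed here */
#endif

/* Hypothetical state: the text still waiting to be scanned. */
static const char *pending_text = "";
static size_t pending_left = 0;

/* Hypothetical helper: choose the string the scanner should read. */
void scan_from(const char *text)
{
    pending_text = text;
    pending_left = strlen(text);
}

/* Copy up to max_size pending characters into buf; return the count,
 * or YY_NULL once nothing is left. */
int read_pending(char *buf, size_t max_size)
{
    size_t n = pending_left < max_size ? pending_left : max_size;

    memcpy(buf, pending_text, n);
    pending_text += n;
    pending_left -= n;
    return n > 0 ? (int)n : YY_NULL;
}

/* Same three-argument calling sequence as described above. */
#define YY_INPUT(buf, result, max_size) ((result) = read_pending((buf), (size_t)(max_size)))
```

.fi
After scan_from("hello"), successive uses of YY_INPUT with a max_size of 4
store 4 and then 1 in result, and after that YY_NULL to signal end-of-file.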
                   1149: .PP
                   1150: When the scanner receives an end-of-file indication from YY_INPUT,
                   1151: it then checks the
                   1152: .B yywrap()
                   1153: function.  If
                   1154: .B yywrap()
                   1155: returns false (zero), then it is assumed that the
                   1156: function has gone ahead and set up
                   1157: .I yyin
                   1158: to point to another input file, and scanning continues.  If it returns
                   1159: true (non-zero), then the scanner terminates, returning 0 to its
                   1160: caller.  Note that in either case, the start condition remains unchanged;
                   1161: it does
                   1162: .I not
                   1163: revert to
                   1164: .B INITIAL.
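.PP
A minimal sketch of a user-supplied yywrap() that follows this contract is
shown here: it walks a queue of file names, points yyin at the next file
that can be opened and returns 0, or returns 1 when the queue is empty.
The helper queue_inputs and the queued_ variables are hypothetical.
The definition of yyin is a stand-in so that the fragment compiles on its
own; in a real scanner yyin is supplied by the generated code, so that line
would be dropped (or turned into an extern declaration).
.nf

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-in only: a real scanner gets yyin from the generated code. */
FILE *yyin = NULL;

/* Hypothetical queue of input files still to be scanned. */
static const char *const *queued_names = NULL;
static size_t queued_count = 0;
static size_t queued_next = 0;

/* Hypothetical helper: register the files to scan, in order. */
void queue_inputs(const char *const *names, size_t count)
{
    queued_names = names;
    queued_count = count;
    queued_next = 0;
}

/* Return 0 after switching yyin to the next readable file,
 * or 1 when there is nothing left to scan. */
int yywrap(void)
{
    while (queued_next < queued_count) {
        FILE *next = fopen(queued_names[queued_next++], "r");

        if (next == NULL)
            continue;            /* skip files that cannot be opened */
        if (yyin != NULL && yyin != stdin)
            fclose(yyin);        /* done with the previous file */
        yyin = next;
        return 0;                /* scanning continues from yyin */
    }
    return 1;                    /* no more input: scanner terminates */
}
```

.fi
Skipping unreadable files silently is a choice made for this sketch; a
real program would probably want to report them.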
                   1165: .PP
                   1166: If you do not supply your own version of
                   1167: .B yywrap(),
                   1168: then you must either use
                   1169: .B %option noyywrap
                   1170: (in which case the scanner behaves as though
                   1171: .B yywrap()
                   1172: returned 1), or you must link with
                   1173: .B \-lfl
                   1174: to obtain the default version of the routine, which always returns 1.
                   1175: .PP
                   1176: Three routines are available for scanning from in-memory buffers rather
                   1177: than files:
                   1178: .B yy_scan_string(), yy_scan_bytes(),
                   1179: and
                   1180: .B yy_scan_buffer().
                   1181: See the discussion of them below in the section Multiple Input Buffers.
                   1182: .PP
                   1183: The scanner writes its
                   1184: .B ECHO
                   1185: output to the
                   1186: .I yyout
                   1187: global (which defaults to stdout), which may be redefined by the user simply
                   1188: by assigning some other
                   1189: .B FILE
                   1190: pointer to it.
                   1191: .SH START CONDITIONS
                   1192: .I flex
                   1193: provides a mechanism for conditionally activating rules.  Any rule
                   1194: whose pattern is prefixed with "<sc>" will only be active when
                   1195: the scanner is in the start condition named "sc".  For example,
                   1196: .nf
                   1197:
                   1198:     <STRING>[^"]*        { /* eat up the string body ... */
                   1199:                 ...
                   1200:                 }
                   1201:
                   1202: .fi
                   1203: will be active only when the scanner is in the "STRING" start
                   1204: condition, and
                   1205: .nf
                   1206:
                   1207:     <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                   1208:                 ...
                   1209:                 }
                   1210:
                   1211: .fi
                   1212: will be active only when the current start condition is
                   1213: either "INITIAL", "STRING", or "QUOTE".
                   1214: .PP
                   1215: Start conditions
                   1216: are declared in the definitions (first) section of the input
                   1217: using unindented lines beginning with either
                   1218: .B %s
                   1219: or
                   1220: .B %x
                   1221: followed by a list of names.
                   1222: The former declares
                   1223: .I inclusive
                   1224: start conditions, the latter
                   1225: .I exclusive
                   1226: start conditions.  A start condition is activated using the
                   1227: .B BEGIN
                   1228: action.  Until the next
                   1229: .B BEGIN
                   1230: action is executed, rules with the given start
                   1231: condition will be active and
                   1232: rules with other start conditions will be inactive.
                   1233: If the start condition is
                   1234: .I inclusive,
                   1235: then rules with no start conditions at all will also be active.
                   1236: If it is
                   1237: .I exclusive,
                   1238: then
                   1239: .I only
                   1240: rules qualified with the start condition will be active.
                   1241: A set of rules contingent on the same exclusive start condition
                   1242: describes a scanner which is independent of any of the other rules in the
                   1243: .I flex
                   1244: input.  Because of this,
                   1245: exclusive start conditions make it easy to specify "mini-scanners"
                   1246: which scan portions of the input that are syntactically different
                   1247: from the rest (e.g., comments).
                   1248: .PP
                   1249: If the distinction between inclusive and exclusive start conditions
                   1250: is still a little vague, here's a simple example illustrating the
                   1251: connection between the two.  The set of rules:
                   1252: .nf
                   1253:
                   1254:     %s example
                   1255:     %%
                   1256:
                   1257:     <example>foo   do_something();
                   1258:
                   1259:     bar            something_else();
                   1260:
                   1261: .fi
                   1262: is equivalent to
                   1263: .nf
                   1264:
                   1265:     %x example
                   1266:     %%
                   1267:
                   1268:     <example>foo   do_something();
                   1269:
                   1270:     <INITIAL,example>bar    something_else();
                   1271:
                   1272: .fi
                   1273: Without the
                   1274: .B <INITIAL,example>
                   1275: qualifier, the
                   1276: .I bar
                   1277: pattern in the second example wouldn't be active (i.e., couldn't match)
                   1278: when in start condition
                   1279: .B example.
                   1280: If we just used
                   1281: .B <example>
                   1282: to qualify
                   1283: .I bar,
                   1284: though, then it would only be active in
                   1285: .B example
                   1286: and not in
                   1287: .B INITIAL,
                   1288: while in the first example it's active in both, because in the first
                   1289: example the
                   1290: .B example
1.10      deraadt  1291: start condition is an
1.1       deraadt  1292: .I inclusive
                   1293: .B (%s)
                   1294: start condition.
                   1295: .PP
                   1296: Also note that the special start-condition specifier
                   1297: .B <*>
                   1298: matches every start condition.  Thus, the above example could also
                   1299: have been written:
                   1300: .nf
                   1301:
                   1302:     %x example
                   1303:     %%
                   1304:
                   1305:     <example>foo   do_something();
                   1306:
                   1307:     <*>bar    something_else();
                   1308:
                   1309: .fi
                   1310: .PP
                   1311: The default rule (to
                   1312: .B ECHO
                   1313: any unmatched character) remains active in start conditions.  It
                   1314: is equivalent to:
                   1315: .nf
                   1316:
                   1317:     <*>.|\\n     ECHO;
                   1318:
                   1319: .fi
                   1320: .PP
                   1321: .B BEGIN(0)
                   1322: returns to the original state where only the rules with
                   1323: no start conditions are active.  This state can also be
                   1324: referred to as the start-condition "INITIAL", so
                   1325: .B BEGIN(INITIAL)
                   1326: is equivalent to
                   1327: .B BEGIN(0).
                   1328: (The parentheses around the start condition name are not required but
                   1329: are considered good style.)
                   1330: .PP
                   1331: .B BEGIN
                   1332: actions can also be given as indented code at the beginning
                   1333: of the rules section.  For example, the following will cause
                   1334: the scanner to enter the "SPECIAL" start condition whenever
                   1335: .B yylex()
                   1336: is called and the global variable
                   1337: .I enter_special
                   1338: is true:
                   1339: .nf
                   1340:
                   1341:             int enter_special;
                   1342:
                   1343:     %x SPECIAL
                   1344:     %%
                   1345:             if ( enter_special )
                   1346:                 BEGIN(SPECIAL);
                   1347:
                   1348:     <SPECIAL>blahblahblah
                   1349:     ...more rules follow...
                   1350:
                   1351: .fi
                   1352: .PP
                   1353: To illustrate the uses of start conditions,
                   1354: here is a scanner which provides two different interpretations
                   1355: of a string like "123.456".  By default it will treat it as
                   1356: three tokens, the integer "123", a dot ('.'), and the integer "456".
                   1357: But if the string is preceded earlier in the line by the string
                   1358: "expect-floats"
                   1359: it will treat it as a single token, the floating-point number
                   1360: 123.456:
                   1361: .nf
                   1362:
                   1363:     %{
                   1364:     #include <stdlib.h>
                   1365:     %}
                   1366:     %s expect
                   1367:
                   1368:     %%
                   1369:     expect-floats        BEGIN(expect);
                   1370:
                   1371:     <expect>[0-9]+"."[0-9]+      {
                   1372:                 printf( "found a float, = %f\\n",
                   1373:                         atof( yytext ) );
                   1374:                 }
                   1375:     <expect>\\n           {
                   1376:                 /* that's the end of the line, so
                   1377:                  * we need another "expect-floats"
                   1378:                  * before we'll recognize any more
                   1379:                  * numbers
                   1380:                  */
                   1381:                 BEGIN(INITIAL);
                   1382:                 }
                   1383:
                   1384:     [0-9]+      {
                   1385:                 printf( "found an integer, = %d\\n",
                   1386:                         atoi( yytext ) );
                   1387:                 }
                   1388:
                   1389:     "."         printf( "found a dot\\n" );
                   1390:
                   1391: .fi
                   1392: Here is a scanner which recognizes (and discards) C comments while
                   1393: maintaining a count of the current input line.
                   1394: .nf
                   1395:
                   1396:     %x comment
                   1397:     %%
                   1398:             int line_num = 1;
                   1399:
                   1400:     "/*"         BEGIN(comment);
                   1401:
                   1402:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1403:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1404:     <comment>\\n             ++line_num;
                   1405:     <comment>"*"+"/"        BEGIN(INITIAL);
                   1406:
                   1407: .fi
                   1408: This scanner goes to a bit of trouble to match as much
                   1409: text as possible with each rule.  In general, when attempting to write
1.10      deraadt  1410: a high-speed scanner, try to match as much as possible in each rule, as
1.1       deraadt  1411: it's a big win.
                   1412: .PP
1.10      deraadt  1413: Note that start-condition names are really integer values and
1.1       deraadt  1414: can be stored as such.  Thus, the above could be extended in the
                   1415: following fashion:
                   1416: .nf
                   1417:
                   1418:     %x comment foo
                   1419:     %%
                   1420:             int line_num = 1;
                   1421:             int comment_caller;
                   1422:
                   1423:     "/*"         {
                   1424:                  comment_caller = INITIAL;
                   1425:                  BEGIN(comment);
                   1426:                  }
                   1427:
                   1428:     ...
                   1429:
                   1430:     <foo>"/*"    {
                   1431:                  comment_caller = foo;
                   1432:                  BEGIN(comment);
                   1433:                  }
                   1434:
                   1435:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1436:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1437:     <comment>\\n             ++line_num;
                   1438:     <comment>"*"+"/"        BEGIN(comment_caller);
                   1439:
                   1440: .fi
                   1441: Furthermore, you can access the current start condition using
                   1442: the integer-valued
                   1443: .B YY_START
                   1444: macro.  For example, the above assignments to
                   1445: .I comment_caller
                   1446: could instead be written
                   1447: .nf
                   1448:
                   1449:     comment_caller = YY_START;
                   1450:
                   1451: .fi
                   1452: Flex provides
                   1453: .B YYSTATE
                   1454: as an alias for
                   1455: .B YY_START
                   1456: (since that is what's used by AT&T
                   1457: .I lex).
                   1458: .PP
                   1459: Note that start conditions do not have their own name-space; %s's and %x's
                   1460: declare names in the same fashion as #define's.
                   1461: .PP
                   1462: Finally, here's an example of how to match C-style quoted strings using
                   1463: exclusive start conditions, including expanded escape sequences (but
                   1464: not including checking for a string that's too long):
                   1465: .nf
                   1466:
                   1467:     %x str
                   1468:
                   1469:     %%
                   1470:             char string_buf[MAX_STR_CONST];
                   1471:             char *string_buf_ptr;
                   1472:
                   1473:
                   1474:     \\"      string_buf_ptr = string_buf; BEGIN(str);
                   1475:
                   1476:     <str>\\"        { /* saw closing quote - all done */
                   1477:             BEGIN(INITIAL);
                   1478:             *string_buf_ptr = '\\0';
                   1479:             /* return string constant token type and
                   1480:              * value to parser
                   1481:              */
                   1482:             }
                   1483:
                   1484:     <str>\\n        {
                   1485:             /* error - unterminated string constant */
                   1486:             /* generate error message */
                   1487:             }
                   1488:
                   1489:     <str>\\\\[0-7]{1,3} {
                   1490:             /* octal escape sequence */
                   1491:             unsigned int result;
                   1492:
                   1493:             (void) sscanf( yytext + 1, "%o", &result );
                   1494:
                   1495:             if ( result > 0xff )
                   1496:                     { /* error, constant is out-of-bounds */ }
                   1497:             else
                   1498:                     *string_buf_ptr++ = result;
                   1499:             }
                   1500:
                   1501:     <str>\\\\[0-9]+ {
                   1502:             /* generate error - bad escape sequence; something
                   1503:              * like '\\48' or '\\0777777'
                   1504:              */
                   1505:             }
                   1506:
                   1507:     <str>\\\\n  *string_buf_ptr++ = '\\n';
                   1508:     <str>\\\\t  *string_buf_ptr++ = '\\t';
                   1509:     <str>\\\\r  *string_buf_ptr++ = '\\r';
                   1510:     <str>\\\\b  *string_buf_ptr++ = '\\b';
                   1511:     <str>\\\\f  *string_buf_ptr++ = '\\f';
                   1512:
                   1513:     <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];
                   1514:
                   1515:     <str>[^\\\\\\n\\"]+        {
                   1516:             char *yptr = yytext;
                   1517:
                   1518:             while ( *yptr )
                   1519:                     *string_buf_ptr++ = *yptr++;
                   1520:             }
                   1521:
                   1522: .fi
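.PP
The octal-escape rule above can be exercised outside a scanner.
The following is a minimal sketch, not part of flex:
.B decode_octal()
is a hypothetical helper that takes the text just past the backslash
(what the rule passes as yytext + 1) and applies the same
conversion and range check.
.nf
```c
#include <stdio.h>

/* Hypothetical helper, not provided by flex.  digits points just
 * past the backslash of the escape.  Returns 0 and stores the byte
 * in *out on success; returns -1 if no octal digits were found or
 * the value does not fit in one byte. */
int decode_octal(const char *digits, unsigned char *out)
{
    unsigned int result;    /* %o requires an unsigned int */

    if (sscanf(digits, "%o", &result) != 1)
        return -1;          /* no octal digits at all */
    if (result > 0xff)
        return -1;          /* constant is out-of-bounds */
    *out = (unsigned char) result;
    return 0;
}
```
.fi
Returning a status lets the caller decide how to report the error,
rather than storing an out-of-range value in the string buffer.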
                   1523: .PP
                   1524: Often, such as in some of the examples above, you wind up writing a
                   1525: whole bunch of rules all preceded by the same start condition(s).  Flex
                   1526: makes this a little easier and cleaner by introducing a notion of
                   1527: start condition
                   1528: .I scope.
                   1529: A start condition scope is begun with:
                   1530: .nf
                   1531:
                   1532:     <SCs>{
                   1533:
                   1534: .fi
                   1535: where
                   1536: .I SCs
                   1537: is a list of one or more start conditions.  Inside the start condition
                   1538: scope, every rule automatically has the prefix
                   1539: .I <SCs>
                   1540: applied to it, until a
                   1541: .I '}'
                   1542: which matches the initial
                   1543: .I '{'.
                   1544: So, for example,
                   1545: .nf
                   1546:
                   1547:     <ESC>{
                   1548:         "\\\\n"   return '\\n';
                   1549:         "\\\\r"   return '\\r';
                   1550:         "\\\\f"   return '\\f';
                   1551:         "\\\\0"   return '\\0';
                   1552:     }
                   1553:
                   1554: .fi
                   1555: is equivalent to:
                   1556: .nf
                   1557:
                   1558:     <ESC>"\\\\n"  return '\\n';
                   1559:     <ESC>"\\\\r"  return '\\r';
                   1560:     <ESC>"\\\\f"  return '\\f';
                   1561:     <ESC>"\\\\0"  return '\\0';
                   1562:
                   1563: .fi
                   1564: Start condition scopes may be nested.
                   1565: .PP
                   1566: Three routines are available for manipulating stacks of start conditions:
                   1567: .TP
                   1568: .B void yy_push_state(int new_state)
                   1569: pushes the current start condition onto the top of the start condition
                   1570: stack and switches to
                   1571: .I new_state
                   1572: as though you had used
                   1573: .B BEGIN new_state
                   1574: (recall that start condition names are also integers).
                   1575: .TP
                   1576: .B void yy_pop_state()
                   1577: pops the top of the stack and switches to it via
                   1578: .B BEGIN.
                   1579: .TP
                   1580: .B int yy_top_state()
                   1581: returns the top of the stack without altering the stack's contents.
                   1582: .PP
                   1583: The start condition stack grows dynamically and so has no built-in
                   1584: size limitation.  If memory is exhausted, program execution aborts.
                   1585: .PP
                   1586: To use start condition stacks, your scanner must include a
                   1587: .B %option stack
                   1588: directive (see Options below).
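.PP
The behaviour of the three stack routines can be modelled in plain C.
This is only a sketch of the semantics described above, not the code
that flex generates; the
.B sc_model
structure and the
.B sc_push(), sc_pop(),
and
.B sc_top()
names are hypothetical, and aborting on an empty stack is an
assumption of the model.
.nf
```c
#include <stdlib.h>

/* Hypothetical model of the start condition stack; these names are
 * illustrative and are not part of the generated scanner. */
struct sc_model {
    int current;     /* active start condition (what YY_START reports) */
    int *stack;      /* saved start conditions, grown on demand */
    size_t depth;    /* number of saved entries */
    size_t cap;      /* allocated capacity */
};

/* Like yy_push_state(): save the current condition, then switch. */
void sc_push(struct sc_model *m, int new_state)
{
    if (m->depth == m->cap) {
        size_t ncap = m->cap ? m->cap * 2 : 8;
        int *p = realloc(m->stack, ncap * sizeof *p);

        if (p == NULL)
            abort();         /* memory exhausted: execution aborts */
        m->stack = p;
        m->cap = ncap;
    }
    m->stack[m->depth++] = m->current;
    m->current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved one. */
void sc_pop(struct sc_model *m)
{
    if (m->depth == 0)
        abort();             /* assumption: empty stack is fatal */
    m->current = m->stack[--m->depth];
}

/* Like yy_top_state(): peek at the saved condition, no change. */
int sc_top(const struct sc_model *m)
{
    if (m->depth == 0)
        abort();             /* assumption: empty stack is fatal */
    return m->stack[m->depth - 1];
}
```
.fi
Starting from condition 0, pushing 1 and then 2 leaves 2 active with
1 on top of the stack; two pops return to 0.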
                   1589: .SH MULTIPLE INPUT BUFFERS
                   1590: Some scanners (such as those which support "include" files)
                   1591: require reading from several input streams.  As
                   1592: .I flex
                   1593: scanners do a large amount of buffering, one cannot control
                   1594: where the next input will be read from by simply writing a
                   1595: .B YY_INPUT
                   1596: which is sensitive to the scanning context.
                   1597: .B YY_INPUT
                   1598: is only called when the scanner reaches the end of its buffer, which
                   1599: may be a long time after scanning a statement such as an "include"
                   1600: which requires switching the input source.
                   1601: .PP
                   1602: To negotiate these sorts of problems,
                   1603: .I flex
                   1604: provides a mechanism for creating and switching between multiple
                   1605: input buffers.  An input buffer is created by using:
                   1606: .nf
                   1607:
                   1608:     YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
                   1609:
                   1610: .fi
                   1611: which takes a
                   1612: .I FILE
                   1613: pointer and a size and creates a buffer associated with the given
                   1614: file and large enough to hold
                   1615: .I size
                   1616: characters (when in doubt, use
                   1617: .B YY_BUF_SIZE
                   1618: for the size).  It returns a
                   1619: .B YY_BUFFER_STATE
                   1620: handle, which may then be passed to other routines (see below).  The
                   1621: .B YY_BUFFER_STATE
                   1622: type is a pointer to an opaque
                   1623: .B struct yy_buffer_state
                   1624: structure, so you may safely initialize YY_BUFFER_STATE variables to
                   1625: .B ((YY_BUFFER_STATE) 0)
                   1626: if you wish, and also refer to the opaque structure in order to
                   1627: correctly declare input buffers in source files other than that
                   1628: of your scanner.  Note that the
                   1629: .I FILE
                   1630: pointer in the call to
                   1631: .B yy_create_buffer
                   1632: is only used as the value of
                   1633: .I yyin
                   1634: seen by
                   1635: .B YY_INPUT;
                   1636: if you redefine
                   1637: .B YY_INPUT
                   1638: so it no longer uses
                   1639: .I yyin,
                   1640: then you can safely pass a nil
                   1641: .I FILE
                   1642: pointer to
                   1643: .B yy_create_buffer.
                   1644: You select a particular buffer to scan from using:
                   1645: .nf
                   1646:
                   1647:     void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
                   1648:
                   1649: .fi
                   1650: This switches the scanner's input buffer so subsequent tokens will
                   1651: come from
                   1652: .I new_buffer.
                   1653: Note that
                   1654: .B yy_switch_to_buffer()
                   1655: may be used by yywrap() to set things up for continued scanning, instead
                   1656: of opening a new file and pointing
                   1657: .I yyin
                   1658: at it.  Note also that switching input sources via either
                   1659: .B yy_switch_to_buffer()
                   1660: or
                   1661: .B yywrap()
                   1662: does
                   1663: .I not
                   1664: change the start condition.
                   1665: .nf
                   1666:
                   1667:     void yy_delete_buffer( YY_BUFFER_STATE buffer )
                   1668:
                   1669: .fi
                   1670: is used to reclaim the storage associated with a buffer.  (
                   1671: .B buffer
                   1672: can be nil, in which case the routine does nothing.)
                   1673: You can also clear the current contents of a buffer using:
                   1674: .nf
                   1675:
                   1676:     void yy_flush_buffer( YY_BUFFER_STATE buffer )
                   1677:
                   1678: .fi
                   1679: This function discards the buffer's contents,
                   1680: so the next time the scanner attempts to match a token from the
                   1681: buffer, it will first fill the buffer anew using
                   1682: .B YY_INPUT.
                   1683: .PP
                   1684: .B yy_new_buffer()
                   1685: is an alias for
                   1686: .B yy_create_buffer(),
                   1687: provided for compatibility with the C++ use of
                   1688: .I new
                   1689: and
                   1690: .I delete
                   1691: for creating and destroying dynamic objects.
                   1692: .PP
                   1693: Finally, the
                   1694: .B YY_CURRENT_BUFFER
                   1695: macro returns a
                   1696: .B YY_BUFFER_STATE
                   1697: handle to the current buffer.
                   1698: .PP
                   1699: Here is an example of using these features for writing a scanner
                   1700: which expands include files (the
                   1701: .B <<EOF>>
                   1702: feature is discussed below):
                   1703: .nf
                   1704:
                   1705:     /* the "incl" state is used for picking up the name
                   1706:      * of an include file
                   1707:      */
                   1708:     %x incl
                   1709:
                   1710:     %{
                   1711:     #define MAX_INCLUDE_DEPTH 10
                   1712:     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
                   1713:     int include_stack_ptr = 0;
                   1714:     %}
                   1715:
                   1716:     %%
                   1717:     include             BEGIN(incl);
                   1718:
                   1719:     [a-z]+              ECHO;
                   1720:     [^a-z\\n]*\\n?        ECHO;
                   1721:
                   1722:     <incl>[ \\t]*      /* eat the whitespace */
                   1723:     <incl>[^ \\t\\n]+   { /* got the include file name */
                   1724:             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                   1725:                 {
                   1726:                 fprintf( stderr, "Includes nested too deeply" );
                   1727:                 exit( 1 );
                   1728:                 }
                   1729:
                   1730:             include_stack[include_stack_ptr++] =
                   1731:                 YY_CURRENT_BUFFER;
                   1732:
                   1733:             yyin = fopen( yytext, "r" );
                   1734:
                   1735:             if ( ! yyin )
                   1736:                 error( ... );
                   1737:
                   1738:             yy_switch_to_buffer(
                   1739:                 yy_create_buffer( yyin, YY_BUF_SIZE ) );
                   1740:
                   1741:             BEGIN(INITIAL);
                   1742:             }
                   1743:
                   1744:     <<EOF>> {
                   1745:             if ( --include_stack_ptr < 0 )
                   1746:                 {
                   1747:                 yyterminate();
                   1748:                 }
                   1749:
                   1750:             else
                   1751:                 {
                   1752:                 yy_delete_buffer( YY_CURRENT_BUFFER );
                   1753:                 yy_switch_to_buffer(
                   1754:                      include_stack[include_stack_ptr] );
                   1755:                 }
                   1756:             }
                   1757:
                   1758: .fi
                   1759: Three routines are available for setting up input buffers for
                   1760: scanning in-memory strings instead of files.  All of them create
                   1761: a new input buffer for scanning the string, and return a corresponding
                   1762: .B YY_BUFFER_STATE
                   1763: handle (which you should delete with
                   1764: .B yy_delete_buffer()
                   1765: when done with it).  They also switch to the new buffer using
                   1766: .B yy_switch_to_buffer(),
                   1767: so the next call to
                   1768: .B yylex()
                   1769: will start scanning the string.
                   1770: .TP
                   1771: .B yy_scan_string(const char *str)
                   1772: scans a NUL-terminated string.
                   1773: .TP
                   1774: .B yy_scan_bytes(const char *bytes, int len)
                   1775: scans
                   1776: .I len
                   1777: bytes (including possibly NUL's)
                   1778: starting at location
                   1779: .I bytes.
                   1780: .PP
                   1781: Note that both of these functions create and scan a
                   1782: .I copy
                   1783: of the string or bytes.  (This may be desirable, since
                   1784: .B yylex()
                   1785: modifies the contents of the buffer it is scanning.)  You can avoid the
                   1786: copy by using:
                   1787: .TP
                   1788: .B yy_scan_buffer(char *base, yy_size_t size)
                   1789: which scans in place the buffer starting at
                   1790: .I base,
                   1791: consisting of
                   1792: .I size
                   1793: bytes, the last two bytes of which
                   1794: .I must
                   1795: be
                   1796: .B YY_END_OF_BUFFER_CHAR
                   1797: (ASCII NUL).
                   1798: These last two bytes are not scanned; thus, scanning
                   1799: consists of
                   1800: .B base[0]
                   1801: through
                   1802: .B base[size-3],
                   1803: inclusive.
                   1804: .IP
                   1805: If you fail to set up
                   1806: .I base
                   1807: in this manner (i.e., forget the final two
                   1808: .B YY_END_OF_BUFFER_CHAR
                   1809: bytes), then
                   1810: .B yy_scan_buffer()
                   1811: returns a nil pointer instead of creating a new input buffer.
                   1812: .IP
                   1813: The type
                   1814: .B yy_size_t
                   1815: is an integral type to which you can cast an integer expression
                   1816: reflecting the size of the buffer.
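.PP
As a sketch of preparing memory for
.B yy_scan_buffer(),
the hypothetical helper below (it is not part of flex) copies a
NUL-terminated string into a new buffer that ends in the two
required NUL bytes, and reports the total size including them.
.nf
```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not provided by flex.  Copies str into a
 * freshly allocated buffer whose last two bytes are NUL, as
 * yy_scan_buffer() requires, and stores the total size (the two
 * trailing bytes included) in *size.  Returns NULL if the
 * allocation fails. */
char *make_scan_buffer(const char *str, size_t *size)
{
    size_t len = strlen(str);
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;
    memcpy(base, str, len);
    base[len] = 0;          /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = 0;      /* second YY_END_OF_BUFFER_CHAR */
    *size = len + 2;
    return base;
}
```
.fi
In a scanner, the result would then be passed along as
yy_scan_buffer(base, (yy_size_t) size), and the returned handle
released with
.B yy_delete_buffer()
when scanning is finished.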
                   1817: .SH END-OF-FILE RULES
                   1818: The special rule "<<EOF>>" indicates
                   1819: actions which are to be taken when an end-of-file is
                   1820: encountered and yywrap() returns non-zero (i.e., indicates
                   1821: no further files to process).  The action must finish
                   1822: by doing one of four things:
                   1823: .IP -
                   1824: assigning
                   1825: .I yyin
                   1826: to a new input file (in previous versions of flex, after doing the
                   1827: assignment you had to call the special action
                   1828: .B YY_NEW_FILE;
                   1829: this is no longer necessary);
                   1830: .IP -
                   1831: executing a
                   1832: .I return
                   1833: statement;
                   1834: .IP -
                   1835: executing the special
                   1836: .B yyterminate()
                   1837: action;
                   1838: .IP -
                   1839: or, switching to a new buffer using
                   1840: .B yy_switch_to_buffer()
                   1841: as shown in the example above.
                   1842: .PP
                   1843: <<EOF>> rules may not be used with other
                   1844: patterns; they may only be qualified with a list of start
                   1845: conditions.  If an unqualified <<EOF>> rule is given, it
                   1846: applies to
                   1847: .I all
                   1848: start conditions which do not already have <<EOF>> actions.  To
                   1849: specify an <<EOF>> rule for only the initial start condition, use
                   1850: .nf
                   1851:
                   1852:     <INITIAL><<EOF>>
                   1853:
                   1854: .fi
                   1855: .PP
                   1856: These rules are useful for catching things like unclosed comments.
                   1857: An example:
                   1858: .nf
                   1859:
                   1860:     %x quote
                   1861:     %%
                   1862:
                   1863:     ...other rules for dealing with quotes...
                   1864:
                   1865:     <quote><<EOF>>   {
                   1866:              error( "unterminated quote" );
                   1867:              yyterminate();
                   1868:              }
                   1869:     <<EOF>>  {
                   1870:              if ( *++filelist )
                   1871:                  yyin = fopen( *filelist, "r" );
                   1872:              else
                   1873:                  yyterminate();
                   1874:              }
                   1875:
                   1876: .fi
                   1877: .SH MISCELLANEOUS MACROS
                   1878: The macro
                   1879: .B YY_USER_ACTION
                   1880: can be defined to provide an action
                   1881: which is always executed prior to the matched rule's action.  For example,
                   1882: it could be #define'd to call a routine to convert yytext to lower-case.
                   1883: When
                   1884: .B YY_USER_ACTION
                   1885: is invoked, the variable
                   1886: .I yy_act
                   1887: gives the number of the matched rule (rules are numbered starting with 1).
                   1888: Suppose you want to profile how often each of your rules is matched.  The
                   1889: following would do the trick:
                   1890: .nf
                   1891:
                   1892:     #define YY_USER_ACTION ++ctr[yy_act]
                   1893:
                   1894: .fi
                   1895: where
                   1896: .I ctr
                   1897: is an array to hold the counts for the different rules.  Note that
                   1898: the macro
                   1899: .B YY_NUM_RULES
                   1900: gives the total number of rules (including the default rule, even if
                   1901: you use
                   1902: .B \-s),
                   1903: so a correct declaration for
                   1904: .I ctr
                   1905: is:
                   1906: .nf
                   1907:
                   1908:     int ctr[YY_NUM_RULES];
                   1909:
                   1910: .fi
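.PP
The lower-case conversion mentioned above might be sketched as follows.
The helper name
.I lowercase_in_place
is hypothetical and not part of flex, and the trailing semicolon in the
macro assumes that the scanner expands
.B YY_USER_ACTION
without adding one of its own.
.nf

```c
#include <ctype.h>

/*
 * Hypothetical helper, not part of flex: fold the first len bytes of
 * text to lower case in place.  The length never changes, so it can be
 * applied to yytext, which may be modified but not lengthened.
 */
static void
lowercase_in_place(char *text, int len)
{
    int i;

    for (i = 0; i < len; i++)
        text[i] = (char)tolower((unsigned char)text[i]);
}

/*
 * In the definitions section of a scanner the helper would be hooked
 * in as below.  yytext and yyleng are only referenced where the
 * generated scanner expands the macro.
 */
#define YY_USER_ACTION lowercase_in_place(yytext, yyleng);
```

.fi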
                   1911: .PP
                   1912: The macro
                   1913: .B YY_USER_INIT
                   1914: may be defined to provide an action which is always executed before
                   1915: the first scan (and before the scanner's internal initializations are done).
                   1916: For example, it could be used to call a routine to read
                   1917: in a data table or open a logging file.
                   1918: .PP
                   1919: The macro
                   1920: .B yy_set_interactive(is_interactive)
                   1921: can be used to control whether the current buffer is considered
                   1922: .I interactive.
                   1923: An interactive buffer is processed more slowly,
                   1924: but must be used when the scanner's input source is indeed
                   1925: interactive to avoid problems due to waiting to fill buffers
                   1926: (see the discussion of the
                   1927: .B \-I
                   1928: flag below).  A non-zero value
1.7       aaron    1929: in the macro invocation marks the buffer as interactive, a zero
1.1       deraadt  1930: value as non-interactive.  Note that use of this macro overrides
                   1931: .B %option always-interactive
                   1932: or
                   1933: .B %option never-interactive
                   1934: (see Options below).
                   1935: .B yy_set_interactive()
                   1936: must be invoked prior to beginning to scan the buffer that is
                   1937: (or is not) to be considered interactive.
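.PP
One plausible way to choose the argument is to test whether the input
stream is a terminal.
The sketch below assumes a POSIX system; the helper name
.I stream_is_terminal
is hypothetical and not part of flex.
.nf

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of flex: report whether fp is
 * attached to a terminal.  Returns 1 for a terminal and 0 otherwise
 * (including a NULL stream).
 */
static int
stream_is_terminal(FILE *fp)
{
    if (fp == NULL)
        return 0;
    return isatty(fileno(fp)) ? 1 : 0;
}

/*
 * Intended use inside a scanner, before scanning of the buffer in
 * question begins:
 *
 *     yy_set_interactive( stream_is_terminal( yyin ) );
 */
```

.fi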
                   1938: .PP
                   1939: The macro
                   1940: .B yy_set_bol(at_bol)
                   1941: can be used to control whether the current buffer's scanning
                   1942: context for the next token match is done as though at the
                   1943: beginning of a line.  A non-zero macro argument makes rules anchored with
1.10      deraadt  1944: \'^' active, while a zero argument makes '^' rules inactive.
1.1       deraadt  1945: .PP
                   1946: The macro
                   1947: .B YY_AT_BOL()
                   1948: returns true if the next token scanned from the current buffer
                   1949: will have '^' rules active, false otherwise.
                   1950: .PP
                   1951: In the generated scanner, the actions are all gathered in one large
                   1952: switch statement and separated using
                   1953: .B YY_BREAK,
                   1954: which may be redefined.  By default, it is simply a "break", to separate
1.10      deraadt  1955: each rule's action from the next rule's.
1.1       deraadt  1956: Redefining
                   1957: .B YY_BREAK
                   1958: allows, for example, C++ users to
                   1959: #define YY_BREAK to do nothing (while being very careful that every
                   1960: rule ends with a "break" or a "return"!) and so avoid the
                   1961: unreachable-statement warnings that arise when a rule's action ends with
                   1962: "return", which leaves the following
                   1963: .B YY_BREAK
                   1964: unreachable.
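.PP
The following is a simplified model of that switch statement, written
only to show where the unreachable statement comes from.
It is not flex's actual output; the function name, rule numbers, and
return values are made up for illustration.
.nf

```c
/*
 * Simplified model of the action switch; not flex's actual output.
 * Rule numbers and result values are invented.
 */
#define YY_BREAK break;

static int
run_action(int yy_act)
{
    int result = 0;

    switch (yy_act) {
    case 1:
        result = 10;    /* the action falls off its end ... */
        YY_BREAK        /* ... so this break is needed */
    case 2:
        return 20;      /* the action ends with return ... */
        YY_BREAK        /* ... so this break can never run */
    default:
        result = -1;
        YY_BREAK
    }
    return result;
}
```

.fi
If YY_BREAK were defined as empty, the dead statement after case 2 would
disappear, but case 1 would then fall through into case 2 unless its
action ended with an explicit "break".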
                   1965: .SH VALUES AVAILABLE TO THE USER
                   1966: This section summarizes the various values available to the user
                   1967: in the rule actions.
                   1968: .IP -
                   1969: .B char *yytext
                   1970: holds the text of the current token.  It may be modified but not lengthened
                   1971: (you cannot append characters to the end).
                   1972: .IP
                   1973: If the special directive
                   1974: .B %array
                   1975: appears in the first section of the scanner description, then
                   1976: .B yytext
                   1977: is instead declared
                   1978: .B char yytext[YYLMAX],
                   1979: where
                   1980: .B YYLMAX
                   1981: is a macro definition that you can redefine in the first section
                   1982: if you don't like the default value (generally 8KB).  Using
                   1983: .B %array
                   1984: results in somewhat slower scanners, but the value of
                   1985: .B yytext
                   1986: becomes immune to calls to
                   1987: .I input()
                   1988: and
                   1989: .I unput(),
                   1990: which potentially destroy its value when
                   1991: .B yytext
                   1992: is a character pointer.  The opposite of
                   1993: .B %array
                   1994: is
                   1995: .B %pointer,
                   1996: which is the default.
                   1997: .IP
                   1998: You cannot use
                   1999: .B %array
                   2000: when generating C++ scanner classes
                   2001: (the
                   2002: .B \-+
                   2003: flag).
                   2004: .IP -
                   2005: .B int yyleng
                   2006: holds the length of the current token.
                   2007: .IP -
                   2008: .B FILE *yyin
                   2009: is the file which by default
                   2010: .I flex
                   2011: reads from.  It may be reassigned, but doing so only makes sense before
                   2012: scanning begins or after an EOF has been encountered.  Changing it in
                   2013: the midst of scanning will have unexpected results since
                   2014: .I flex
                   2015: buffers its input; use
                   2016: .B yyrestart()
                   2017: instead.
                   2018: Once scanning terminates because an end-of-file
                   2019: has been seen, you can point
                   2020: .I yyin
                   2021: at the new input file and then call the scanner again to continue scanning.
                   2022: .IP -
                   2023: .B void yyrestart( FILE *new_file )
                   2024: may be called to point
                   2025: .I yyin
                   2026: at the new input file.  The switch-over to the new file is immediate
                   2027: (any previously buffered-up input is lost).  Note that calling
                   2028: .B yyrestart()
                   2029: with
                   2030: .I yyin
                   2031: as an argument thus throws away the current input buffer and continues
                   2032: scanning the same input file.
                   2033: .IP -
                   2034: .B FILE *yyout
                   2035: is the file to which
                   2036: .B ECHO
                   2037: actions are done.  It can be reassigned by the user.
                   2038: .IP -
                   2039: .B YY_CURRENT_BUFFER
                   2040: returns a
                   2041: .B YY_BUFFER_STATE
                   2042: handle to the current buffer.
                   2043: .IP -
                   2044: .B YY_START
                   2045: returns an integer value corresponding to the current start
                   2046: condition.  You can subsequently use this value with
                   2047: .B BEGIN
                   2048: to return to that start condition.
                   2049: .SH INTERFACING WITH YACC
                   2050: One of the main uses of
                   2051: .I flex
                   2052: is as a companion to the
                   2053: .I yacc
                   2054: parser-generator.
                   2055: .I yacc
                   2056: parsers expect to call a routine named
                   2057: .B yylex()
                   2058: to find the next input token.  The routine is supposed to
                   2059: return the type of the next token as well as putting any associated
                   2060: value in the global
                   2061: .B yylval.
                   2062: To use
                   2063: .I flex
                   2064: with
                   2065: .I yacc,
                   2066: one specifies the
                   2067: .B \-d
                   2068: option to
                   2069: .I yacc
                   2070: to instruct it to generate the file
                   2071: .B y.tab.h
                   2072: containing definitions of all the
                   2073: .B %tokens
                   2074: appearing in the
                   2075: .I yacc
                   2076: input.  This file is then included in the
                   2077: .I flex
                   2078: scanner.  For example, if one of the tokens is "TOK_NUMBER",
                   2079: part of the scanner might look like:
                   2080: .nf
                   2081:
                   2082:     %{
                   2083:     #include "y.tab.h"
                   2084:     %}
                   2085:
                   2086:     %%
                   2087:
                   2088:     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
                   2089:
                   2090: .fi
                   2091: .SH OPTIONS
                   2092: .I flex
                   2093: has the following options:
                   2094: .TP
                   2095: .B \-b
                   2096: Generate backing-up information to
                   2097: .I lex.backup.
                   2098: This is a list of scanner states which require backing up
                   2099: and the input characters on which they do so.  By adding rules one
                   2100: can remove backing-up states.  If
                   2101: .I all
                   2102: backing-up states are eliminated and
                   2103: .B \-Cf
                   2104: or
                   2105: .B \-CF
                   2106: is used, the generated scanner will run faster (see the
                   2107: .B \-p
                   2108: flag).  Only users who wish to squeeze every last cycle out of their
                   2109: scanners need worry about this option.  (See the section on Performance
                   2110: Considerations below.)
                   2111: .TP
                   2112: .B \-c
                   2113: is a do-nothing, deprecated option included for POSIX compliance.
                   2114: .TP
                   2115: .B \-d
                   2116: makes the generated scanner run in
                   2117: .I debug
                   2118: mode.  Whenever a pattern is recognized and the global
                   2119: .B yy_flex_debug
                   2120: is non-zero (which is the default),
                   2121: the scanner will write to
                   2122: .I stderr
                   2123: a line of the form:
                   2124: .nf
                   2125:
                   2126:     --accepting rule at line 53 ("the matched text")
                   2127:
                   2128: .fi
                   2129: The line number refers to the location of the rule in the file
                   2130: defining the scanner (i.e., the file that was fed to flex).  Messages
                   2131: are also generated when the scanner backs up, accepts the
                   2132: default rule, reaches the end of its input buffer (or encounters
                   2133: a NUL; at this point, the two look the same as far as the scanner's concerned),
                   2134: or reaches an end-of-file.
                   2135: .TP
                   2136: .B \-f
                   2137: specifies
                   2138: .I fast scanner.
                   2139: No table compression is done and stdio is bypassed.
                   2140: The result is large but fast.  This option is equivalent to
                   2141: .B \-Cfr
                   2142: (see below).
                   2143: .TP
                   2144: .B \-h
                   2145: generates a "help" summary of
                   2146: .I flex's
                   2147: options to
1.7       aaron    2148: .I stdout
1.1       deraadt  2149: and then exits.
                   2150: .B \-?
                   2151: and
                   2152: .B \-\-help
                   2153: are synonyms for
                   2154: .B \-h.
                   2155: .TP
                   2156: .B \-i
                   2157: instructs
                   2158: .I flex
                   2159: to generate a
                   2160: .I case-insensitive
                   2161: scanner.  The case of letters given in the
                   2162: .I flex
                   2163: input patterns will
                   2164: be ignored, and tokens in the input will be matched regardless of case.  The
                   2165: matched text given in
                   2166: .I yytext
                   2167: will have the preserved case (i.e., it will not be folded).
                   2168: .TP
                   2169: .B \-l
                   2170: turns on maximum compatibility with the original AT&T
                   2171: .I lex
                   2172: implementation.  Note that this does not mean
                   2173: .I full
                   2174: compatibility.  Use of this option costs a considerable amount of
                   2175: performance, and it cannot be used with the
                   2176: .B \-+, \-f, \-F, \-Cf,
                   2177: or
                   2178: .B \-CF
                   2179: options.  For details on the compatibilities it provides, see the section
                   2180: "Incompatibilities With Lex And POSIX" below.  This option also results
                   2181: in the name
                   2182: .B YY_FLEX_LEX_COMPAT
                   2183: being #define'd in the generated scanner.
                   2184: .TP
                   2185: .B \-n
                   2186: is another do-nothing, deprecated option included only for
                   2187: POSIX compliance.
                   2188: .TP
                   2189: .B \-p
                   2190: generates a performance report to stderr.  The report
                   2191: consists of comments regarding features of the
                   2192: .I flex
                   2193: input file which will cause a serious loss of performance in the resulting
                   2194: scanner.  If you give the flag twice, you will also get comments regarding
                   2195: features that lead to minor performance losses.
                   2196: .IP
                   2197: Note that the use of
                   2198: .B REJECT,
                   2199: .B %option yylineno,
                   2200: and variable trailing context (see the Deficiencies / Bugs section below)
                   2201: entails a substantial performance penalty; use of
                   2202: .I yymore(),
                   2203: the
                   2204: .B ^
                   2205: operator,
                   2206: and the
                   2207: .B \-I
                   2208: flag entail minor performance penalties.
                   2209: .TP
                   2210: .B \-s
                   2211: causes the
                   2212: .I default rule
                   2213: (that unmatched scanner input is echoed to
                   2214: .I stdout)
                   2215: to be suppressed.  If the scanner encounters input that does not
                   2216: match any of its rules, it aborts with an error.  This option is
                   2217: useful for finding holes in a scanner's rule set.
                   2218: .TP
                   2219: .B \-t
                   2220: instructs
                   2221: .I flex
                   2222: to write the scanner it generates to standard output instead
                   2223: of
                   2224: .B lex.yy.c.
                   2225: .TP
                   2226: .B \-v
                   2227: specifies that
                   2228: .I flex
                   2229: should write to
                   2230: .I stderr
                   2231: a summary of statistics regarding the scanner it generates.
                   2232: Most of the statistics are meaningless to the casual
                   2233: .I flex
                   2234: user, but the first line identifies the version of
                   2235: .I flex
                   2236: (same as reported by
                   2237: .B \-V),
                   2238: and the next line the flags used when generating the scanner, including
                   2239: those that are on by default.
                   2240: .TP
                   2241: .B \-w
                   2242: suppresses warning messages.
                   2243: .TP
                   2244: .B \-B
                   2245: instructs
                   2246: .I flex
                   2247: to generate a
                   2248: .I batch
                   2249: scanner, the opposite of
                   2250: .I interactive
                   2251: scanners generated by
                   2252: .B \-I
                   2253: (see below).  In general, you use
                   2254: .B \-B
                   2255: when you are
                   2256: .I certain
                   2257: that your scanner will never be used interactively, and you want to
                   2258: squeeze a
                   2259: .I little
                   2260: more performance out of it.  If your goal is instead to squeeze out a
                   2261: .I lot
                   2262: more performance, you should be using the
                   2263: .B \-Cf
                   2264: or
                   2265: .B \-CF
                   2266: options (discussed below), which turn on
                   2267: .B \-B
                   2268: automatically anyway.
                   2269: .TP
                   2270: .B \-F
                   2271: specifies that the
                   2272: .ul
                   2273: fast
                   2274: scanner table representation should be used (and stdio
                   2275: bypassed).  This representation is
                   2276: about as fast as the full table representation
                   2277: .B (\-f),
                   2278: and for some sets of patterns will be considerably smaller (and for
                   2279: others, larger).  In general, if the pattern set contains both "keywords"
                   2280: and a catch-all, "identifier" rule, such as in the set:
                   2281: .nf
                   2282:
                   2283:     "case"    return TOK_CASE;
                   2284:     "switch"  return TOK_SWITCH;
                   2285:     ...
                   2286:     "default" return TOK_DEFAULT;
                   2287:     [a-z]+    return TOK_ID;
                   2288:
                   2289: .fi
                   2290: then you're better off using the full table representation.  If only
                   2291: the "identifier" rule is present and you then use a hash table or some such
                   2292: to detect the keywords, you're better off using
                   2293: .B \-F.
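.IP
A minimal sketch of detecting the keywords outside the scanner is given
below.
It uses a sorted table and binary search rather than a hash table.
The token values and the names
.I keywords
and
.I classify_identifier
are hypothetical; with yacc, the token values would come from y.tab.h.
.nf

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; with yacc they would come from y.tab.h. */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, since bsearch() is used below. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

static int
keyword_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key,
        ((const struct keyword *)elem)->name);
}

/*
 * Classify text matched by the identifier rule: return the keyword's
 * token if it is one, TOK_ID otherwise.
 */
static int
classify_identifier(const char *text)
{
    const struct keyword *kw;

    kw = bsearch(text, keywords,
        sizeof(keywords) / sizeof(keywords[0]),
        sizeof(keywords[0]), keyword_cmp);
    return kw != NULL ? kw->token : TOK_ID;
}

/*
 * The scanner then needs only the catch-all rule:
 *
 *     [a-z]+    return classify_identifier( yytext );
 */
```

.fi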
                   2294: .IP
                   2295: This option is equivalent to
                   2296: .B \-CFr
                   2297: (see below).  It cannot be used with
                   2298: .B \-+.
                   2299: .TP
                   2300: .B \-I
                   2301: instructs
                   2302: .I flex
                   2303: to generate an
                   2304: .I interactive
                   2305: scanner.  An interactive scanner is one that only looks ahead to decide
                   2306: what token has been matched if it absolutely must.  It turns out that
                   2307: always looking one extra character ahead, even if the scanner has already
                   2308: seen enough text to disambiguate the current token, is a bit faster than
                   2309: only looking ahead when necessary.  But scanners that always look ahead
                   2310: give dreadful interactive performance; for example, when a user types
                   2311: a newline, it is not recognized as a newline token until they enter
                   2312: .I another
                   2313: token, which often means typing in another whole line.
                   2314: .IP
                   2315: .I Flex
                   2316: scanners default to
                   2317: .I interactive
                   2318: unless you use the
                   2319: .B \-Cf
                   2320: or
                   2321: .B \-CF
                   2322: table-compression options (see below).  That's because if you're looking
                   2323: for high performance, you should be using one of these options, so if you
                   2324: didn't,
                   2325: .I flex
                   2326: assumes you'd rather trade off a bit of run-time performance for intuitive
                   2327: interactive behavior.  Note also that you
                   2328: .I cannot
                   2329: use
                   2330: .B \-I
                   2331: in conjunction with
                   2332: .B \-Cf
                   2333: or
                   2334: .B \-CF.
                   2335: Thus, this option is not really needed; it is on by default for all those
                   2336: cases in which it is allowed.
                   2337: .IP
                   2338: You can force a scanner to
                   2339: .I not
                   2340: be interactive by using
                   2341: .B \-B
                   2342: (see above).
                   2343: .TP
                   2344: .B \-L
                   2345: instructs
                   2346: .I flex
                   2347: not to generate
                   2348: .B #line
                   2349: directives.  Without this option,
                   2350: .I flex
                   2351: peppers the generated scanner
                   2352: with #line directives so error messages in the actions will be correctly
                   2353: located with respect to either the original
                   2354: .I flex
                   2355: input file (if the errors are due to code in the input file), or
                   2356: .B lex.yy.c
                   2357: (if the errors are
                   2358: .I flex's
                   2359: fault -- you should report these sorts of errors to the email address
                   2360: given below).
                   2361: .TP
                   2362: .B \-T
                   2363: makes
                   2364: .I flex
                   2365: run in
                   2366: .I trace
                   2367: mode.  It will generate a lot of messages to
                   2368: .I stderr
                   2369: concerning
                   2370: the form of the input and the resultant non-deterministic and deterministic
                   2371: finite automata.  This option is mostly for use in maintaining
                   2372: .I flex.
                   2373: .TP
                   2374: .B \-V
                   2375: prints the version number to
                   2376: .I stdout
                   2377: and exits.
                   2378: .B \-\-version
                   2379: is a synonym for
                   2380: .B \-V.
                   2381: .TP
                   2382: .B \-7
                   2383: instructs
                   2384: .I flex
                   2385: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
                   2386: characters in its input.  The advantage of using
                   2387: .B \-7
                   2388: is that the scanner's tables can be up to half the size of those generated
                   2389: using the
                   2390: .B \-8
                   2391: option (see below).  The disadvantage is that such scanners often hang
                   2392: or crash if their input contains an 8-bit character.
                   2393: .IP
                   2394: Note, however, that unless you generate your scanner using the
                   2395: .B \-Cf
                   2396: or
                   2397: .B \-CF
                   2398: table compression options, use of
                   2399: .B \-7
                   2400: will save only a small amount of table space, and make your scanner
                   2401: considerably less portable.
                   2402: .I Flex's
                   2403: default behavior is to generate an 8-bit scanner unless you use the
                   2404: .B \-Cf
                   2405: or
                   2406: .B \-CF
                   2407: options, in which case
                   2408: .I flex
                   2409: defaults to generating 7-bit scanners unless your site was always
                   2410: configured to generate 8-bit scanners (as will often be the case
                   2411: with non-USA sites).  You can tell whether flex generated a 7-bit
                   2412: or an 8-bit scanner by inspecting the flag summary in the
                   2413: .B \-v
                   2414: output as described above.
                   2415: .IP
                   2416: Note that if you use
                   2417: .B \-Cfe
                   2418: or
                   2419: .B \-CFe
                   2420: (those table compression options, but also using equivalence classes as
                   2421: discussed below), flex still defaults to generating an 8-bit
                   2422: scanner, since usually with these compression options full 8-bit tables
                   2423: are not much more expensive than 7-bit tables.
                   2424: .TP
                   2425: .B \-8
                   2426: instructs
                   2427: .I flex
                   2428: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
                   2429: characters.  This flag is only needed for scanners generated using
                   2430: .B \-Cf
                   2431: or
                   2432: .B \-CF,
                   2433: as otherwise flex defaults to generating an 8-bit scanner anyway.
                   2434: .IP
                   2435: See the discussion of
                   2436: .B \-7
                   2437: above for flex's default behavior and the tradeoffs between 7-bit
                   2438: and 8-bit scanners.
                   2439: .TP
                   2440: .B \-+
                   2441: specifies that you want flex to generate a C++
                   2442: scanner class.  See the section on Generating C++ Scanners below for
                   2443: details.
1.7       aaron    2444: .TP
1.1       deraadt  2445: .B \-C[aefFmr]
                   2446: controls the degree of table compression and, more generally, trade-offs
                   2447: between small scanners and fast scanners.
                   2448: .IP
                   2449: .B \-Ca
                   2450: ("align") instructs flex to accept larger tables in the
                   2451: generated scanner in exchange for faster performance, because the elements of
                   2452: the tables are better aligned for memory access and computation.  On some
                   2453: RISC architectures, fetching and manipulating longwords is more efficient
                   2454: than with smaller-sized units such as shortwords.  This option can
                   2455: double the size of the tables used by your scanner.
                   2456: .IP
                   2457: .B \-Ce
                   2458: directs
                   2459: .I flex
                   2460: to construct
                   2461: .I equivalence classes,
                   2462: i.e., sets of characters
                   2463: which have identical lexical properties (for example, if the only
                   2464: appearance of digits in the
                   2465: .I flex
                   2466: input is in the character class
                   2467: "[0-9]" then the digits '0', '1', ..., '9' will all be put
                   2468: in the same equivalence class).  Equivalence classes usually give
                   2469: dramatic reductions in the final table/object file sizes (typically
                   2470: a factor of 2-5) and are pretty cheap performance-wise (one array
                   2471: look-up per character scanned).
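.IP
The idea can be sketched as follows.
This is a simplified illustration and not the algorithm flex itself
uses: two characters are placed in the same equivalence class when they
appear in exactly the same character classes.
The function name and its interface are assumptions made for the example;
a literal character in a pattern would be passed as a one-member set.
.nf

```c
#include <string.h>

#define NCHARS 256
#define MAX_SETS 32    /* one signature bit per character class */

/*
 * Simplified sketch, not flex's own algorithm.  sets[] holds each
 * character class as a string of its members.  Fills ec[] with a
 * class number (starting at 1) for every character and returns the
 * number of classes, or -1 if nsets is out of range.
 */
static int
build_equiv_classes(const char *const sets[], int nsets, int ec[NCHARS])
{
    unsigned long sig[NCHARS];
    int c, d, s, nclasses = 0;
    const char *p;

    if (nsets < 0 || nsets > MAX_SETS)
        return -1;

    /* Signature: bit s is set when the character is in sets[s]. */
    memset(sig, 0, sizeof(sig));
    for (s = 0; s < nsets; s++)
        for (p = sets[s]; *p != 0; p++)
            sig[(unsigned char)*p] |= 1UL << s;

    /* Characters with equal signatures get the same class number. */
    for (c = 0; c < NCHARS; c++) {
        ec[c] = 0;
        for (d = 0; d < c; d++)
            if (sig[d] == sig[c]) {
                ec[c] = ec[d];
                break;
            }
        if (ec[c] == 0)
            ec[c] = ++nclasses;
    }
    return nclasses;
}
```

.fi
A table indexed by class number instead of by character then needs one
column per class rather than one per character, at the cost of the
single look-up in ec[] described above.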
                   2472: .IP
                   2473: .B \-Cf
                   2474: specifies that the
                   2475: .I full
                   2476: scanner tables should be generated -
                   2477: .I flex
                   2478: should not compress the
1.10      deraadt  2479: tables by taking advantage of similar transition functions for
1.1       deraadt  2480: different states.
                   2481: .IP
                   2482: .B \-CF
                   2483: specifies that the alternate fast scanner representation (described
                   2484: above under the
                   2485: .B \-F
                   2486: flag)
                   2487: should be used.  This option cannot be used with
                   2488: .B \-+.
                   2489: .IP
                   2490: .B \-Cm
                   2491: directs
                   2492: .I flex
                   2493: to construct
                   2494: .I meta-equivalence classes,
                   2495: which are sets of equivalence classes (or characters, if equivalence
                   2496: classes are not being used) that are commonly used together.  Meta-equivalence
                   2497: classes are often a big win when using compressed tables, but they
                   2498: have a moderate performance impact (one or two "if" tests and one
                   2499: array look-up per character scanned).
                   2500: .IP
                   2501: .B \-Cr
                   2502: causes the generated scanner to
                   2503: .I bypass
                   2504: use of the standard I/O library (stdio) for input.  Instead of calling
                   2505: .B fread()
                   2506: or
                   2507: .B getc(),
                   2508: the scanner will use the
                   2509: .B read()
                   2510: system call, resulting in a performance gain which varies from system
                   2511: to system, but in general is probably negligible unless you are also using
                   2512: .B \-Cf
                   2513: or
                   2514: .B \-CF.
                   2515: Using
                   2516: .B \-Cr
                   2517: can cause strange behavior if, for example, you read from
                   2518: .I yyin
                   2519: using stdio prior to calling the scanner (because the scanner will miss
                   2520: whatever text your previous reads left in the stdio input buffer).
                   2521: .IP
                   2522: .B \-Cr
                   2523: has no effect if you define
                   2524: .B YY_INPUT
                   2525: (see The Generated Scanner above).
                   2526: .IP
                   2527: A lone
                   2528: .B \-C
                   2529: specifies that the scanner tables should be compressed but neither
                   2530: equivalence classes nor meta-equivalence classes should be used.
                   2531: .IP
                   2532: The options
                   2533: .B \-Cf
                   2534: or
                   2535: .B \-CF
                   2536: and
                   2537: .B \-Cm
                   2538: do not make sense together - there is no opportunity for meta-equivalence
                   2539: classes if the table is not being compressed.  Otherwise the options
                   2540: may be freely mixed, and are cumulative.
                   2541: .IP
                   2542: The default setting is
                   2543: .B \-Cem,
                   2544: which specifies that
                   2545: .I flex
                   2546: should generate equivalence classes
                   2547: and meta-equivalence classes.  This setting provides the highest
                   2548: degree of table compression.  You can trade
                   2549: larger tables for faster-executing scanners, with
                   2550: the following generally being true:
                   2551: .nf
                   2552:
                   2553:     slowest & smallest
                   2554:           -Cem
                   2555:           -Cm
                   2556:           -Ce
                   2557:           -C
                   2558:           -C{f,F}e
                   2559:           -C{f,F}
                   2560:           -C{f,F}a
                   2561:     fastest & largest
                   2562:
                   2563: .fi
                   2564: Note that scanners with the smallest tables are usually generated and
                   2565: compiled the quickest, so
                   2566: during development you will usually want to use the default, maximal
                   2567: compression.
                   2568: .IP
                   2569: .B \-Cfe
                   2570: is often a good compromise between speed and size for production
                   2571: scanners.
                   2572: .TP
                   2573: .B \-ooutput
                   2574: directs flex to write the scanner to the file
                   2575: .B output
                   2576: instead of
                   2577: .B lex.yy.c.
                   2578: If you combine
                   2579: .B \-o
                   2580: with the
                   2581: .B \-t
                   2582: option, then the scanner is written to
                   2583: .I stdout
                   2584: but its
                   2585: .B #line
                   2586: directives (see the
                   2587: .B \-L
                   2588: option above) refer to the file
                   2589: .B output.
                   2590: .TP
                   2591: .B \-Pprefix
                   2592: changes the default
                   2593: .I "yy"
                   2594: prefix used by
                   2595: .I flex
1.6       aaron    2596: for all globally visible variable and function names to instead be
1.1       deraadt  2597: .I prefix.
                   2598: For example,
                   2599: .B \-Pfoo
                   2600: changes the name of
                   2601: .B yytext
                   2602: to
                   2603: .B footext.
                   2604: It also changes the name of the default output file from
                   2605: .B lex.yy.c
                   2606: to
                   2607: .B lex.foo.c.
                   2608: Here are all of the names affected:
                   2609: .nf
                   2610:
                   2611:     yy_create_buffer
                   2612:     yy_delete_buffer
                   2613:     yy_flex_debug
                   2614:     yy_init_buffer
                   2615:     yy_flush_buffer
                   2616:     yy_load_buffer_state
                   2617:     yy_switch_to_buffer
                   2618:     yyin
                   2619:     yyleng
                   2620:     yylex
                   2621:     yylineno
                   2622:     yyout
                   2623:     yyrestart
                   2624:     yytext
                   2625:     yywrap
                   2626:
                   2627: .fi
                   2628: (If you are using a C++ scanner, then only
                   2629: .B yywrap
                   2630: and
                   2631: .B yyFlexLexer
                   2632: are affected.)
                   2633: Within your scanner itself, you can still refer to the global variables
                   2634: and functions using either version of their name; but externally, they
                   2635: have the modified name.
                   2636: .IP
                   2637: This option lets you easily link together multiple
                   2638: .I flex
                   2639: programs into the same executable.  Note, though, that using this
                   2640: option also renames
                   2641: .B yywrap(),
                   2642: so you now
                   2643: .I must
                   2644: either
1.6       aaron    2645: provide your own (appropriately named) version of the routine for your
1.1       deraadt  2646: scanner, or use
                   2647: .B %option noyywrap,
                   2648: as linking with
                   2649: .B \-lfl
                   2650: no longer provides one for you by default.
                   2651: .TP
                   2652: .B \-Sskeleton_file
                   2653: overrides the default skeleton file from which
                   2654: .I flex
                   2655: constructs its scanners.  You'll never need this option unless you are doing
                   2656: .I flex
                   2657: maintenance or development.
                   2658: .PP
                   2659: .I flex
                   2660: also provides a mechanism for controlling options within the
                   2661: scanner specification itself, rather than from the flex command-line.
                   2662: This is done by including
                   2663: .B %option
                   2664: directives in the first section of the scanner specification.
                   2665: You can specify multiple options with a single
                   2666: .B %option
                   2667: directive, and multiple directives in the first section of your flex input
                   2668: file.
                   2669: .PP
                   2670: Most options are given simply as names, optionally preceded by the
                   2671: word "no" (with no intervening whitespace) to negate their meaning.
                   2672: A number are equivalent to flex flags or their negation:
                   2673: .nf
                   2674:
                   2675:     7bit            -7 option
                   2676:     8bit            -8 option
                   2677:     align           -Ca option
                   2678:     backup          -b option
                   2679:     batch           -B option
                   2680:     c++             -+ option
                   2681:
                   2682:     caseful or
                   2683:     case-sensitive  opposite of -i (default)
                   2684:
                   2685:     case-insensitive or
                   2686:     caseless        -i option
                   2687:
                   2688:     debug           -d option
                   2689:     default         opposite of -s option
                   2690:     ecs             -Ce option
                   2691:     fast            -F option
                   2692:     full            -f option
                   2693:     interactive     -I option
                   2694:     lex-compat      -l option
                   2695:     meta-ecs        -Cm option
                   2696:     perf-report     -p option
                   2697:     read            -Cr option
                   2698:     stdout          -t option
                   2699:     verbose         -v option
                   2700:     warn            opposite of -w option
                   2701:                     (use "%option nowarn" for -w)
                   2702:
                   2703:     array           equivalent to "%array"
                   2704:     pointer         equivalent to "%pointer" (default)
                   2705:
                   2706: .fi
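                         .PP
                         As a small sketch of the syntax (this particular combination of options
                         and the two rules are only an illustration), a single directive may carry
                         several options, and a specification may contain several directives:
                         .nf

                             %option caseless debug
                             %option noyywrap
                             %%
                             hello      printf("greeting\\n");
                             .|\\n       /* discard everything else */

                         .fi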
                   2707: Some
                   2708: .B %option's
                   2709: provide features otherwise not available:
                   2710: .TP
                   2711: .B always-interactive
                   2712: instructs flex to generate a scanner which always considers its input
                   2713: "interactive".  Normally, on each new input file the scanner calls
                   2714: .B isatty()
                   2715: in an attempt to determine whether
                   2716: the scanner's input source is interactive and thus should be read a
                   2717: character at a time.  When this option is used, however, then no
                   2718: such call is made.
                   2719: .TP
                   2720: .B main
                   2721: directs flex to provide a default
                   2722: .B main()
                   2723: program for the scanner, which simply calls
                   2724: .B yylex().
                   2725: This option implies
                   2726: .B noyywrap
                   2727: (see below).
                   2728: .TP
                   2729: .B never-interactive
                   2730: instructs flex to generate a scanner which never considers its input
                   2731: "interactive" (again, no call made to
                   2732: .B isatty()).
                   2733: This is the opposite of
                   2734: .B always-interactive.
                   2735: .TP
                   2736: .B stack
                   2737: enables the use of start condition stacks (see Start Conditions above).
                   2738: .TP
                   2739: .B stdinit
                   2740: if set (i.e.,
                   2741: .B %option stdinit)
                   2742: initializes
                   2743: .I yyin
                   2744: and
                   2745: .I yyout
                   2746: to
                   2747: .I stdin
                   2748: and
                   2749: .I stdout,
                   2750: instead of the default of
                   2751: .I nil.
                   2752: Some existing
                   2753: .I lex
                   2754: programs depend on this behavior, even though it is not compliant with
                   2755: ANSI C, which does not require
                   2756: .I stdin
                   2757: and
                   2758: .I stdout
                   2759: to be compile-time constant.
                   2760: .TP
                   2761: .B yylineno
                   2762: directs
                   2763: .I flex
                   2764: to generate a scanner that maintains the number of the current line
                   2765: read from its input in the global variable
                   2766: .B yylineno.
                   2767: This option is implied by
                   2768: .B %option lex-compat.
                   2769: .TP
                   2770: .B yywrap
                   2771: if unset (i.e.,
                   2772: .B %option noyywrap),
                   2773: makes the scanner not call
                   2774: .B yywrap()
                   2775: upon an end-of-file, but simply assume that there are no more
                   2776: files to scan (until the user points
                   2777: .I yyin
                   2778: at a new file and calls
                   2779: .B yylex()
                   2780: again).
                   2781: .PP
                   2782: .I flex
                   2783: scans your rule actions to determine whether you use the
                   2784: .B REJECT
                   2785: or
                   2786: .B yymore()
                   2787: features.  The
                   2788: .B reject
                   2789: and
                   2790: .B yymore
                   2791: options are available to override its decision as to whether you use the
                   2792: features, either by setting them (e.g.,
                   2793: .B %option reject)
                   2794: to indicate the feature is indeed used, or
                   2795: unsetting them to indicate it actually is not used
                   2796: (e.g.,
                   2797: .B %option noyymore).
                   2798: .PP
                   2799: Three options take string-delimited values, offset with '=':
                   2800: .nf
                   2801:
                   2802:     %option outfile="ABC"
                   2803:
                   2804: .fi
                   2805: is equivalent to
                   2806: .B -oABC,
                   2807: and
                   2808: .nf
                   2809:
                   2810:     %option prefix="XYZ"
                   2811:
                   2812: .fi
                   2813: is equivalent to
                   2814: .B -PXYZ.
                   2815: Finally,
                   2816: .nf
                   2817:
                   2818:     %option yyclass="foo"
                   2819:
                   2820: .fi
                   2821: only applies when generating a C++ scanner (
                   2822: .B \-+
                   2823: option).  It informs
                   2824: .I flex
                   2825: that you have derived
                   2826: .B foo
                   2827: as a subclass of
                   2828: .B yyFlexLexer,
                   2829: so
                   2830: .I flex
                   2831: will place your actions in the member function
                   2832: .B foo::yylex()
                   2833: instead of
                   2834: .B yyFlexLexer::yylex().
                   2835: It also generates a
                   2836: .B yyFlexLexer::yylex()
                   2837: member function that emits a run-time error (by invoking
                   2838: .B yyFlexLexer::LexerError())
                   2839: if called.
                   2840: See Generating C++ Scanners, below, for additional information.
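                         .PP
                         For instance, a scanner meant to be linked into the same program as
                         another scanner might begin as sketched below.  The prefix "foo" and the
                         rules are purely illustrative.  Code outside this file would refer to
                         foolex(), fooin and so on, while the actions themselves may keep using
                         the yy names; noyywrap is set so that no foowrap() has to be supplied.
                         .nf

                             %option prefix="foo"
                             %option noyywrap
                             %%
                             [0-9]+     printf("number: %s\\n", yytext);
                             .|\\n       /* ignore everything else */

                         .fi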
                   2841: .PP
                   2842: A number of options are available for lint purists who want to suppress
                   2843: the appearance of unneeded routines in the generated scanner.  Each of the
                   2844: following, if unset
                   2845: (e.g.,
                   2846: .B %option nounput),
                   2847: results in the corresponding routine not appearing in
                   2848: the generated scanner:
                   2849: .nf
                   2850:
                   2851:     input, unput
                   2852:     yy_push_state, yy_pop_state, yy_top_state
                   2853:     yy_scan_buffer, yy_scan_bytes, yy_scan_string
                   2854:
                   2855: .fi
                   2856: (though
                   2857: .B yy_push_state()
                   2858: and friends won't appear anyway unless you use
                   2859: .B %option stack).
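                         .PP
                         As a sketch, a scanner whose actions never call input() or unput()
                         might declare:
                         .nf

                             %option noinput nounput

                         .fi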
                   2860: .SH PERFORMANCE CONSIDERATIONS
                   2861: The main design goal of
                   2862: .I flex
                   2863: is that it generate high-performance scanners.  It has been optimized
                   2864: for dealing well with large sets of rules.  Aside from the effects on
                   2865: scanner speed of the table compression
                   2866: .B \-C
                   2867: options outlined above,
                   2868: there are a number of options/actions which degrade performance.  These
                   2869: are, from most expensive to least:
                   2870: .nf
                   2871:
                   2872:     REJECT
                   2873:     %option yylineno
                   2874:     arbitrary trailing context
                   2875:
                   2876:     pattern sets that require backing up
                   2877:     %array
                   2878:     %option interactive
                   2879:     %option always-interactive
                   2880:
                   2881:     '^' beginning-of-line operator
                   2882:     yymore()
                   2883:
                   2884: .fi
                   2885: with the first three all being quite expensive and the last two
                   2886: being quite cheap.  Note also that
                   2887: .B unput()
                   2888: is implemented as a routine call that potentially does quite a bit of
                   2889: work, while
                   2890: .B yyless()
                   2891: is a quite-cheap macro; so if just putting back some excess text you
                   2892: scanned, use
                   2893: .B yyless().
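                         .PP
                         As a brief illustration (the patterns here are arbitrary), the first
                         action below keeps only the first three characters of its match and
                         returns the remainder to the input, where it is rescanned:
                         .nf

                             %%
                             foobar    ECHO; yyless(3);
                             [a-z]+    ECHO;

                         .fi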
                   2894: .PP
                   2895: .B REJECT
                   2896: should be avoided at all costs when performance is important.
                   2897: It is a particularly expensive option.
                   2898: .PP
                   2899: Getting rid of backing up is messy and often may be an enormous
                   2900: amount of work for a complicated scanner.  In principle, one begins
                   2901: by using the
1.7       aaron    2902: .B \-b
1.1       deraadt  2903: flag to generate a
                   2904: .I lex.backup
                   2905: file.  For example, on the input
                   2906: .nf
                   2907:
                   2908:     %%
                   2909:     foo        return TOK_KEYWORD;
                   2910:     foobar     return TOK_KEYWORD;
                   2911:
                   2912: .fi
                   2913: the file looks like:
                   2914: .nf
                   2915:
                   2916:     State #6 is non-accepting -
                   2917:      associated rule line numbers:
                   2918:            2       3
                   2919:      out-transitions: [ o ]
                   2920:      jam-transitions: EOF [ \\001-n  p-\\177 ]
                   2921:
                   2922:     State #8 is non-accepting -
                   2923:      associated rule line numbers:
                   2924:            3
                   2925:      out-transitions: [ a ]
                   2926:      jam-transitions: EOF [ \\001-`  b-\\177 ]
                   2927:
                   2928:     State #9 is non-accepting -
                   2929:      associated rule line numbers:
                   2930:            3
                   2931:      out-transitions: [ r ]
                   2932:      jam-transitions: EOF [ \\001-q  s-\\177 ]
                   2933:
                   2934:     Compressed tables always back up.
                   2935:
                   2936: .fi
                   2937: The first few lines tell us that there's a scanner state in
                   2938: which it can make a transition on an 'o' but not on any other
                   2939: character, and that in that state the currently scanned text does not match
                   2940: any rule.  The state occurs when trying to match the rules found
                   2941: at lines 2 and 3 in the input file.
                   2942: If the scanner is in that state and then reads
                   2943: something other than an 'o', it will have to back up to find
                   2944: a rule which is matched.  With
                   2945: a bit of headscratching one can see that this must be the
                   2946: state it's in when it has seen "fo".  When this has happened,
                   2947: if anything other than another 'o' is seen, the scanner will
                   2948: have to back up to simply match the 'f' (by the default rule).
                   2949: .PP
                   2950: The comment regarding State #8 indicates there's a problem
                   2951: when "foob" has been scanned.  Indeed, on any character other
                   2952: than an 'a', the scanner will have to back up to accept "foo".
                   2953: Similarly, the comment for State #9 concerns when "fooba" has
                   2954: been scanned and an 'r' does not follow.
                   2955: .PP
                   2956: The final comment reminds us that there's no point going to
                   2957: all the trouble of removing backing up from the rules unless
                   2958: we're using
                   2959: .B \-Cf
                   2960: or
                   2961: .B \-CF,
                   2962: since there's no performance gain doing so with compressed scanners.
                   2963: .PP
                   2964: The way to remove the backing up is to add "error" rules:
                   2965: .nf
                   2966:
                   2967:     %%
                   2968:     foo         return TOK_KEYWORD;
                   2969:     foobar      return TOK_KEYWORD;
                   2970:
                   2971:     fooba       |
                   2972:     foob        |
                   2973:     fo          {
                   2974:                 /* false alarm, not really a keyword */
                   2975:                 return TOK_ID;
                   2976:                 }
                   2977:
                   2978: .fi
                   2979: .PP
                   2980: Eliminating backing up among a list of keywords can also be
                   2981: done using a "catch-all" rule:
                   2982: .nf
                   2983:
                   2984:     %%
                   2985:     foo         return TOK_KEYWORD;
                   2986:     foobar      return TOK_KEYWORD;
                   2987:
                   2988:     [a-z]+      return TOK_ID;
                   2989:
                   2990: .fi
                   2991: This is usually the best solution when appropriate.
                   2992: .PP
                   2993: Backing up messages tend to cascade.
                   2994: With a complicated set of rules it's not uncommon to get hundreds
                   2995: of messages.  If one can decipher them, though, it often
                   2996: only takes a dozen or so rules to eliminate the backing up (though
                   2997: it's easy to make a mistake and have an error rule accidentally match
                   2998: a valid token).  A possible future
                   2999: .I flex
                   3000: feature will be to automatically add rules to eliminate backing up.
                   3001: .PP
                   3002: It's important to keep in mind that you gain the benefits of eliminating
                   3003: backing up only if you eliminate
                   3004: .I every
                   3005: instance of backing up.  Leaving just one means you gain nothing.
                   3006: .PP
                   3007: .I Variable
                   3008: trailing context (where both the leading and trailing parts do not have
                   3009: a fixed length) entails almost the same performance loss as
                   3010: .B REJECT
                   3011: (i.e., substantial).  So when possible a rule like:
                   3012: .nf
                   3013:
                   3014:     %%
                   3015:     mouse|rat/(cat|dog)   run();
                   3016:
                   3017: .fi
                   3018: is better written:
                   3019: .nf
                   3020:
                   3021:     %%
                   3022:     mouse/cat|dog         run();
                   3023:     rat/cat|dog           run();
                   3024:
                   3025: .fi
                   3026: or as
                   3027: .nf
                   3028:
                   3029:     %%
                   3030:     mouse|rat/cat         run();
                   3031:     mouse|rat/dog         run();
                   3032:
                   3033: .fi
                   3034: Note that here the special '|' action does
                   3035: .I not
                   3036: provide any savings, and can even make things worse (see
                   3037: Deficiencies / Bugs below).
                   3038: .LP
                   3039: Another area where the user can increase a scanner's performance
                   3040: (and one that's easier to implement) arises from the fact that
                   3041: the longer the tokens matched, the faster the scanner will run.
                   3042: This is because with long tokens the processing of most input
                   3043: characters takes place in the (short) inner scanning loop, and
                   3044: does not often have to go through the additional work of setting up
                   3045: the scanning environment (e.g.,
                   3046: .B yytext)
                   3047: for the action.  Recall the scanner for C comments:
                   3048: .nf
                   3049:
                   3050:     %x comment
                   3051:     %%
                   3052:             int line_num = 1;
                   3053:
                   3054:     "/*"         BEGIN(comment);
                   3055:
                   3056:     <comment>[^*\\n]*
                   3057:     <comment>"*"+[^*/\\n]*
                   3058:     <comment>\\n             ++line_num;
                   3059:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3060:
                   3061: .fi
                   3062: This could be sped up by writing it as:
                   3063: .nf
                   3064:
                   3065:     %x comment
                   3066:     %%
                   3067:             int line_num = 1;
                   3068:
                   3069:     "/*"         BEGIN(comment);
                   3070:
                   3071:     <comment>[^*\\n]*
                   3072:     <comment>[^*\\n]*\\n      ++line_num;
                   3073:     <comment>"*"+[^*/\\n]*
                   3074:     <comment>"*"+[^*/\\n]*\\n ++line_num;
                   3075:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3076:
                   3077: .fi
                   3078: Now instead of each newline requiring the processing of another
                   3079: action, recognizing the newlines is "distributed" over the other rules
                   3080: to keep the matched text as long as possible.  Note that
                   3081: .I adding
                   3082: rules does
                   3083: .I not
                   3084: slow down the scanner!  The speed of the scanner is independent
                   3085: of the number of rules or (modulo the considerations given at the
                   3086: beginning of this section) how complicated the rules are with
                   3087: regard to operators such as '*' and '|'.
                   3088: .PP
                   3089: A final example in speeding up a scanner: suppose you want to scan
                   3090: through a file containing identifiers and keywords, one per line
                   3091: and with no other extraneous characters, and recognize all the
                   3092: keywords.  A natural first approach is:
                   3093: .nf
                   3094:
                   3095:     %%
                   3096:     asm      |
                   3097:     auto     |
                   3098:     break    |
                   3099:     ... etc ...
                   3100:     volatile |
                   3101:     while    /* it's a keyword */
                   3102:
                   3103:     .|\\n     /* it's not a keyword */
                   3104:
                   3105: .fi
                   3106: To eliminate the backing up, introduce a catch-all rule:
                   3107: .nf
                   3108:
                   3109:     %%
                   3110:     asm      |
                   3111:     auto     |
                   3112:     break    |
                   3113:     ... etc ...
                   3114:     volatile |
                   3115:     while    /* it's a keyword */
                   3116:
                   3117:     [a-z]+   |
                   3118:     .|\\n     /* it's not a keyword */
                   3119:
                   3120: .fi
                   3121: Now, if it's guaranteed that there's exactly one word per line,
                   3122: then we can reduce the total number of matches by a half by
                   3123: merging in the recognition of newlines with that of the other
                   3124: tokens:
                   3125: .nf
                   3126:
                   3127:     %%
                   3128:     asm\\n    |
                   3129:     auto\\n   |
                   3130:     break\\n  |
                   3131:     ... etc ...
                   3132:     volatile\\n |
                   3133:     while\\n  /* it's a keyword */
                   3134:
                   3135:     [a-z]+\\n |
                   3136:     .|\\n     /* it's not a keyword */
                   3137:
                   3138: .fi
                   3139: One has to be careful here, as we have now reintroduced backing up
                   3140: into the scanner.  In particular, while
                   3141: .I we
                   3142: know that there will never be any characters in the input stream
                   3143: other than letters or newlines,
                   3144: .I flex
                   3145: can't figure this out, and it will plan for possibly needing to back up
                   3146: when it has scanned a token like "auto" and then the next character
                   3147: is something other than a newline or a letter.  Previously it would
                   3148: then just match the "auto" rule and be done, but now it has no "auto"
1.10      deraadt  3149: rule, only an "auto\\n" rule.  To eliminate the possibility of backing up,
1.1       deraadt  3150: we could either duplicate all rules but without final newlines, or,
                   3151: since we never expect to encounter such an input and therefore don't
                   3152: care how it's classified, we can introduce one more catch-all rule, this
                   3153: one which doesn't include a newline:
                   3154: .nf
                   3155:
                   3156:     %%
                   3157:     asm\\n    |
                   3158:     auto\\n   |
                   3159:     break\\n  |
                   3160:     ... etc ...
                   3161:     volatile\\n |
                   3162:     while\\n  /* it's a keyword */
                   3163:
                   3164:     [a-z]+\\n |
                   3165:     [a-z]+   |
                   3166:     .|\\n     /* it's not a keyword */
                   3167:
                   3168: .fi
                   3169: Compiled with
                   3170: .B \-Cf,
                   3171: this is about as fast as one can get a
1.7       aaron    3172: .I flex
1.1       deraadt  3173: scanner to go for this particular problem.
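The claim that merging newline recognition into the word rules halves the number of matches can be checked with a small simulation. The sketch below is not flex output; it is a hand-written C++ counter that assumes longest-match scanning over the rule sets shown above, and the names is_lower, count_matches and check_example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for the characters matched by the [a-z] class in the rules above.
static bool is_lower(char c)
{
    return c >= 'a' && c <= 'z';
}

// Counts the matches a longest-match scanner would make on input.
// merge_newline == false: every run of [a-z] is one match and every other
// character (newline included) is a match of its own.
// merge_newline == true: a newline directly after a word is folded into
// the word's match, as with the "word\n" rules.
std::size_t count_matches(const std::string& input, bool merge_newline)
{
    std::size_t matches = 0;
    std::size_t i = 0;

    while (i < input.size()) {
        if (is_lower(input[i])) {
            while (i < input.size() && is_lower(input[i]))
                ++i;
            if (merge_newline && i < input.size() && input[i] == '\n')
                ++i;
        } else {
            ++i;
        }
        ++matches;
    }
    return matches;
}

// Three lines, one word per line: six matches without merging, three with.
void check_example()
{
    const std::string input = "auto\nfoo\nwhile\n";

    assert(count_matches(input, false) == 6);
    assert(count_matches(input, true) == 3);
}
```

The counter only models how many times a rule fires; it says nothing about the table lookups that make up the rest of the scanner's running time.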
                   3174: .PP
                   3175: A final note:
                   3176: .I flex
                   3177: is slow when matching NUL's, particularly when a token contains
                   3178: multiple NUL's.
                   3179: It's best to write rules which match
                   3180: .I short
                   3181: amounts of text if it's anticipated that the text will often include NUL's.
                   3182: .PP
                   3183: Another final note regarding performance: as mentioned above in the section
                   3184: How the Input is Matched, dynamically resizing
                   3185: .B yytext
                   3186: to accommodate huge tokens is a slow process because it presently requires that
                   3187: the (huge) token be rescanned from the beginning.  Thus if performance is
                   3188: vital, you should attempt to match "large" quantities of text but not
                   3189: "huge" quantities, where the cutoff between the two is at about 8K
                   3190: characters/token.
                   3191: .SH GENERATING C++ SCANNERS
                   3192: .I flex
                   3193: provides two different ways to generate scanners for use with C++.  The
                   3194: first way is to simply compile a scanner generated by
                   3195: .I flex
                   3196: using a C++ compiler instead of a C compiler.  You should not encounter
1.10      deraadt  3197: any compilation errors (please report any you find to the email address
1.1       deraadt  3198: given in the Author section below).  You can then use C++ code in your
                   3199: rule actions instead of C code.  Note that the default input source for
                   3200: your scanner remains
                   3201: .I yyin,
                   3202: and default echoing is still done to
                   3203: .I yyout.
                   3204: Both of these remain
                   3205: .I FILE *
                   3206: variables and not C++
                   3207: .I streams.
                   3208: .PP
                   3209: You can also use
                   3210: .I flex
                   3211: to generate a C++ scanner class, using the
                   3212: .B \-+
                   3213: option (or, equivalently,
                   3214: .B %option c++),
                   3215: which is automatically specified if the name of the flex
                   3216: executable ends in a '+', such as
                   3217: .I flex++.
                   3218: When using this option, flex defaults to generating the scanner to the file
                   3219: .B lex.yy.cc
                   3220: instead of
                   3221: .B lex.yy.c.
                   3222: The generated scanner includes the header file
1.5       deraadt  3223: .I g++/FlexLexer.h,
1.1       deraadt  3224: which defines the interface to two C++ classes.
                   3225: .PP
                   3226: The first class,
                   3227: .B FlexLexer,
                   3228: provides an abstract base class defining the general scanner class
                   3229: interface.  It provides the following member functions:
                   3230: .TP
                   3231: .B const char* YYText()
                   3232: returns the text of the most recently matched token, the equivalent of
                   3233: .B yytext.
                   3234: .TP
                   3235: .B int YYLeng()
                   3236: returns the length of the most recently matched token, the equivalent of
                   3237: .B yyleng.
                   3238: .TP
                   3239: .B int lineno() const
                   3240: returns the current input line number
                   3241: (see
                   3242: .B %option yylineno),
                   3243: or
                   3244: .B 1
                   3245: if
                   3246: .B %option yylineno
                   3247: was not used.
                   3248: .TP
                   3249: .B void set_debug( int flag )
                   3250: sets the debugging flag for the scanner, equivalent to assigning to
                   3251: .B yy_flex_debug
                   3252: (see the Options section above).  Note that you must build the scanner
                   3253: using
                   3254: .B %option debug
                   3255: to include debugging information in it.
                   3256: .TP
                   3257: .B int debug() const
                   3258: returns the current setting of the debugging flag.
                   3259: .PP
                   3260: Also provided are member functions equivalent to
                   3261: .B yy_switch_to_buffer(),
                   3262: .B yy_create_buffer()
                   3263: (though the first argument is an
                   3264: .B istream*
                   3265: object pointer and not a
                   3266: .B FILE*),
                   3267: .B yy_flush_buffer(),
                   3268: .B yy_delete_buffer(),
                   3269: and
                   3270: .B yyrestart()
1.10      deraadt  3271: (again, the first argument is an
1.1       deraadt  3272: .B istream*
                   3273: object pointer).
                   3274: .PP
                   3275: The second class defined in
1.5       deraadt  3276: .I g++/FlexLexer.h
1.1       deraadt  3277: is
                   3278: .B yyFlexLexer,
                   3279: which is derived from
                   3280: .B FlexLexer.
                   3281: It defines the following additional member functions:
                   3282: .TP
                   3283: .B
                   3284: yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
                   3285: constructs a
                   3286: .B yyFlexLexer
                   3287: object using the given streams for input and output.  If not specified,
                   3288: the streams default to
                   3289: .B cin
                   3290: and
                   3291: .B cout,
                   3292: respectively.
                   3293: .TP
                   3294: .B virtual int yylex()
1.10      deraadt  3295: performs the same role as
1.1       deraadt  3296: .B yylex()
                   3297: does for ordinary flex scanners: it scans the input stream, consuming
                   3298: tokens, until a rule's action returns a value.  If you derive a subclass
                   3299: .B S
                   3300: from
                   3301: .B yyFlexLexer
                   3302: and want to access the member functions and variables of
                   3303: .B S
                   3304: inside
                   3305: .B yylex(),
                   3306: then you need to use
                   3307: .B %option yyclass="S"
                   3308: to inform
                   3309: .I flex
                   3310: that you will be using that subclass instead of
                   3311: .B yyFlexLexer.
                   3312: In this case, rather than generating
                   3313: .B yyFlexLexer::yylex(),
                   3314: .I flex
                   3315: generates
                   3316: .B S::yylex()
                   3317: (and also generates a dummy
                   3318: .B yyFlexLexer::yylex()
                   3319: that calls
                   3320: .B yyFlexLexer::LexerError()
                   3321: if called).
                   3322: .TP
                   3323: .B
                   3324: virtual void switch_streams(istream* new_in = 0,
                   3325: .B
                   3326: ostream* new_out = 0)
                   3327: reassigns
                   3328: .B yyin
                   3329: to
                   3330: .B new_in
                   3331: (if non-null)
                   3332: and
                   3333: .B yyout
                   3334: to
                   3335: .B new_out
                   3336: (ditto), deleting the previous input buffer if
                   3337: .B yyin
                   3338: is reassigned.
                   3339: .TP
                   3340: .B
                   3341: int yylex( istream* new_in, ostream* new_out = 0 )
                   3342: first switches the input streams via
                   3343: .B switch_streams( new_in, new_out )
                   3344: and then returns the value of
                   3345: .B yylex().
                   3346: .PP
                   3347: In addition,
                   3348: .B yyFlexLexer
                   3349: defines the following protected virtual functions which you can redefine
                   3350: in derived classes to tailor the scanner:
                   3351: .TP
                   3352: .B
                   3353: virtual int LexerInput( char* buf, int max_size )
                   3354: reads up to
                   3355: .B max_size
                   3356: characters into
                   3357: .B buf
                   3358: and returns the number of characters read.  To indicate end-of-input,
                   3359: return 0 characters.  Note that "interactive" scanners (see the
                   3360: .B \-B
                   3361: and
                   3362: .B \-I
                   3363: flags) define the macro
                   3364: .B YY_INTERACTIVE.
                   3365: If you redefine
                   3366: .B LexerInput()
                   3367: and need to take different actions depending on whether or not
                   3368: the scanner might be scanning an interactive input source, you can
                   3369: test for the presence of this name via
                   3370: .B #ifdef.
                   3371: .TP
                   3372: .B
                   3373: virtual void LexerOutput( const char* buf, int size )
                   3374: writes out
                   3375: .B size
                   3376: characters from the buffer
                   3377: .B buf,
                   3378: which, while NUL-terminated, may also contain "internal" NUL's if
                   3379: the scanner's rules can match text with NUL's in them.
                   3380: .TP
                   3381: .B
                   3382: virtual void LexerError( const char* msg )
                   3383: reports a fatal error message.  The default version of this function
                   3384: writes the message to the stream
                   3385: .B cerr
                   3386: and exits.
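The contracts described above for LexerInput() and LexerOutput() can be sketched as free functions over standard streams. This is an illustration under assumptions, not the code flex generates: read_input and write_output are hypothetical names, and YY_INTERACTIVE is only defined when the code is compiled as part of an interactive flex scanner, so a standalone build takes the block-read path.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Same contract as LexerInput(): read up to max_size characters into buf
// and return how many were read; 0 signals end-of-input.
int read_input(std::istream& in, char* buf, int max_size)
{
    assert(buf != nullptr && max_size >= 0);

#ifdef YY_INTERACTIVE
    // Interactive source: hand back one character at a time so the
    // scanner never waits for more input than it needs.
    if (max_size > 0 && in.get(buf[0]))
        return 1;
    return 0;
#else
    // Non-interactive source: read a whole block in one call.
    in.read(buf, max_size);
    return static_cast<int>(in.gcount());
#endif
}

// Same contract as LexerOutput(): write exactly size characters.
// write() is used rather than operator<< because buf may contain
// "internal" NULs that would cut a C-string insertion short.
void write_output(std::ostream& out, const char* buf, int size)
{
    assert(buf != nullptr && size >= 0);
    out.write(buf, size);
}
```

In a real scanner the same bodies would go into overrides of the protected virtual functions in a class derived from yyFlexLexer, reading from and writing to whatever source and sink the subclass manages.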
                   3387: .PP
                   3388: Note that a
                   3389: .B yyFlexLexer
                   3390: object contains its
                   3391: .I entire
                   3392: scanning state.  Thus you can use such objects to create reentrant
                   3393: scanners.  You can instantiate multiple instances of the same
                   3394: .B yyFlexLexer
                   3395: class, and you can also combine multiple C++ scanner classes together
                   3396: in the same program using the
                   3397: .B \-P
                   3398: option discussed above.
                   3399: .PP
                   3400: Finally, note that the
                   3401: .B %array
                   3402: feature is not available to C++ scanner classes; you must use
                   3403: .B %pointer
                   3404: (the default).
                   3405: .PP
                   3406: Here is an example of a simple C++ scanner:
                   3407: .nf
                   3408:
                   3409:         // An example of using the flex C++ scanner class.
                   3410:
                   3411:     %{
                   3412:     int mylineno = 0;
                   3413:     %}
                   3414:
                   3415:     string  \\"[^\\n"]+\\"
                   3416:
                   3417:     ws      [ \\t]+
                   3418:
                   3419:     alpha   [A-Za-z]
                   3420:     dig     [0-9]
                   3421:     name    ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
                   3422:     num1    [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
                   3423:     num2    [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
                   3424:     number  {num1}|{num2}
                   3425:
                   3426:     %%
                   3427:
                   3428:     {ws}    /* skip blanks and tabs */
                   3429:
                   3430:     "/*"    {
                   3431:             int c;
                   3432:
                   3433:             while((c = yyinput()) != 0)
                   3434:                 {
                   3435:                 if(c == '\\n')
                   3436:                     ++mylineno;
                   3437:
                   3438:                 else if(c == '*')
                   3439:                     {
                   3440:                     if((c = yyinput()) == '/')
                   3441:                         break;
                   3442:                     else
                   3443:                         unput(c);
                   3444:                     }
                   3445:                 }
                   3446:             }
                   3447:
                   3448:     {number}  cout << "number " << YYText() << '\\n';
                   3449:
                   3450:     \\n        mylineno++;
                   3451:
                   3452:     {name}    cout << "name " << YYText() << '\\n';
                   3453:
                   3454:     {string}  cout << "string " << YYText() << '\\n';
                   3455:
                   3456:     %%
                   3457:
                   3458:     int main( int /* argc */, char** /* argv */ )
                   3459:         {
                   3460:         FlexLexer* lexer = new yyFlexLexer;
                   3461:         while(lexer->yylex() != 0)
                   3462:             ;
                   3463:         return 0;
                   3464:         }
                   3465: .fi
                   3466: If you want to create multiple (different) lexer classes, you use the
                   3467: .B \-P
                   3468: flag (or the
                   3469: .B prefix=
                   3470: option) to rename each
                   3471: .B yyFlexLexer
                   3472: to some other
                   3473: .B xxFlexLexer.
                   3474: You then can include
1.5       deraadt  3475: .B <g++/FlexLexer.h>
1.1       deraadt  3476: in your other sources once per lexer class, first renaming
                   3477: .B yyFlexLexer
                   3478: as follows:
                   3479: .nf
                   3480:
                   3481:     #undef yyFlexLexer
                   3482:     #define yyFlexLexer xxFlexLexer
1.5       deraadt  3483:     #include <g++/FlexLexer.h>
1.1       deraadt  3484:
                   3485:     #undef yyFlexLexer
                   3486:     #define yyFlexLexer zzFlexLexer
1.5       deraadt  3487:     #include <g++/FlexLexer.h>
1.1       deraadt  3488:
                   3489: .fi
                   3490: if, for example, you used
                   3491: .B %option prefix="xx"
                   3492: for one of your scanners and
                   3493: .B %option prefix="zz"
                   3494: for the other.
                   3495: .PP
                   3496: IMPORTANT: the present form of the scanning class is
                   3497: .I experimental
1.7       aaron    3498: and may change considerably between major releases.
1.1       deraadt  3499: .SH INCOMPATIBILITIES WITH LEX AND POSIX
                   3500: .I flex
                   3501: is a rewrite of the AT&T Unix
                   3502: .I lex
                   3503: tool (the two implementations do not share any code, though),
                   3504: with some extensions and incompatibilities, both of which
                   3505: are of concern to those who wish to write scanners acceptable
                   3506: to either implementation.  Flex is fully compliant with the POSIX
                   3507: .I lex
                   3508: specification, except that when using
                   3509: .B %pointer
                   3510: (the default), a call to
                   3511: .B unput()
                   3512: destroys the contents of
                   3513: .B yytext,
                   3514: which is counter to the POSIX specification.
                   3515: .PP
                   3516: In this section we discuss all of the known areas of incompatibility
                   3517: between flex, AT&T lex, and the POSIX specification.
                   3518: .PP
                   3519: .I flex's
                   3520: .B \-l
                   3521: option turns on maximum compatibility with the original AT&T
                   3522: .I lex
                   3523: implementation, at the cost of a major loss in the generated scanner's
                   3524: performance.  We note below which incompatibilities can be overcome
                   3525: using the
                   3526: .B \-l
                   3527: option.
                   3528: .PP
                   3529: .I flex
                   3530: is fully compatible with
                   3531: .I lex
                   3532: with the following exceptions:
                   3533: .IP -
                   3534: The undocumented
                   3535: .I lex
                   3536: scanner internal variable
                   3537: .B yylineno
                   3538: is not supported unless
                   3539: .B \-l
                   3540: or
                   3541: .B %option yylineno
                   3542: is used.
                   3543: .IP
                   3544: .B yylineno
                   3545: should be maintained on a per-buffer basis, rather than a per-scanner
                   3546: (single global variable) basis.
                   3547: .IP
                   3548: .B yylineno
                   3549: is not part of the POSIX specification.
                   3550: .IP -
                   3551: The
                   3552: .B input()
                   3553: routine is not redefinable, though it may be called to read characters
                   3554: following whatever has been matched by a rule.  If
                   3555: .B input()
                   3556: encounters an end-of-file the normal
                   3557: .B yywrap()
                   3558: processing is done.  A ``real'' end-of-file is returned by
                   3559: .B input()
                   3560: as
                   3561: .I EOF.
                   3562: .IP
                   3563: Input is instead controlled by defining the
                   3564: .B YY_INPUT
                   3565: macro.
                   3566: .IP
                   3567: The
                   3568: .I flex
                   3569: restriction that
                   3570: .B input()
                   3571: cannot be redefined is in accordance with the POSIX specification,
                   3572: which simply does not specify any way of controlling the
                   3573: scanner's input other than by making an initial assignment to
                   3574: .I yyin.
                   3575: .IP -
                   3576: The
                   3577: .B unput()
                   3578: routine is not redefinable.  This restriction is in accordance with POSIX.
                   3579: .IP -
                   3580: .I flex
                   3581: scanners are not as reentrant as
                   3582: .I lex
                   3583: scanners.  In particular, if you have an interactive scanner and
                   3584: an interrupt handler which long-jumps out of the scanner, and
                   3585: the scanner is subsequently called again, you may get the following
                   3586: message:
                   3587: .nf
                   3588:
                   3589:     fatal flex scanner internal error--end of buffer missed
                   3590:
                   3591: .fi
                   3592: To reenter the scanner, first use
                   3593: .nf
                   3594:
                   3595:     yyrestart( yyin );
                   3596:
                   3597: .fi
                   3598: Note that this call will throw away any buffered input; usually this
                   3599: isn't a problem with an interactive scanner.
                   3600: .IP
                   3601: Also note that flex C++ scanner classes
                   3602: .I are
                   3603: reentrant, so if using C++ is an option for you, you should use
                   3604: them instead.  See "Generating C++ Scanners" above for details.
                   3605: .IP -
                   3606: .B output()
                   3607: is not supported.
                   3608: Output from the
                   3609: .B ECHO
                   3610: macro is done to the file-pointer
                   3611: .I yyout
                   3612: (default
                   3613: .I stdout).
                   3614: .IP
                   3615: .B output()
                   3616: is not part of the POSIX specification.
                   3617: .IP -
                   3618: .I lex
                   3619: does not support exclusive start conditions (%x), though they
                   3620: are in the POSIX specification.
                   3621: .IP -
                   3622: When definitions are expanded,
                   3623: .I flex
                   3624: encloses them in parentheses.
                   3625: With lex, the following:
                   3626: .nf
                   3627:
                   3628:     NAME    [A-Z][A-Z0-9]*
                   3629:     %%
                   3630:     foo{NAME}?      printf( "Found it\\n" );
                   3631:     %%
                   3632:
                   3633: .fi
                   3634: will not match the string "foo" because when the macro
                   3635: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
                   3636: and the precedence is such that the '?' is associated with
                   3637: "[A-Z0-9]*".  With
                   3638: .I flex,
                   3639: the rule will be expanded to
                   3640: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
                   3641: .IP
                   3642: Note that if the definition begins with
                   3643: .B ^
                   3644: or ends with
                   3645: .B $
                   3646: then it is
                   3647: .I not
                   3648: expanded with parentheses, to allow these operators to appear in
                   3649: definitions without losing their special meanings.  But the
                   3650: .B <s>, /,
                   3651: and
                   3652: .B <<EOF>>
                   3653: operators cannot be used in a
                   3654: .I flex
                   3655: definition.
                   3656: .IP
                   3657: Using
                   3658: .B \-l
                   3659: results in the
                   3660: .I lex
                   3661: behavior of no parentheses around the definition.
                   3662: .IP
                   3663: The POSIX specification is that the definition be enclosed in parentheses.
                   3664: .IP -
                   3665: Some implementations of
                   3666: .I lex
                   3667: allow a rule's action to begin on a separate line, if the rule's pattern
                   3668: has trailing whitespace:
                   3669: .nf
                   3670:
                   3671:     %%
                   3672:     foo|bar<space here>
                   3673:       { foobar_action(); }
                   3674:
                   3675: .fi
                   3676: .I flex
                   3677: does not support this feature.
                   3678: .IP -
                   3679: The
                   3680: .I lex
                   3681: .B %r
                   3682: (generate a Ratfor scanner) option is not supported.  It is not part
                   3683: of the POSIX specification.
                   3684: .IP -
                   3685: After a call to
                   3686: .B unput(),
                   3687: .I yytext
                   3688: is undefined until the next token is matched, unless the scanner
                   3689: was built using
                   3690: .B %array.
                   3691: This is not the case with
                   3692: .I lex
                   3693: or the POSIX specification.  The
                   3694: .B \-l
                   3695: option does away with this incompatibility.
                   3696: .IP -
                   3697: The precedence of the
                   3698: .B {}
                   3699: (numeric range) operator is different.
                   3700: .I lex
                   3701: interprets "abc{1,3}" as "match one, two, or
                   3702: three occurrences of 'abc'", whereas
                   3703: .I flex
                   3704: interprets it as "match 'ab'
                   3705: followed by one, two, or three occurrences of 'c'".  The latter is
                   3706: in agreement with the POSIX specification.
                   3707: .IP -
                   3708: The precedence of the
                   3709: .B ^
                   3710: operator is different.
                   3711: .I lex
                   3712: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
                   3713: or 'bar' anywhere", whereas
                   3714: .I flex
                   3715: interprets it as "match either 'foo' or 'bar' if they come at the beginning
                   3716: of a line".  The latter is in agreement with the POSIX specification.
                   3717: .IP -
                   3718: The special table-size declarations such as
                   3719: .B %a
                   3720: supported by
                   3721: .I lex
                   3722: are not required by
                   3723: .I flex
                   3724: scanners;
                   3725: .I flex
                   3726: ignores them.
                   3727: .IP -
                   3728: The name
                   3729: .B
                   3730: FLEX_SCANNER
                   3731: is #define'd so scanners may be written for use with either
                   3732: .I flex
                   3733: or
                   3734: .I lex.
                   3735: Scanners also include
                   3736: .B YY_FLEX_MAJOR_VERSION
                   3737: and
                   3738: .B YY_FLEX_MINOR_VERSION
                   3739: indicating which version of
                   3740: .I flex
                   3741: generated the scanner
                   3742: (for example, for the 2.5 release, these defines would be 2 and 5
                   3743: respectively).
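The three precedence differences in the list above (parenthesized definitions, the numeric range operator, and the ^ operator) come down to where the implicit grouping falls. The sketch below spells out each grouping explicitly and checks it with C++ std::regex in its default ECMAScript grammar. It is an analogy only: std::regex is neither flex nor lex, and matches_whole, found_in and check_groupings are hypothetical helper names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if pattern matches the whole of text.
bool matches_whole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// True if pattern matches somewhere in text.
bool found_in(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}

void check_groupings()
{
    // Definition expansion of foo{NAME}? with NAME = [A-Z][A-Z0-9]*.
    // lex grouping: the '?' applies only to "[A-Z0-9]*".
    assert(!matches_whole("foo[A-Z]([A-Z0-9]*)?", "foo"));
    // flex grouping: the whole definition is optional.
    assert(matches_whole("foo([A-Z][A-Z0-9]*)?", "foo"));

    // Numeric range: abc{1,3}.
    // lex grouping: one to three occurrences of "abc".
    assert(matches_whole("(abc){1,3}", "abcabc"));
    // flex grouping: "ab" followed by one to three 'c'.
    assert(!matches_whole("abc{1,3}", "abcabc"));
    assert(matches_whole("abc{1,3}", "abcc"));

    // Anchor: ^foo|bar.
    // lex grouping: "foo" at the start, or "bar" anywhere.
    assert(found_in("(^foo)|(bar)", "xbar"));
    // flex grouping: "foo" or "bar", both only at the start.
    assert(!found_in("^(foo|bar)", "xbar"));
}
```

Writing the parentheses out by hand in a scanner specification, as in the patterns above, is one way to get the same behavior from either tool.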
                   3744: .PP
                   3745: The following
                   3746: .I flex
                   3747: features are not included in
                   3748: .I lex
                   3749: or the POSIX specification:
                   3750: .nf
                   3751:
                   3752:     C++ scanners
                   3753:     %option
                   3754:     start condition scopes
                   3755:     start condition stacks
                   3756:     interactive/non-interactive scanners
                   3757:     yy_scan_string() and friends
                   3758:     yyterminate()
                   3759:     yy_set_interactive()
                   3760:     yy_set_bol()
                   3761:     YY_AT_BOL()
                   3762:     <<EOF>>
                   3763:     <*>
                   3764:     YY_DECL
                   3765:     YY_START
                   3766:     YY_USER_ACTION
                   3767:     YY_USER_INIT
                   3768:     #line directives
                   3769:     %{}'s around actions
                   3770:     multiple actions on a line
                   3771:
                   3772: .fi
                   3773: plus almost all of the flex flags.
                   3774: The last feature in the list refers to the fact that with
                   3775: .I flex
                   3776: you can put multiple actions on the same line, separated with
                   3777: semicolons, while with
                   3778: .I lex,
                   3779: the following
                   3780: .nf
                   3781:
                   3782:     foo    handle_foo(); ++num_foos_seen;
                   3783:
                   3784: .fi
                   3785: is (rather surprisingly) truncated to
                   3786: .nf
                   3787:
                   3788:     foo    handle_foo();
                   3789:
                   3790: .fi
                   3791: .I flex
                   3792: does not truncate the action.  Actions that are not enclosed in
                   3793: braces are simply terminated at the end of the line.
                   3794: .SH DIAGNOSTICS
                   3795: .PP
                   3796: .I warning, rule cannot be matched
                   3797: indicates that the given rule
                   3798: cannot be matched because it follows other rules that will
                   3799: always match the same text as it.  For
                   3800: example, in the following "foo" cannot be matched because it comes after
                   3801: an identifier "catch-all" rule:
                   3802: .nf
                   3803:
                   3804:     [a-z]+    got_identifier();
                   3805:     foo       got_foo();
                   3806:
                   3807: .fi
                   3808: Using
                   3809: .B REJECT
                   3810: in a scanner suppresses this warning.
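                         If the shadowed rule is meant to match, one fix is to list it
                         before the catch-all rule, since a tie between equally long
                         matches goes to the rule listed first.  A minimal sketch,
                         reusing the actions from the example above:
                         .nf

                             foo       got_foo();
                             [a-z]+    got_identifier();

                         .fi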
                   3811: .PP
                   3812: .I warning,
                   3813: .B \-s
                   3814: .I
                   3815: option given but default rule can be matched
                   3816: means that it is possible (perhaps only in a particular start condition)
                   3817: that the default rule (match any single character) is the only one
                   3818: that will match a particular input.  Since
                   3819: .B \-s
                   3820: was given, presumably this is not intended.
                   3821: .PP
                   3822: .I reject_used_but_not_detected undefined
                   3823: or
                   3824: .I yymore_used_but_not_detected undefined -
                   3825: These errors can occur at compile time.  They indicate that the
                   3826: scanner uses
                   3827: .B REJECT
                   3828: or
                   3829: .B yymore()
                   3830: but that
                   3831: .I flex
                   3832: failed to notice the fact, meaning that
                   3833: .I flex
                   3834: scanned the first two sections looking for occurrences of these actions
1.10      deraadt  3835: and failed to find any, but somehow you snuck some in (via an #include
1.1       deraadt  3836: file, for example).  Use
                   3837: .B %option reject
                   3838: or
                   3839: .B %option yymore
                   3840: to indicate to flex that you really do use these features.
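                         A sketch of a definitions section that declares the use of
                         .B REJECT
                         up front; the header name actions.h is hypothetical and stands
                         for any file that hides the call from
                         .I flex:
                         .nf

                             %option reject
                             %{
                             #include "actions.h"    /* hypothetical; its macros invoke REJECT */
                             %}

                         .fi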
                   3841: .PP
                   3842: .I flex scanner jammed -
                   3843: a scanner compiled with
                   3844: .B \-s
                   3845: has encountered an input string which wasn't matched by
                   3846: any of its rules.  This error can also occur due to internal problems.
                   3847: .PP
                   3848: .I token too large, exceeds YYLMAX -
                   3849: your scanner uses
                   3850: .B %array
                   3851: and one of its rules matched a string longer than the
                   3852: .B YYLMAX
                   3853: constant (8K bytes by default).  You can increase the value by
                   3854: #define'ing
                   3855: .B YYLMAX
                   3856: in the definitions section of your
                   3857: .I flex
                   3858: input.
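                         A sketch of such a definition; the value 32768 is an arbitrary
                         illustration, not a recommendation:
                         .nf

                             %{
                             #define YYLMAX 32768
                             %}
                             %array

                         .fi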
                   3859: .PP
                   3860: .I scanner requires \-8 flag to
                   3861: .I use the character 'x' -
                   3862: Your scanner specification includes recognizing the 8-bit character
                   3863: .I 'x'
                   3864: and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
                   3865: because you used the
                   3866: .B \-Cf
                   3867: or
                   3868: .B \-CF
                   3869: table compression options.  See the discussion of the
                   3870: .B \-7
                   3871: flag for details.
                   3872: .PP
                   3873: .I flex scanner push-back overflow -
                   3874: you used
                   3875: .B unput()
                   3876: to push back so much text that the scanner's buffer could not hold
                   3877: both the pushed-back text and the current token in
                   3878: .B yytext.
                   3879: Ideally the scanner should dynamically resize the buffer in this case, but at
                   3880: present it does not.
                   3881: .PP
                   3882: .I
                   3883: input buffer overflow, can't enlarge buffer because scanner uses REJECT -
                   3884: the scanner was working on matching an extremely large token and needed
                   3885: to expand the input buffer.  This doesn't work with scanners that use
                   3886: .B
                   3887: REJECT.
                   3888: .PP
                   3889: .I
                   3890: fatal flex scanner internal error--end of buffer missed -
                   3891: This can occur in a scanner which is reentered after a long-jump
                   3892: has jumped out (or over) the scanner's activation frame.  Before
                   3893: reentering the scanner, use:
                   3894: .nf
                   3895:
                   3896:     yyrestart( yyin );
                   3897:
                   3898: .fi
                   3899: or, as noted above, switch to using the C++ scanner class.
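                         A minimal sketch of the
                         .B yyrestart()
                         approach, placed in the user code section of the scanner
                         and assuming a hypothetical jmp_buf named env that some handler
                         long-jumps to while
                         .B yylex()
                         is active:
                         .nf

                             #include <setjmp.h>

                             jmp_buf env;    /* hypothetical; a handler calls longjmp( env, 1 ) */

                             int main()
                                 {
                                 if ( setjmp( env ) != 0 )
                                     yyrestart( yyin );    /* reentering after a long-jump */
                                 yylex();
                                 return 0;
                                 }

                         .fi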
                   3900: .PP
                   3901: .I too many start conditions in <> construct! -
                   3902: you listed more start conditions in a <> construct than exist (so
                   3903: you must have listed at least one of them twice).
                   3904: .SH FILES
                   3905: .TP
                   3906: .B \-lfl
                   3907: library with which scanners must be linked.
                   3908: .TP
                   3909: .I lex.yy.c
                   3910: generated scanner (called
                   3911: .I lexyy.c
                   3912: on some systems).
                   3913: .TP
                   3914: .I lex.yy.cc
                   3915: generated C++ scanner class, when using
                   3916: .B -+.
                   3917: .TP
1.5       deraadt  3918: .I <g++/FlexLexer.h>
1.1       deraadt  3919: header file defining the C++ scanner base class,
                   3920: .B FlexLexer,
                   3921: and its derived class,
                   3922: .B yyFlexLexer.
                   3923: .TP
                   3924: .I flex.skl
                   3925: skeleton scanner.  This file is only used when building flex, not when
                   3926: flex executes.
                   3927: .TP
                   3928: .I lex.backup
                   3929: backing-up information for
                   3930: .B \-b
                   3931: flag (called
                   3932: .I lex.bck
                   3933: on some systems).
                   3934: .SH DEFICIENCIES / BUGS
                   3935: .PP
                   3936: Some trailing context
                   3937: patterns cannot be properly matched and generate
                   3938: warning messages ("dangerous trailing context").  These are
                   3939: patterns where the ending of the
                   3940: first part of the rule matches the beginning of the second
                   3941: part, such as "zx*/xy*", where the 'x*' matches the 'x' at
                   3942: the beginning of the trailing context.  (Note that the POSIX draft
                   3943: states that the text matched by such patterns is undefined.)
                   3944: .PP
                   3945: For some trailing context rules, parts which are actually fixed-length are
1.3       deraadt  3946: not recognized as such, leading to the above-mentioned performance loss.
1.1       deraadt  3947: In particular, parts using '|' or {n} (such as "foo{3}") are always
                   3948: considered variable-length.
                   3949: .PP
                   3950: Combining trailing context with the special '|' action can result in
                   3951: .I fixed
                   3952: trailing context being turned into the more expensive
                   3953: .I variable
                   3954: trailing context.  This happens, for example, in the following:
                   3955: .nf
                   3956:
                   3957:     %%
                   3958:     abc      |
                   3959:     xyz/def
                   3960:
                   3961: .fi
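                         If the cost matters, one possible workaround is to repeat the
                         action for each rule rather than share it with the '|' action.
                         A sketch, in which handle_match() is a hypothetical action:
                         .nf

                             %%
                             abc        handle_match();
                             xyz/def    handle_match();

                         .fi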
                   3962: .PP
                   3963: Use of
                   3964: .B unput()
                   3965: invalidates yytext and yyleng, unless the
                   3966: .B %array
                   3967: directive
                   3968: or the
                   3969: .B \-l
                   3970: option has been used.
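                         When neither is in effect, an action that needs the matched
                         text after calling
                         .B unput()
                         can save it first.  A sketch of such an action, assuming
                         <stdlib.h> and <string.h> are included in the definitions
                         section (error checking of strdup() is omitted):
                         .nf

                             {
                             /* save yytext and yyleng before unput() clobbers them */
                             int i, len = yyleng;
                             char *yycopy = strdup( yytext );
                             unput( ')' );
                             for ( i = len - 1; i >= 0; --i )
                                 unput( yycopy[i] );
                             unput( '(' );
                             free( yycopy );
                             }

                         .fi
                         The action pushes the matched text back enclosed in parentheses.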
                   3971: .PP
                   3972: Pattern-matching of NUL's is substantially slower than matching other
                   3973: characters.
                   3974: .PP
                   3975: Dynamic resizing of the input buffer is slow, as it entails rescanning
                   3976: all the text matched so far by the current (generally huge) token.
                   3977: .PP
                   3978: Due to both buffering of input and read-ahead, you cannot intermix
                   3979: calls to <stdio.h> routines, such as, for example,
                   3980: .B getchar(),
                   3981: with
                   3982: .I flex
                   3983: rules and expect it to work.  Call
                   3984: .B input()
                   3985: instead.
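                         A minimal sketch of an action that reads ahead through the
                         scanner's own buffer; the '@' delimiter is only an illustration:
                         .nf

                             "@"    {
                                    int c;
                                    /* consume input up to the next '@' or end of file */
                                    while ( (c = input()) != '@' && c != EOF )
                                        ;
                                    }

                         .fi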
                   3986: .PP
                   3987: The total table entries listed by the
                   3988: .B \-v
                   3989: flag excludes the number of table entries needed to determine
                   3990: what rule has been matched.  The number of entries is equal
                   3991: to the number of DFA states if the scanner does not use
                   3992: .B REJECT,
                   3993: and somewhat greater than the number of states if it does.
                   3994: .PP
                   3995: .B REJECT
                   3996: cannot be used with the
                   3997: .B \-f
                   3998: or
                   3999: .B \-F
                   4000: options.
                   4001: .PP
                   4002: The
                   4003: .I flex
                   4004: internal algorithms need documentation.
                   4005: .SH SEE ALSO
                   4006: .PP
                   4007: lex(1), yacc(1), sed(1), awk(1).
                   4008: .PP
                   4009: John Levine, Tony Mason, and Doug Brown,
                   4010: .I Lex & Yacc,
                   4011: O'Reilly and Associates.  Be sure to get the 2nd edition.
                   4012: .PP
                   4013: M. E. Lesk and E. Schmidt,
                   4014: .I LEX \- Lexical Analyzer Generator
                   4015: .PP
                   4016: Alfred Aho, Ravi Sethi and Jeffrey Ullman,
                   4017: .I Compilers: Principles, Techniques and Tools,
                   4018: Addison-Wesley (1986).  Describes the pattern-matching techniques used by
                   4019: .I flex
                   4020: (deterministic finite automata).
                   4021: .SH AUTHOR
                   4022: Vern Paxson, with the help of many ideas and much inspiration from
                   4023: Van Jacobson.  Original version by Jef Poskanzer.  The fast table
                   4024: representation is a partial implementation of a design done by Van
                   4025: Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
                   4026: .PP
                   4027: Thanks to the many
                   4028: .I flex
                   4029: beta-testers, feedbackers, and contributors, especially Francois Pinard,
                   4030: Casey Leedom,
                   4031: Robert Abramovitz,
                   4032: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
                   4033: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
                   4034: Karl Berry, Peter A. Bigot, Simon Blanchard,
                   4035: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
                   4036: Brian Clapper, J.T. Conklin,
                   4037: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11      deraadt  4038: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1       deraadt  4039: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
                   4040: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
                   4041: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
                   4042: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
                   4043: Jan Hajic, Charles Hemphill, NORO Hideo,
                   4044: Jarkko Hietaniemi, Scott Hofmann,
                   4045: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
                   4046: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
                   4047: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
                   4048: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
                   4049: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
                   4050: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
                   4051: David Loffredo, Mike Long,
                   4052: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
                   4053: Bengt Martensson, Chris Metcalf,
                   4054: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
                   4055: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
                   4056: Richard Ohnemus, Karsten Pahnke,
                   4057: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
                   4058: Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
                   4059: Frederic Raimbault, Pat Rankin, Rick Richardson,
                   4060: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
                   4061: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
                   4062: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
                   4063: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strömquist,
                   4064: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
                   4065: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
                   4066: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
                   4067: Yap, Ron Zellar, Nathan Zelle, David Zuhn,
                   4068: and those whose names have slipped my marginal
                   4069: mail-archiving skills but whose contributions are appreciated all the
                   4070: same.
                   4071: .PP
                   4072: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
                   4073: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
                   4074: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
                   4075: distribution headaches.
                   4076: .PP
                   4077: Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
                   4078: Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
                   4079: Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
                   4080: Eric Hughes for support of multiple buffers.
                   4081: .PP
                   4082: This work was primarily done when I was with the Real Time Systems Group
                   4083: at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks to all there
                   4084: for the support I received.
                   4085: .PP
                   4086: Send comments to vern@ee.lbl.gov.