src/usr.bin/lex/flex.1 - annotate

Annotation of src/usr.bin/lex/flex.1, Revision 1.13

1.13    ! millert     1: .\"    $OpenBSD: flex.1,v 1.12 2003/02/18 07:43:36 jmc Exp $
1.12      jmc         2: .\"
                      3: .\" Copyright (c) 1990 The Regents of the University of California.
                      4: .\" All rights reserved.
1.2       deraadt     5: .\"
1.12      jmc         6: .\" This code is derived from software contributed to Berkeley by
                      7: .\" Vern Paxson.
                      8: .\"
                      9: .\" The United States Government has rights in this work pursuant
                     10: .\" to contract no. DE-AC03-76SF00098 between the United States
                     11: .\" Department of Energy and the University of California.
                     12: .\"
                     13: .\" Redistribution and use in source and binary forms, with or without
1.13    ! millert    14: .\" modification, are permitted provided that the following conditions
        !            15: .\" are met:
        !            16: .\"
        !            17: .\" 1. Redistributions of source code must retain the above copyright
        !            18: .\"    notice, this list of conditions and the following disclaimer.
        !            19: .\" 2. Redistributions in binary form must reproduce the above copyright
        !            20: .\"    notice, this list of conditions and the following disclaimer in the
        !            21: .\"    documentation and/or other materials provided with the distribution.
        !            22: .\"
        !            23: .\" Neither the name of the University nor the names of its contributors
        !            24: .\" may be used to endorse or promote products derived from this software
        !            25: .\" without specific prior written permission.
        !            26: .\"
        !            27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
        !            28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
        !            29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
        !            30: .\" PURPOSE.
1.12      jmc        31: .\"
1.1       deraadt    32: .TH FLEX 1 "April 1995" "Version 2.5"
                     33: .SH NAME
                     34: flex \- fast lexical analyzer generator
                     35: .SH SYNOPSIS
                     36: .B flex
                     37: .B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
                     38: .B [\-\-help \-\-version]
                     39: .I [filename ...]
                     40: .SH OVERVIEW
                     41: This manual describes
                     42: .I flex,
                     43: a tool for generating programs that perform pattern-matching on text.  The
                     44: manual includes both tutorial and reference sections:
                     45: .nf
                     46:
                     47:     Description
                     48:         a brief overview of the tool
                     49:
                     50:     Some Simple Examples
                     51:
                     52:     Format Of The Input File
                     53:
                     54:     Patterns
                     55:         the extended regular expressions used by flex
                     56:
                     57:     How The Input Is Matched
                     58:         the rules for determining what has been matched
                     59:
                     60:     Actions
                     61:         how to specify what to do when a pattern is matched
                     62:
                     63:     The Generated Scanner
                     64:         details regarding the scanner that flex produces;
                     65:         how to control the input source
                     66:
                     67:     Start Conditions
                     68:         introducing context into your scanners, and
                     69:         managing "mini-scanners"
                     70:
                     71:     Multiple Input Buffers
                     72:         how to manipulate multiple input sources; how to
                     73:         scan from strings instead of files
                     74:
                     75:     End-of-file Rules
                     76:         special rules for matching the end of the input
                     77:
                     78:     Miscellaneous Macros
                     79:         a summary of macros available to the actions
                     80:
                     81:     Values Available To The User
                     82:         a summary of values available to the actions
                     83:
                     84:     Interfacing With Yacc
                     85:         connecting flex scanners together with yacc parsers
                     86:
                     87:     Options
                     88:         flex command-line options, and the "%option"
                     89:         directive
                     90:
                     91:     Performance Considerations
                     92:         how to make your scanner go as fast as possible
                     93:
                     94:     Generating C++ Scanners
                     95:         the (experimental) facility for generating C++
                     96:         scanner classes
                     97:
                     98:     Incompatibilities With Lex And POSIX
                     99:         how flex differs from AT&T lex and the POSIX lex
                    100:         standard
                    101:
                    102:     Diagnostics
                    103:         those error messages produced by flex (or scanners
                    104:         it generates) whose meanings might not be apparent
                    105:
                    106:     Files
                    107:         files used by flex
                    108:
                    109:     Deficiencies / Bugs
                    110:         known problems with flex
                    111:
                    112:     See Also
                    113:         other documentation, related tools
                    114:
                    115:     Author
                    116:         includes contact information
                    117:
                    118: .fi
                    119: .SH DESCRIPTION
                    120: .I flex
                    121: is a tool for generating
                    122: .I scanners:
1.9       millert   123: programs which recognize lexical patterns in text.
1.1       deraadt   124: .I flex
                    125: reads
                    126: the given input files, or its standard input if no file names are given,
                    127: for a description of a scanner to generate.  The description is in
                    128: the form of pairs
                    129: of regular expressions and C code, called
                    130: .I rules.  flex
                    131: generates as output a C source file,
                    132: .B lex.yy.c,
                    133: which defines a routine
                    134: .B yylex().
                    135: This file is compiled and linked with the
                    136: .B \-lfl
                    137: library to produce an executable.  When the executable is run,
                    138: it analyzes its input for occurrences
                    139: of the regular expressions.  Whenever it finds one, it executes
                    140: the corresponding C code.
                    141: .SH SOME SIMPLE EXAMPLES
                    142: .PP
                    143: First some simple examples to get the flavor of how one uses
                    144: .I flex.
                    145: The following
                    146: .I flex
                    147: input specifies a scanner which, whenever it encounters the string
                    148: "username", will replace it with the user's login name:
                    149: .nf
                    150:
                    151:     %%
                    152:     username    printf( "%s", getlogin() );
                    153:
                    154: .fi
                    155: By default, any text not matched by a
                    156: .I flex
                    157: scanner
                    158: is copied to the output, so the net effect of this scanner is
                    159: to copy its input file to its output with each occurrence
                    160: of "username" expanded.
                    161: In this input, there is just one rule.  "username" is the
                    162: .I pattern
                    163: and the "printf" is the
                    164: .I action.
                    165: The "%%" marks the beginning of the rules.
                    166: .PP
                    167: Here's another simple example:
                    168: .nf
                    169:
                    170:             int num_lines = 0, num_chars = 0;
                    171:
                    172:     %%
                    173:     \\n      ++num_lines; ++num_chars;
                    174:     .       ++num_chars;
                    175:
                    176:     %%
                    177:     int main()
                    178:             {
                    179:             yylex();
                    180:             printf( "# of lines = %d, # of chars = %d\\n",
                    181:                     num_lines, num_chars );
                    182:             }
                    183:
                    184: .fi
                    185: This scanner counts the number of characters and the number
                    186: of lines in its input (it produces no output other than the
                    187: final report on the counts).  The first line
                    188: declares two globals, "num_lines" and "num_chars", which are accessible
                    189: both inside
                    190: .B yylex()
                    191: and in the
                    192: .B main()
                    193: routine declared after the second "%%".  There are two rules, one
                    194: which matches a newline ("\\n") and increments both the line count and
                    195: the character count, and one which matches any character other than
                    196: a newline (indicated by the "." regular expression).
                    197: .PP
                    198: A somewhat more complicated example:
                    199: .nf
                    200:
                    201:     /* scanner for a toy Pascal-like language */
                    202:
                    203:     %{
                    204:     /* need this for the calls to atoi() and atof() below */
                    205:     #include <stdlib.h>
                    206:     %}
                    207:
                    208:     DIGIT    [0-9]
                    209:     ID       [a-z][a-z0-9]*
                    210:
                    211:     %%
                    212:
                    213:     {DIGIT}+    {
                    214:                 printf( "An integer: %s (%d)\\n", yytext,
                    215:                         atoi( yytext ) );
                    216:                 }
                    217:
                    218:     {DIGIT}+"."{DIGIT}*        {
                    219:                 printf( "A float: %s (%g)\\n", yytext,
                    220:                         atof( yytext ) );
                    221:                 }
                    222:
                    223:     if|then|begin|end|procedure|function        {
                    224:                 printf( "A keyword: %s\\n", yytext );
                    225:                 }
                    226:
                    227:     {ID}        printf( "An identifier: %s\\n", yytext );
                    228:
                    229:     "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
                    230:
                    231:     "{"[^}\\n]*"}"     /* eat up one-line comments */
                    232:
                    233:     [ \\t\\n]+          /* eat up whitespace */
                    234:
                    235:     .           printf( "Unrecognized character: %s\\n", yytext );
                    236:
                    237:     %%
                    238:
                    239:     int main( argc, argv )
                    240:     int argc;
                    241:     char **argv;
                    242:         {
                    243:         ++argv, --argc;  /* skip over program name */
                    244:         if ( argc > 0 )
                    245:                 yyin = fopen( argv[0], "r" );
                    246:         else
                    247:                 yyin = stdin;
1.7       aaron     248:
1.1       deraadt   249:         yylex();
                    250:         }
                    251:
                    252: .fi
                    253: This is the beginning of a simple scanner for a language like
                    254: Pascal.  It identifies different types of
                    255: .I tokens
                    256: and reports on what it has seen.
                    257: .PP
                    258: The details of this example will be explained in the following
                    259: sections.
                    260: .SH FORMAT OF THE INPUT FILE
                    261: The
                    262: .I flex
                    263: input file consists of three sections, separated by a line with just
                    264: .B %%
                    265: in it:
                    266: .nf
                    267:
                    268:     definitions
                    269:     %%
                    270:     rules
                    271:     %%
                    272:     user code
                    273:
                    274: .fi
                    275: The
                    276: .I definitions
                    277: section contains declarations of simple
                    278: .I name
                    279: definitions to simplify the scanner specification, and declarations of
                    280: .I start conditions,
                    281: which are explained in a later section.
                    282: .PP
                    283: Name definitions have the form:
                    284: .nf
                    285:
                    286:     name definition
                    287:
                    288: .fi
                    289: The "name" is a word beginning with a letter or an underscore ('_')
                    290: followed by zero or more letters, digits, '_', or '-' (dash).
1.8       aaron     291: The definition is taken to begin at the first non-whitespace character
1.1       deraadt   292: following the name and continuing to the end of the line.
                    293: The definition can subsequently be referred to using "{name}", which
                    294: will expand to "(definition)".  For example,
                    295: .nf
                    296:
                    297:     DIGIT    [0-9]
                    298:     ID       [a-z][a-z0-9]*
                    299:
                    300: .fi
                    301: defines "DIGIT" to be a regular expression which matches a
                    302: single digit, and
                    303: "ID" to be a regular expression which matches a letter
                    304: followed by zero-or-more letters-or-digits.
                    305: A subsequent reference to
                    306: .nf
                    307:
                    308:     {DIGIT}+"."{DIGIT}*
                    309:
                    310: .fi
                    311: is identical to
                    312: .nf
                    313:
                    314:     ([0-9])+"."([0-9])*
                    315:
                    316: .fi
                    317: and matches one-or-more digits followed by a '.' followed
                    318: by zero-or-more digits.
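The expansion of "{name}" to "(definition)" described above can be sketched in C. This is only an illustration of the textual substitution, not the code flex itself uses: the struct name_def type and the expand_definitions() function are hypothetical names introduced here, and the sketch does not give quoted strings or character classes any special treatment.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name/definition pair, as written in the definitions section. */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Copy `pattern` into `out`, replacing each {name} found in `defs` with
 * (definition).  Everything else, including repeat counts such as {2,5},
 * is copied unchanged.  Returns 0 on success, -1 if `out` is too small.
 */
int expand_definitions(const char *pattern, const struct name_def *defs,
                       size_t ndefs, char *out, size_t outsize)
{
    size_t used = 0;

    assert(pattern != NULL && out != NULL);
    if (outsize == 0)
        return -1;
    out[0] = '\0';
    while (*pattern != '\0') {
        const char *close = NULL;
        const char *replacement = NULL;

        /* A name starts with a letter or underscore, so {2,5} is skipped. */
        if (*pattern == '{' &&
            (isalpha((unsigned char)pattern[1]) || pattern[1] == '_') &&
            (close = strchr(pattern, '}')) != NULL) {
            size_t len = (size_t)(close - pattern) - 1;
            for (size_t i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    replacement = defs[i].definition;
                    break;
                }
            }
        }
        if (replacement != NULL) {
            /* Parenthesize the definition so it behaves as one unit. */
            int n = snprintf(out + used, outsize - used, "(%s)", replacement);
            if (n < 0 || (size_t)n >= outsize - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *pattern++;
            out[used] = '\0';
        }
    }
    return 0;
}
```

Called with the DIGIT and ID definitions above and the pattern {DIGIT}+"."{DIGIT}*, the function writes ([0-9])+"."([0-9])* into the output buffer. The added parentheses are what make a following operator such as '+' apply to the whole definition rather than to its last element.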
                    319: .PP
                    320: The
                    321: .I rules
                    322: section of the
                    323: .I flex
                    324: input contains a series of rules of the form:
                    325: .nf
                    326:
                    327:     pattern   action
                    328:
                    329: .fi
                    330: where the pattern must be unindented and the action must begin
                    331: on the same line.
                    332: .PP
                    333: See below for a further description of patterns and actions.
                    334: .PP
                    335: Finally, the user code section is simply copied to
                    336: .B lex.yy.c
                    337: verbatim.
                    338: It is used for companion routines which call or are called
                    339: by the scanner.  The presence of this section is optional;
                    340: if it is missing, the second
                    341: .B %%
                    342: in the input file may be skipped, too.
                    343: .PP
                    344: In the definitions and rules sections, any
                    345: .I indented
                    346: text or text enclosed in
                    347: .B %{
                    348: and
                    349: .B %}
                    350: is copied verbatim to the output (with the %{}'s removed).
                    351: The %{}'s must appear unindented on lines by themselves.
                    352: .PP
                    353: In the rules section,
                    354: any indented or %{} text appearing before the
                    355: first rule may be used to declare variables
                    356: which are local to the scanning routine and (after the declarations)
                    357: code which is to be executed whenever the scanning routine is entered.
                    358: Other indented or %{} text in the rule section is still copied to the output,
                    359: but its meaning is not well-defined and it may well cause compile-time
                    360: errors (this feature is present for
                    361: .I POSIX
                    362: compliance; see below for other such features).
                    363: .PP
                    364: In the definitions section (but not in the rules section),
                    365: an unindented comment (i.e., a line
                    366: beginning with "/*") is also copied verbatim to the output up
                    367: to the next "*/".
                    368: .SH PATTERNS
                    369: The patterns in the input are written using an extended set of regular
                    370: expressions.  These are:
                    371: .nf
                    372:
                    373:     x          match the character 'x'
                    374:     .          any character (byte) except newline
                    375:     [xyz]      a "character class"; in this case, the pattern
                    376:                  matches either an 'x', a 'y', or a 'z'
                    377:     [abj-oZ]   a "character class" with a range in it; matches
                    378:                  an 'a', a 'b', any letter from 'j' through 'o',
                    379:                  or a 'Z'
                    380:     [^A-Z]     a "negated character class", i.e., any character
                    381:                  but those in the class.  In this case, any
                    382:                  character EXCEPT an uppercase letter.
                    383:     [^A-Z\\n]   any character EXCEPT an uppercase letter or
                    384:                  a newline
                    385:     r*         zero or more r's, where r is any regular expression
                    386:     r+         one or more r's
                    387:     r?         zero or one r's (that is, "an optional r")
                    388:     r{2,5}     anywhere from two to five r's
                    389:     r{2,}      two or more r's
                    390:     r{4}       exactly 4 r's
                    391:     {name}     the expansion of the "name" definition
                    392:                (see above)
                    393:     "[xyz]\\"foo"
                    394:                the literal string: [xyz]"foo
                    395:     \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                    396:                  then the ANSI-C interpretation of \\x.
                    397:                  Otherwise, a literal 'X' (used to escape
                    398:                  operators such as '*')
                    399:     \\0         a NUL character (ASCII code 0)
                    400:     \\123       the character with octal value 123
                    401:     \\x2a       the character with hexadecimal value 2a
                    402:     (r)        match an r; parentheses are used to override
                    403:                  precedence (see below)
                    404:
                    405:
                    406:     rs         the regular expression r followed by the
                    407:                  regular expression s; called "concatenation"
                    408:
                    409:
                    410:     r|s        either an r or an s
                    411:
                    412:
                    413:     r/s        an r but only if it is followed by an s.  The
                    414:                  text matched by s is included when determining
                    415:                  whether this rule is the "longest match",
                    416:                  but is then returned to the input before
                    417:                  the action is executed.  So the action only
                    418:                  sees the text matched by r.  This type
                    419:                  of pattern is called "trailing context".
                    420:                  (There are some combinations of r/s that flex
                    421:                  cannot match correctly; see notes in the
                    422:                  Deficiencies / Bugs section below regarding
                    423:                  "dangerous trailing context".)
                    424:     ^r         an r, but only at the beginning of a line (i.e.,
1.10      deraadt   425:                  just starting to scan, or right after a
1.1       deraadt   426:                  newline has been scanned).
                    427:     r$         an r, but only at the end of a line (i.e., just
                    428:                  before a newline).  Equivalent to "r/\\n".
                    429:
                    430:                Note that flex's notion of "newline" is exactly
                    431:                whatever the C compiler used to compile flex
                    432:                interprets '\\n' as; in particular, on some DOS
                    433:                systems you must either filter out \\r's in the
                    434:                input yourself, or explicitly use r/\\r\\n for "r$".
                    435:
                    436:
                    437:     <s>r       an r, but only in start condition s (see
                    438:                  below for discussion of start conditions)
                    439:     <s1,s2,s3>r
                    440:                same, but in any of start conditions s1,
                    441:                  s2, or s3
                    442:     <*>r       an r in any start condition, even an exclusive one.
                    443:
                    444:
                    445:     <<EOF>>    an end-of-file
                    446:     <s1,s2><<EOF>>
                    447:                an end-of-file when in start condition s1 or s2
                    448:
                    449: .fi
                    450: Note that inside of a character class, all regular expression operators
                    451: lose their special meaning except escape ('\\') and the character class
                    452: operators, '-', ']', and, at the beginning of the class, '^'.
                    453: .PP
                    454: The regular expressions listed above are grouped according to
                    455: precedence, from highest precedence at the top to lowest at the bottom.
                    456: Those grouped together have equal precedence.  For example,
                    457: .nf
                    458:
                    459:     foo|bar*
                    460:
                    461: .fi
                    462: is the same as
                    463: .nf
                    464:
                    465:     (foo)|(ba(r*))
                    466:
                    467: .fi
                    468: since the '*' operator has higher precedence than concatenation,
                    469: and concatenation higher than alternation ('|').  This pattern
                    470: therefore matches
                    471: .I either
                    472: the string "foo"
                    473: .I or
                    474: the string "ba" followed by zero-or-more r's.
                    475: To match "foo" or zero-or-more "bar"'s, use:
                    476: .nf
                    477:
                    478:     foo|(bar)*
                    479:
                    480: .fi
                    481: and to match zero-or-more "foo"'s-or-"bar"'s:
                    482: .nf
                    483:
                    484:     (foo|bar)*
                    485:
                    486: .fi
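The three precedence examples above can be checked outside flex. The sketch below uses the POSIX regcomp()/regexec() interface with extended regular expressions, which is not flex's matching engine but gives repetition, concatenation, and alternation the same relative precedence. The helper name matches_whole() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of `text` matches the POSIX extended regular
 * expression `pattern`, 0 if it does not, and -1 if the pattern is too
 * long or fails to compile.  The pattern is wrapped as ^(pattern)$ so
 * that a partial match does not count.
 */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, "foo|bar*" accepts "foo", "ba", and "barrr" but rejects "barbar"; "foo|(bar)*" accepts "barbar" and the empty string but rejects "foobar"; and "(foo|bar)*" accepts "foobar".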
                    487: .PP
                    488: In addition to characters and ranges of characters, character classes
                    489: can also contain character class
                    490: .I expressions.
                    491: These are expressions enclosed inside
                    492: .B [:
                    493: and
                    494: .B :]
                    495: delimiters (which themselves must appear between the '[' and ']' of the
                    496: character class; other elements may occur inside the character class, too).
                    497: The valid expressions are:
                    498: .nf
                    499:
                    500:     [:alnum:] [:alpha:] [:blank:]
                    501:     [:cntrl:] [:digit:] [:graph:]
                    502:     [:lower:] [:print:] [:punct:]
                    503:     [:space:] [:upper:] [:xdigit:]
                    504:
                    505: .fi
                    506: These expressions all designate a set of characters equivalent to
                    507: the corresponding standard C
                    508: .B isXXX
                    509: function.  For example,
                    510: .B [:alnum:]
                    511: designates those characters for which
                    512: .B isalnum()
                    513: returns true - i.e., any alphabetic or numeric character.
                    514: Some systems don't provide
                    515: .B isblank(),
                    516: so flex defines
                    517: .B [:blank:]
                    518: as a blank or a tab.
                    519: .PP
                    520: For example, the following character classes are all equivalent:
                    521: .nf
                    522:
                    523:     [[:alnum:]]
1.4       deraadt   524:     [[:alpha:][:digit:]]
1.1       deraadt   525:     [[:alpha:]0-9]
                    526:     [a-zA-Z0-9]
                    527:
                    528: .fi
                    529: If your scanner is case-insensitive (the
                    530: .B \-i
                    531: flag), then
                    532: .B [:upper:]
                    533: and
                    534: .B [:lower:]
                    535: are equivalent to
                    536: .B [:alpha:].
                    537: .PP
                    538: Some notes on patterns:
                    539: .IP -
                    540: A negated character class such as the example "[^A-Z]"
                    541: above
                    542: .I will match a newline
                    543: unless "\\n" (or an equivalent escape sequence) is one of the
                    544: characters explicitly present in the negated character class
                    545: (e.g., "[^A-Z\\n]").  This is unlike how many other regular
                    546: expression tools treat negated character classes, but unfortunately
                    547: the inconsistency is historically entrenched.
                    548: Matching newlines means that a pattern like [^"]* can match the entire
                    549: input unless there's another quote in the input.
                    550: .IP -
                    551: A rule can have at most one instance of trailing context (the '/' operator
                    552: or the '$' operator).  The start condition, '^', and "<<EOF>>" patterns
                    553: can only occur at the beginning of a pattern and, like '/' and '$',
                    554: cannot be grouped inside parentheses.  A '^' which does not occur at
                    555: the beginning of a rule or a '$' which does not occur at the end of
                    556: a rule loses its special properties and is treated as a normal character.
                    557: .IP
                    558: The following are illegal:
                    559: .nf
                    560:
                    561:     foo/bar$
                    562:     <sc1>foo<sc2>bar
                    563:
                    564: .fi
                    565: Note that the first of these can be written "foo/bar\\n".
                    566: .IP
                    567: The following will result in '$' or '^' being treated as a normal character:
                    568: .nf
                    569:
                    570:     foo|(bar$)
                    571:     foo|^bar
                    572:
                    573: .fi
                    574: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
                    575: could be used (the special '|' action is explained below):
                    576: .nf
                    577:
                    578:     foo      |
                    579:     bar$     /* action goes here */
                    580:
                    581: .fi
                    582: A similar trick will work for matching a foo or a
                    583: bar-at-the-beginning-of-a-line.
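The first note above, that a negated character class matches a newline unless the newline is listed in it, can be observed with the POSIX regcomp()/regexec() interface as a stand-in for flex. When REG_NEWLINE is not set, POSIX regular expressions treat negated classes the same way. The helper name prefix_match_length() is hypothetical, and the sketch assumes a POSIX system.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return the length of the match of the POSIX extended regular
 * expression `pattern` at the start of `text`, or -1 if it does not
 * match there, is too long, or does not compile.  REG_NEWLINE is
 * deliberately left unset, so a negated class matches a newline unless
 * the newline is listed in the class.
 */
long prefix_match_length(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    regmatch_t m[1];
    long len = -1;
    int n;

    assert(pattern != NULL && text != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

For the input "one\ntwo\"three", the pattern [^"]* matches seven characters, running across the newline up to the quote, while [^"\n]* stops after the three characters of "one".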
                    584: .SH HOW THE INPUT IS MATCHED
                    585: When the generated scanner is run, it analyzes its input looking
                    586: for strings which match any of its patterns.  If it finds more than
                    587: one match, it takes the one matching the most text (for trailing
                    588: context rules, this includes the length of the trailing part, even
                    589: though it will then be returned to the input).  If it finds two
                    590: or more matches of the same length, the
                    591: rule listed first in the
                    592: .I flex
                    593: input file is chosen.
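.PP
The selection rule above can be sketched in plain C.
This is only an illustration, not code taken from the generated scanner:
the helper name pick_rule and the convention that a length of 0 means
"this rule did not match" are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex.  match_len[i] is the number of
 * characters rule i would match at the current input position (assumed
 * convention: 0 means the rule does not match).  Returns the index of
 * the winning rule, or -1 if no rule matches.
 */
int pick_rule(const size_t *match_len, int nrules)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < nrules; i++) {
        /* Strictly greater: on a tie the earlier rule keeps the win. */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

With lengths {3, 5, 5} the second rule (index 1) is chosen: it matches the
most text and is listed before the other rule of the same length.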
                    594: .PP
                    595: Once the match is determined, the text corresponding to the match
                    596: (called the
                    597: .I token)
                    598: is made available in the global character pointer
                    599: .B yytext,
                    600: and its length in the global integer
                    601: .B yyleng.
                    602: The
                    603: .I action
                    604: corresponding to the matched pattern is then executed (a more
                    605: detailed description of actions follows), and then the remaining
                    606: input is scanned for another match.
                    607: .PP
                    608: If no match is found, then the
                    609: .I default rule
                    610: is executed: the next character in the input is considered matched and
                    611: copied to the standard output.  Thus, the simplest legal
                    612: .I flex
                    613: input is:
                    614: .nf
                    615:
                    616:     %%
                    617:
                    618: .fi
                    619: which generates a scanner that simply copies its input (one character
                    620: at a time) to its output.
                    621: .PP
                    622: Note that
                    623: .B yytext
                    624: can be defined in two different ways: either as a character
                    625: .I pointer
                    626: or as a character
                    627: .I array.
                    628: You can control which definition
                    629: .I flex
                    630: uses by including one of the special directives
                    631: .B %pointer
                    632: or
                    633: .B %array
                    634: in the first (definitions) section of your flex input.  The default is
                    635: .B %pointer,
                    636: unless you use the
                    637: .B -l
                    638: lex compatibility option, in which case
                    639: .B yytext
                    640: will be an array.
                    641: The advantage of using
                    642: .B %pointer
                    643: is substantially faster scanning and no buffer overflow when matching
                    644: very large tokens (unless you run out of dynamic memory).  The disadvantage
                    645: is that you are restricted in how your actions can modify
                    646: .B yytext
                    647: (see the next section), and calls to the
                    648: .B unput()
1.10      deraadt   649: function destroy the present contents of
1.1       deraadt   650: .B yytext,
                    651: which can be a considerable porting headache when moving between different
                    652: .I lex
                    653: versions.
                    654: .PP
                    655: The advantage of
                    656: .B %array
                    657: is that you can then modify
                    658: .B yytext
                    659: to your heart's content, and calls to
                    660: .B unput()
                    661: do not destroy
                    662: .B yytext
                    663: (see below).  Furthermore, existing
                    664: .I lex
                    665: programs sometimes access
                    666: .B yytext
                    667: externally using declarations of the form:
                    668: .nf
                    669:     extern char yytext[];
                    670: .fi
                    671: This definition is erroneous when used with
                    672: .B %pointer,
                    673: but correct for
                    674: .B %array.
                    675: .PP
                    676: .B %array
                    677: defines
                    678: .B yytext
                    679: to be an array of
                    680: .B YYLMAX
                    681: characters, which defaults to a fairly large value.  You can change
                    682: the size by simply #define'ing
                    683: .B YYLMAX
                    684: to a different value in the first section of your
                    685: .I flex
                    686: input.  As mentioned above, with
                    687: .B %pointer
                    688: yytext grows dynamically to accommodate large tokens.  While this means your
                    689: .B %pointer
                    690: scanner can accommodate very large tokens (such as matching entire blocks
                    691: of comments), bear in mind that each time the scanner must resize
                    692: .B yytext
                    693: it also must rescan the entire token from the beginning, so matching such
                    694: tokens can prove slow.
                    695: .B yytext
                    696: presently does
                    697: .I not
                    698: dynamically grow if a call to
                    699: .B unput()
                    700: results in too much text being pushed back; instead, a run-time error results.
                    701: .PP
                    702: Also note that you cannot use
                    703: .B %array
                    704: with C++ scanner classes
                    705: (the
                    706: .B c++
                    707: option; see below).
                    708: .SH ACTIONS
                    709: Each pattern in a rule has a corresponding action, which can be any
                    710: arbitrary C statement.  The pattern ends at the first non-escaped
                    711: whitespace character; the remainder of the line is its action.  If the
                    712: action is empty, then when the pattern is matched the input token
                    713: is simply discarded.  For example, here is the specification for a program
                    714: which deletes all occurrences of "zap me" from its input:
                    715: .nf
                    716:
                    717:     %%
                    718:     "zap me"
                    719:
                    720: .fi
                    721: (It will copy all other characters in the input to the output since
                    722: they will be matched by the default rule.)
                    723: .PP
                    724: Here is a program which compresses multiple blanks and tabs down to
                    725: a single blank, and throws away whitespace found at the end of a line:
                    726: .nf
                    727:
                    728:     %%
                    729:     [ \\t]+        putchar( ' ' );
                    730:     [ \\t]+$       /* ignore this token */
                    731:
                    732: .fi
                    733: .PP
                    734: If the action contains a '{', then the action spans till the balancing '}'
                    735: is found, and the action may cross multiple lines.
1.7       aaron     736: .I flex
1.1       deraadt   737: knows about C strings and comments and won't be fooled by braces found
                    738: within them, but also allows actions to begin with
                    739: .B %{
                    740: and will consider the action to be all the text up to the next
                    741: .B %}
                    742: (regardless of ordinary braces inside the action).
                    743: .PP
                    744: An action consisting solely of a vertical bar ('|') means "same as
                    745: the action for the next rule."  See below for an illustration.
                    746: .PP
                    747: Actions can include arbitrary C code, including
                    748: .B return
                    749: statements to return a value to whatever routine called
                    750: .B yylex().
                    751: Each time
                    752: .B yylex()
                    753: is called it continues processing tokens from where it last left
                    754: off until it either reaches
                    755: the end of the file or executes a return.
                    756: .PP
                    757: Actions are free to modify
                    758: .B yytext
                    759: except for lengthening it (adding
                    760: characters to its end--these will overwrite later characters in the
                    761: input stream).  This however does not apply when using
                    762: .B %array
                    763: (see above); in that case,
                    764: .B yytext
                    765: may be freely modified in any way.
                    766: .PP
                    767: Actions are free to modify
                    768: .B yyleng
                    769: except they should not do so if the action also includes use of
                    770: .B yymore()
                    771: (see below).
                    772: .PP
                    773: There are a number of special directives which can be included within
                    774: an action:
                    775: .IP -
                    776: .B ECHO
                    777: copies yytext to the scanner's output.
                    778: .IP -
                    779: .B BEGIN
                    780: followed by the name of a start condition places the scanner in the
                    781: corresponding start condition (see below).
                    782: .IP -
                    783: .B REJECT
                    784: directs the scanner to proceed on to the "second best" rule which matched the
                    785: input (or a prefix of the input).  The rule is chosen as described
                    786: above in "How the Input is Matched", and
                    787: .B yytext
                    788: and
                    789: .B yyleng
                    790: set up appropriately.
                    791: It may either be one which matched as much text
                    792: as the originally chosen rule but came later in the
                    793: .I flex
                    794: input file, or one which matched less text.
                    795: For example, the following will both count the
                    796: words in the input and call the routine special() whenever "frob" is seen:
                    797: .nf
                    798:
                    799:             int word_count = 0;
                    800:     %%
                    801:
                    802:     frob        special(); REJECT;
                    803:     [^ \\t\\n]+   ++word_count;
                    804:
                    805: .fi
                    806: Without the
                    807: .B REJECT,
                    808: any "frob"'s in the input would not be counted as words, since the
                    809: scanner normally executes only one action per token.
                    810: Multiple
                    811: .B REJECT's
                    812: are allowed, each one finding the next best choice to the currently
                    813: active rule.  For example, when the following scanner scans the token
                    814: "abcd", it will write "abcdabcaba" to the output:
                    815: .nf
                    816:
                    817:     %%
                    818:     a        |
                    819:     ab       |
                    820:     abc      |
                    821:     abcd     ECHO; REJECT;
                    822:     .|\\n     /* eat up any unmatched character */
                    823:
                    824: .fi
                    825: (The first three rules share the fourth's action since they use
                    826: the special '|' action.)
                    827: .B REJECT
                    828: is a particularly expensive feature in terms of scanner performance;
                    829: if it is used in
                    830: .I any
                    831: of the scanner's actions it will slow down
                    832: .I all
                    833: of the scanner's matching.  Furthermore,
                    834: .B REJECT
                    835: cannot be used with the
                    836: .I -Cf
                    837: or
                    838: .I -CF
                    839: options (see below).
                    840: .IP
                    841: Note also that unlike the other special actions,
                    842: .B REJECT
                    843: is a
                    844: .I branch;
                    845: code immediately following it in the action will
                    846: .I not
                    847: be executed.
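.IP
The "abcd" example above can be checked with a small stand-alone C
simulation.
It is a sketch under a stated assumption: there is an ECHO; REJECT; rule
for every non-empty prefix of the token, as in that example.
The function name echo_reject_prefixes is hypothetical and is not part of
flex.

```c
#include <string.h>

/*
 * Hypothetical simulation, not flex code.  Appends every non-empty
 * prefix of token to out, longest first, which is the order in which
 * ECHO; REJECT; visits successively shorter matches.  out must have
 * room for n*(n+1)/2 + 1 bytes, where n is the length of token.
 */
void echo_reject_prefixes(const char *token, char *out)
{
    size_t n = strlen(token);
    size_t pos = 0;
    size_t len;

    for (len = n; len > 0; len--) {
        memcpy(out + pos, token, len);   /* the ECHO for this match */
        pos += len;                      /* REJECT: try a shorter match */
    }
    out[pos] = 0;
}
```

For the token "abcd" the buffer ends up holding "abcdabcaba", matching the
output described above.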
                    848: .IP -
                    849: .B yymore()
                    850: tells the scanner that the next time it matches a rule, the corresponding
                    851: token should be
                    852: .I appended
                    853: onto the current value of
                    854: .B yytext
                    855: rather than replacing it.  For example, given the input "mega-kludge"
                    856: the following will write "mega-mega-kludge" to the output:
                    857: .nf
                    858:
                    859:     %%
                    860:     mega-    ECHO; yymore();
                    861:     kludge   ECHO;
                    862:
                    863: .fi
                    864: First "mega-" is matched and echoed to the output.  Then "kludge"
                    865: is matched, but the previous "mega-" is still hanging around at the
                    866: beginning of
                    867: .B yytext
                    868: so the
                    869: .B ECHO
                    870: for the "kludge" rule will actually write "mega-kludge".
                    871: .PP
                    872: Two notes regarding use of
                    873: .B yymore().
                    874: First,
                    875: .B yymore()
                    876: depends on the value of
                    877: .I yyleng
                    878: correctly reflecting the size of the current token, so you must not
                    879: modify
                    880: .I yyleng
                    881: if you are using
                    882: .B yymore().
                    883: Second, the presence of
                    884: .B yymore()
                    885: in the scanner's action entails a minor performance penalty in the
                    886: scanner's matching speed.
                    887: .IP -
                    888: .B yyless(n)
                    889: returns all but the first
                    890: .I n
                    891: characters of the current token back to the input stream, where they
                    892: will be rescanned when the scanner looks for the next match.
                    893: .B yytext
                    894: and
                    895: .B yyleng
                    896: are adjusted appropriately (e.g.,
                    897: .B yyleng
                    898: will now be equal to
                    899: .I n
                    900: ).  For example, on the input "foobar" the following will write out
                    901: "foobarbar":
                    902: .nf
                    903:
                    904:     %%
                    905:     foobar    ECHO; yyless(3);
                    906:     [a-z]+    ECHO;
                    907:
                    908: .fi
                    909: An argument of 0 to
                    910: .B yyless
                    911: will cause the entire current input string to be scanned again.  Unless you've
                    912: changed how the scanner will subsequently process its input (using
                    913: .B BEGIN,
                    914: for example), this will result in an endless loop.
                    915: .PP
                    916: Note that
                    917: .B yyless
                    918: is a macro and can only be used in the flex input file, not from
                    919: other source files.
                    920: .IP -
                    921: .B unput(c)
                    922: puts the character
                    923: .I c
                    924: back onto the input stream.  It will be the next character scanned.
                    925: The following action will take the current token and cause it
                    926: to be rescanned enclosed in parentheses.
                    927: .nf
                    928:
                    929:     {
                    930:     int i;
                    931:     /* Copy yytext because unput() trashes yytext */
                    932:     char *yycopy = strdup( yytext );
                    933:     unput( ')' );
                    934:     for ( i = yyleng - 1; i >= 0; --i )
                    935:         unput( yycopy[i] );
                    936:     unput( '(' );
                    937:     free( yycopy );
                    938:     }
                    939:
                    940: .fi
                    941: Note that since each
                    942: .B unput()
                    943: puts the given character back at the
                    944: .I beginning
                    945: of the input stream, pushing back strings must be done back-to-front.
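.PP
Why strings must be pushed back last character first can be seen in a
small model of the pushback mechanism.
This is a sketch only: the struct and the pb_ function names are
hypothetical stand-ins for the scanner's input, not flex internals, and
the fixed capacity is an assumption of the example.

```c
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input stream. */
struct pushback {
    char buf[PUSHBACK_MAX];
    int  top;               /* number of characters pushed back */
};

/* Like unput(c): c becomes the next character read.  -1 if full. */
int pb_unput(struct pushback *pb, char c)
{
    if (pb->top >= PUSHBACK_MAX)
        return -1;
    pb->buf[pb->top++] = c;
    return 0;
}

/* Like input(): the most recently pushed character, or -1 if empty. */
int pb_input(struct pushback *pb)
{
    if (pb->top == 0)
        return -1;
    return (unsigned char)pb->buf[--pb->top];
}

/* Push back a whole string so that it is reread in its original order. */
int pb_unput_string(struct pushback *pb, const char *s)
{
    size_t i;

    for (i = strlen(s); i > 0; i--)      /* back to front */
        if (pb_unput(pb, s[i - 1]) != 0)
            return -1;
    return 0;
}
```

After pb_unput_string(&pb, "(foo)"), successive calls to pb_input() return
the characters of "(foo)" from left to right.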
                    946: .PP
                    947: An important potential problem when using
                    948: .B unput()
                    949: is that if you are using
                    950: .B %pointer
                    951: (the default), a call to
                    952: .B unput()
                    953: .I destroys
                    954: the contents of
                    955: .I yytext,
                    956: starting with its rightmost character and devouring one character to
                    957: the left with each call.  If you need the value of yytext preserved
                    958: after a call to
                    959: .B unput()
                    960: (as in the above example),
                    961: you must either first copy it elsewhere, or build your scanner using
                    962: .B %array
                    963: instead (see How The Input Is Matched).
                    964: .PP
                    965: Finally, note that you cannot put back
                    966: .B EOF
                    967: to attempt to mark the input stream with an end-of-file.
                    968: .IP -
                    969: .B input()
                    970: reads the next character from the input stream.  For example,
                    971: the following is one way to eat up C comments:
                    972: .nf
                    973:
                    974:     %%
                    975:     "/*"        {
                    976:                 register int c;
                    977:
                    978:                 for ( ; ; )
                    979:                     {
                    980:                     while ( (c = input()) != '*' &&
                    981:                             c != EOF )
                    982:                         ;    /* eat up text of comment */
                    983:
                    984:                     if ( c == '*' )
                    985:                         {
                    986:                         while ( (c = input()) == '*' )
                    987:                             ;
                    988:                         if ( c == '/' )
                    989:                             break;    /* found the end */
                    990:                         }
                    991:
                    992:                     if ( c == EOF )
                    993:                         {
                    994:                         error( "EOF in comment" );
                    995:                         break;
                    996:                         }
                    997:                     }
                    998:                 }
                    999:
                   1000: .fi
                   1001: (Note that if the scanner is compiled using
                   1002: .B C++,
                   1003: then
                   1004: .B input()
                   1005: is instead referred to as
                   1006: .B yyinput(),
                   1007: in order to avoid a name clash with the
                   1008: .B C++
                   1009: stream by the name of
                   1010: .I input.)
                   1011: .IP -
                   1012: .B YY_FLUSH_BUFFER
                   1013: flushes the scanner's internal buffer
                   1014: so that the next time the scanner attempts to match a token, it will
                   1015: first refill the buffer using
                   1016: .B YY_INPUT
                   1017: (see The Generated Scanner, below).  This action is a special case
                   1018: of the more general
                   1019: .B yy_flush_buffer()
                   1020: function, described below in the section Multiple Input Buffers.
                   1021: .IP -
                   1022: .B yyterminate()
                   1023: can be used in lieu of a return statement in an action.  It terminates
                   1024: the scanner and returns a 0 to the scanner's caller, indicating "all done".
                   1025: By default,
                   1026: .B yyterminate()
                   1027: is also called when an end-of-file is encountered.  It is a macro and
                   1028: may be redefined.
                   1029: .SH THE GENERATED SCANNER
                   1030: The output of
                   1031: .I flex
                   1032: is the file
                   1033: .B lex.yy.c,
                   1034: which contains the scanning routine
                   1035: .B yylex(),
                   1036: a number of tables used by it for matching tokens, and a number
                   1037: of auxiliary routines and macros.  By default,
                   1038: .B yylex()
                   1039: is declared as follows:
                   1040: .nf
                   1041:
                   1042:     int yylex()
                   1043:         {
                   1044:         ... various definitions and the actions in here ...
                   1045:         }
                   1046:
                   1047: .fi
                   1048: (If your environment supports function prototypes, then it will
                   1049: be "int yylex( void )".)  This definition may be changed by defining
                   1050: the "YY_DECL" macro.  For example, you could use:
                   1051: .nf
                   1052:
                   1053:     #define YY_DECL float lexscan( a, b ) float a, b;
                   1054:
                   1055: .fi
                   1056: to give the scanning routine the name
                   1057: .I lexscan,
                   1058: returning a float, and taking two floats as arguments.  Note that
                   1059: if you give arguments to the scanning routine using a
                   1060: K&R-style/non-prototyped function declaration, you must terminate
                   1061: the definition with a semicolon (;).
                   1062: .PP
                   1063: Whenever
                   1064: .B yylex()
                   1065: is called, it scans tokens from the global input file
                   1066: .I yyin
                   1067: (which defaults to stdin).  It continues until it either reaches
                   1068: an end-of-file (at which point it returns the value 0) or
                   1069: one of its actions executes a
                   1070: .I return
                   1071: statement.
                   1072: .PP
                   1073: If the scanner reaches an end-of-file, subsequent calls are undefined
                   1074: unless either
                   1075: .I yyin
                   1076: is pointed at a new input file (in which case scanning continues from
                   1077: that file), or
                   1078: .B yyrestart()
                   1079: is called.
                   1080: .B yyrestart()
                   1081: takes one argument, a
                   1082: .B FILE *
                   1083: pointer (which can be NULL, if you've set up
                   1084: .B YY_INPUT
                   1085: to scan from a source other than
                   1086: .I yyin),
                   1087: and initializes
                   1088: .I yyin
                   1089: for scanning from that file.  Essentially there is no difference between
                   1090: just assigning
                   1091: .I yyin
                   1092: to a new input file or using
                   1093: .B yyrestart()
                   1094: to do so; the latter is available for compatibility with previous versions
                   1095: of
                   1096: .I flex,
                   1097: and because it can be used to switch input files in the middle of scanning.
                   1098: It can also be used to throw away the current input buffer, by calling
                   1099: it with an argument of
                   1100: .I yyin;
                   1101: but better is to use
                   1102: .B YY_FLUSH_BUFFER
                   1103: (see above).
                   1104: Note that
                   1105: .B yyrestart()
                   1106: does
                   1107: .I not
                   1108: reset the start condition to
                   1109: .B INITIAL
                   1110: (see Start Conditions, below).
                   1111: .PP
                   1112: If
                   1113: .B yylex()
                   1114: stops scanning due to executing a
                   1115: .I return
                   1116: statement in one of the actions, the scanner may then be called again and it
                   1117: will resume scanning where it left off.
                   1118: .PP
                   1119: By default (and for purposes of efficiency), the scanner uses
                   1120: block-reads rather than simple
                   1121: .I getc()
                   1122: calls to read characters from
                   1123: .I yyin.
                   1124: The nature of how it gets its input can be controlled by defining the
                   1125: .B YY_INPUT
                   1126: macro.
                   1127: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".  Its
                   1128: action is to place up to
                   1129: .I max_size
                   1130: characters in the character array
                   1131: .I buf
                   1132: and return in the integer variable
                   1133: .I result
                   1134: either the
                   1135: number of characters read or the constant YY_NULL (0 on Unix systems)
                   1136: to indicate EOF.  The default YY_INPUT reads from the
                   1137: global file-pointer "yyin".
                   1138: .PP
                   1139: A sample definition of YY_INPUT (in the definitions
                   1140: section of the input file):
                   1141: .nf
                   1142:
                   1143:     %{
                   1144:     #define YY_INPUT(buf,result,max_size) \\
                   1145:         { \\
                   1146:         int c = getchar(); \\
                   1147:         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
                   1148:         }
                   1149:     %}
                   1150:
                   1151: .fi
                   1152: This definition will change the input processing to occur
                   1153: one character at a time.
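.PP
The contract described above (store up to max_size characters, report how
many were stored, report 0 at end of input) can also be met by a
block-oriented reader.
The following is a minimal sketch in plain C of such a reader for an
in-memory source.
The struct mem_source and the function mem_read are hypothetical names
invented for this example; they are not provided by flex, and wiring such
a function into a YY_INPUT definition is left to the reader.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source, not part of flex. */
struct mem_source {
    const char *data;   /* start of the text to scan */
    size_t      len;    /* total number of characters */
    size_t      pos;    /* how many have been handed out so far */
};

/*
 * Copies up to max_size characters into buf and returns the number
 * copied.  Returns 0 once the source is exhausted, which plays the role
 * of the end-of-file indication.
 */
size_t mem_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t left = src->len - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}
```

Reading "hello" three characters at a time yields 3, then 2, then 0.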
                   1154: .PP
                   1155: When the scanner receives an end-of-file indication from YY_INPUT,
                   1156: it then checks the
                   1157: .B yywrap()
                   1158: function.  If
                   1159: .B yywrap()
                   1160: returns false (zero), then it is assumed that the
                   1161: function has gone ahead and set up
                   1162: .I yyin
                   1163: to point to another input file, and scanning continues.  If it returns
                   1164: true (non-zero), then the scanner terminates, returning 0 to its
                   1165: caller.  Note that in either case, the start condition remains unchanged;
                   1166: it does
                   1167: .I not
                   1168: revert to
                   1169: .B INITIAL.
                   1170: .PP
                   1171: If you do not supply your own version of
                   1172: .B yywrap(),
                   1173: then you must either use
                   1174: .B %option noyywrap
                   1175: (in which case the scanner behaves as though
                   1176: .B yywrap()
                   1177: returned 1), or you must link with
                   1178: .B \-lfl
                   1179: to obtain the default version of the routine, which always returns 1.
                   1180: .PP
                   1181: Three routines are available for scanning from in-memory buffers rather
                   1182: than files:
                   1183: .B yy_scan_string(), yy_scan_bytes(),
                   1184: and
                   1185: .B yy_scan_buffer().
                   1186: See the discussion of them below in the section Multiple Input Buffers.
                   1187: .PP
                   1188: The scanner writes its
                   1189: .B ECHO
                   1190: output to the
                   1191: .I yyout
                   1192: global (default, stdout), which may be redefined by the user simply
                   1193: by assigning it to some other
                   1194: .B FILE
                   1195: pointer.
                   1196: .SH START CONDITIONS
                   1197: .I flex
                   1198: provides a mechanism for conditionally activating rules.  Any rule
                   1199: whose pattern is prefixed with "<sc>" will only be active when
                   1200: the scanner is in the start condition named "sc".  For example,
                   1201: .nf
                   1202:
                   1203:     <STRING>[^"]*        { /* eat up the string body ... */
                   1204:                 ...
                   1205:                 }
                   1206:
                   1207: .fi
                   1208: will be active only when the scanner is in the "STRING" start
                   1209: condition, and
                   1210: .nf
                   1211:
                   1212:     <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                   1213:                 ...
                   1214:                 }
                   1215:
                   1216: .fi
                   1217: will be active only when the current start condition is
                   1218: either "INITIAL", "STRING", or "QUOTE".
                   1219: .PP
                   1220: Start conditions
                   1221: are declared in the definitions (first) section of the input
                   1222: using unindented lines beginning with either
                   1223: .B %s
                   1224: or
                   1225: .B %x
                   1226: followed by a list of names.
                   1227: The former declares
                   1228: .I inclusive
                   1229: start conditions, the latter
                   1230: .I exclusive
                   1231: start conditions.  A start condition is activated using the
                   1232: .B BEGIN
                   1233: action.  Until the next
                   1234: .B BEGIN
                   1235: action is executed, rules with the given start
                   1236: condition will be active and
                   1237: rules with other start conditions will be inactive.
                   1238: If the start condition is
                   1239: .I inclusive,
                   1240: then rules with no start conditions at all will also be active.
                   1241: If it is
                   1242: .I exclusive,
                   1243: then
                   1244: .I only
                   1245: rules qualified with the start condition will be active.
                   1246: A set of rules contingent on the same exclusive start condition
                   1247: describes a scanner which is independent of any of the other rules in the
                   1248: .I flex
                   1249: input.  Because of this,
                   1250: exclusive start conditions make it easy to specify "mini-scanners"
                   1251: which scan portions of the input that are syntactically different
                   1252: from the rest (e.g., comments).
                   1253: .PP
                   1254: If the distinction between inclusive and exclusive start conditions
                   1255: is still a little vague, here's a simple example illustrating the
                   1256: connection between the two.  The set of rules:
                   1257: .nf
                   1258:
                   1259:     %s example
                   1260:     %%
                   1261:
                   1262:     <example>foo   do_something();
                   1263:
                   1264:     bar            something_else();
                   1265:
                   1266: .fi
                   1267: is equivalent to
                   1268: .nf
                   1269:
                   1270:     %x example
                   1271:     %%
                   1272:
                   1273:     <example>foo   do_something();
                   1274:
                   1275:     <INITIAL,example>bar    something_else();
                   1276:
                   1277: .fi
                   1278: Without the
                   1279: .B <INITIAL,example>
                   1280: qualifier, the
                   1281: .I bar
                   1282: pattern in the second example wouldn't be active (i.e., couldn't match)
                   1283: when in start condition
                   1284: .B example.
                   1285: If we just used
                   1286: .B <example>
                   1287: to qualify
                   1288: .I bar,
                   1289: though, then it would only be active in
                   1290: .B example
                   1291: and not in
                   1292: .B INITIAL,
                   1293: while in the first example it's active in both, because in the first
                   1294: example the
                   1295: .B example
1.10      deraadt  1296: start condition is an
1.1       deraadt  1297: .I inclusive
                   1298: .B (%s)
                   1299: start condition.
                   1300: .PP
                   1301: Also note that the special start-condition specifier
                   1302: .B <*>
                   1303: matches every start condition.  Thus, the above example could also
                   1304: have been written:
                   1305: .nf
                   1306:
                   1307:     %x example
                   1308:     %%
                   1309:
                   1310:     <example>foo   do_something();
                   1311:
                   1312:     <*>bar    something_else();
                   1313:
                   1314: .fi
                   1315: .PP
                   1316: The default rule (to
                   1317: .B ECHO
                   1318: any unmatched character) remains active in start conditions.  It
                   1319: is equivalent to:
                   1320: .nf
                   1321:
                   1322:     <*>.|\\n     ECHO;
                   1323:
                   1324: .fi
                   1325: .PP
                   1326: .B BEGIN(0)
                   1327: returns to the original state where only the rules with
                   1328: no start conditions are active.  This state can also be
                   1329: referred to as the start-condition "INITIAL", so
                   1330: .B BEGIN(INITIAL)
                   1331: is equivalent to
                   1332: .B BEGIN(0).
                   1333: (The parentheses around the start condition name are not required but
                   1334: are considered good style.)
                   1335: .PP
                   1336: .B BEGIN
                   1337: actions can also be given as indented code at the beginning
                   1338: of the rules section.  For example, the following will cause
                   1339: the scanner to enter the "SPECIAL" start condition whenever
                   1340: .B yylex()
                   1341: is called and the global variable
                   1342: .I enter_special
                   1343: is true:
                   1344: .nf
                   1345:
                   1346:             int enter_special;
                   1347:
                   1348:     %x SPECIAL
                   1349:     %%
                   1350:             if ( enter_special )
                   1351:                 BEGIN(SPECIAL);
                   1352:
                   1353:     <SPECIAL>blahblahblah
                   1354:     ...more rules follow...
                   1355:
                   1356: .fi
                   1357: .PP
                   1358: To illustrate the uses of start conditions,
                   1359: here is a scanner which provides two different interpretations
                   1360: of a string like "123.456".  By default it will treat it as
                   1361: three tokens, the integer "123", a dot ('.'), and the integer "456".
                   1362: But if the string is preceded earlier in the line by the string
                   1363: "expect-floats"
                   1364: it will treat it as a single token, the floating-point number
                   1365: 123.456:
                   1366: .nf
                   1367:
                   1368:     %{
                   1369:     #include <stdlib.h>
                   1370:     %}
                   1371:     %s expect
                   1372:
                   1373:     %%
                   1374:     expect-floats        BEGIN(expect);
                   1375:
                   1376:     <expect>[0-9]+"."[0-9]+      {
                   1377:                 printf( "found a float, = %f\\n",
                   1378:                         atof( yytext ) );
                   1379:                 }
                   1380:     <expect>\\n           {
                   1381:                 /* that's the end of the line, so
                   1382:                  * we need another "expect-floats"
                   1383:                  * before we'll recognize any more
                   1384:                  * numbers
                   1385:                  */
                   1386:                 BEGIN(INITIAL);
                   1387:                 }
                   1388:
                   1389:     [0-9]+      {
                   1390:                 printf( "found an integer, = %d\\n",
                   1391:                         atoi( yytext ) );
                   1392:                 }
                   1393:
                   1394:     "."         printf( "found a dot\\n" );
                   1395:
                   1396: .fi
                   1397: Here is a scanner which recognizes (and discards) C comments while
                   1398: maintaining a count of the current input line.
                   1399: .nf
                   1400:
                   1401:     %x comment
                   1402:     %%
                   1403:             int line_num = 1;
                   1404:
                   1405:     "/*"         BEGIN(comment);
                   1406:
                   1407:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1408:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1409:     <comment>\\n             ++line_num;
                   1410:     <comment>"*"+"/"        BEGIN(INITIAL);
                   1411:
                   1412: .fi
                   1413: This scanner goes to a bit of trouble to match as much
                   1414: text as possible with each rule.  In general, when attempting to write
1.10      deraadt  1415: a high-speed scanner try to match as much as possible in each rule, as
1.1       deraadt  1416: it's a big win.
                   1417: .PP
1.10      deraadt  1418: Note that start-condition names are really integer values and
1.1       deraadt  1419: can be stored as such.  Thus, the above could be extended in the
                   1420: following fashion:
                   1421: .nf
                   1422:
                   1423:     %x comment foo
                   1424:     %%
                   1425:             int line_num = 1;
                   1426:             int comment_caller;
                   1427:
                   1428:     "/*"         {
                   1429:                  comment_caller = INITIAL;
                   1430:                  BEGIN(comment);
                   1431:                  }
                   1432:
                   1433:     ...
                   1434:
                   1435:     <foo>"/*"    {
                   1436:                  comment_caller = foo;
                   1437:                  BEGIN(comment);
                   1438:                  }
                   1439:
                   1440:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1441:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1442:     <comment>\\n             ++line_num;
                   1443:     <comment>"*"+"/"        BEGIN(comment_caller);
                   1444:
                   1445: .fi
                   1446: Furthermore, you can access the current start condition using
                   1447: the integer-valued
                   1448: .B YY_START
                   1449: macro.  For example, the above assignments to
                   1450: .I comment_caller
                   1451: could instead be written
                   1452: .nf
                   1453:
                   1454:     comment_caller = YY_START;
                   1455:
                   1456: .fi
                   1457: Flex provides
                   1458: .B YYSTATE
                   1459: as an alias for
                   1460: .B YY_START
                   1461: (since that is what's used by AT&T
                   1462: .I lex).
                   1463: .PP
                   1464: Note that start conditions do not have their own name-space; %s's and %x's
                   1465: declare names in the same fashion as #define's.
                   1466: .PP
                   1467: Finally, here's an example of how to match C-style quoted strings using
                   1468: exclusive start conditions, including expanded escape sequences (but
                   1469: not including checking for a string that's too long):
                   1470: .nf
                   1471:
                   1472:     %x str
                   1473:
                   1474:     %%
                   1475:             char string_buf[MAX_STR_CONST];
                   1476:             char *string_buf_ptr;
                   1477:
                   1478:
                   1479:     \\"      string_buf_ptr = string_buf; BEGIN(str);
                   1480:
                   1481:     <str>\\"        { /* saw closing quote - all done */
                   1482:             BEGIN(INITIAL);
                   1483:             *string_buf_ptr = '\\0';
                   1484:             /* return string constant token type and
                   1485:              * value to parser
                   1486:              */
                   1487:             }
                   1488:
                   1489:     <str>\\n        {
                   1490:             /* error - unterminated string constant */
                   1491:             /* generate error message */
                   1492:             }
                   1493:
                   1494:     <str>\\\\[0-7]{1,3} {
                   1495:             /* octal escape sequence */
                   1496:             unsigned int result;
                   1497:
                   1498:             (void) sscanf( yytext + 1, "%o", &result );
                   1499:
                   1500:             if ( result > 0xff )
                   1501:                     ; /* error, constant is out-of-bounds */
                   1502:             else
                   1503:                     *string_buf_ptr++ = result;
                   1504:             }
                   1505:
                   1506:     <str>\\\\[0-9]+ {
                   1507:             /* generate error - bad escape sequence; something
                   1508:              * like '\\48' or '\\0777777'
                   1509:              */
                   1510:             }
                   1511:
                   1512:     <str>\\\\n  *string_buf_ptr++ = '\\n';
                   1513:     <str>\\\\t  *string_buf_ptr++ = '\\t';
                   1514:     <str>\\\\r  *string_buf_ptr++ = '\\r';
                   1515:     <str>\\\\b  *string_buf_ptr++ = '\\b';
                   1516:     <str>\\\\f  *string_buf_ptr++ = '\\f';
                   1517:
                   1518:     <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];
                   1519:
                   1520:     <str>[^\\\\\\n\\"]+        {
                   1521:             char *yptr = yytext;
                   1522:
                   1523:             while ( *yptr )
                   1524:                     *string_buf_ptr++ = *yptr++;
                   1525:             }
                   1526:
                   1527: .fi
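.PP
The range check in the octal escape rule can also be written as a
stand-alone C function.
The following is only a sketch: the name
.I decode_octal_escape
is hypothetical and is not provided by
.I flex.
It assumes it is handed the text that follows the backslash
(that is, yytext + 1 in the rule above).
.nf

```c
/* Hypothetical helper, not part of flex: decode the one to three
 * octal digits that follow the backslash of an escape sequence.
 * Returns the decoded value, or -1 if it does not fit in one byte.
 */
static int decode_octal_escape(const char *digits)
{
    unsigned int value = 0;
    int n;

    /* Consume at most three characters, stopping at the first
     * one that is not an octal digit.
     */
    for (n = 0; n < 3 && digits[n] >= '0' && digits[n] <= '7'; n++)
        value = value * 8 + (unsigned int)(digits[n] - '0');

    return value > 0xff ? -1 : (int)value;
}
```

.fi
An action could store the result when it is non-negative and report
an error otherwise.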
                   1528: .PP
                   1529: Often, such as in some of the examples above, you wind up writing a
                   1530: whole bunch of rules all preceded by the same start condition(s).  Flex
                   1531: makes this a little easier and cleaner by introducing a notion of
                   1532: start condition
                   1533: .I scope.
                   1534: A start condition scope is begun with:
                   1535: .nf
                   1536:
                   1537:     <SCs>{
                   1538:
                   1539: .fi
                   1540: where
                   1541: .I SCs
                   1542: is a list of one or more start conditions.  Inside the start condition
                   1543: scope, every rule automatically has the prefix
                   1544: .I <SCs>
                   1545: applied to it, until a
                   1546: .I '}'
                   1547: which matches the initial
                   1548: .I '{'.
                   1549: So, for example,
                   1550: .nf
                   1551:
                   1552:     <ESC>{
                   1553:         "\\\\n"   return '\\n';
                   1554:         "\\\\r"   return '\\r';
                   1555:         "\\\\f"   return '\\f';
                   1556:         "\\\\0"   return '\\0';
                   1557:     }
                   1558:
                   1559: .fi
                   1560: is equivalent to:
                   1561: .nf
                   1562:
                   1563:     <ESC>"\\\\n"  return '\\n';
                   1564:     <ESC>"\\\\r"  return '\\r';
                   1565:     <ESC>"\\\\f"  return '\\f';
                   1566:     <ESC>"\\\\0"  return '\\0';
                   1567:
                   1568: .fi
                   1569: Start condition scopes may be nested.
                   1570: .PP
                   1571: Three routines are available for manipulating stacks of start conditions:
                   1572: .TP
                   1573: .B void yy_push_state(int new_state)
                   1574: pushes the current start condition onto the top of the start condition
                   1575: stack and switches to
                   1576: .I new_state
                   1577: as though you had used
                   1578: .B BEGIN new_state
                   1579: (recall that start condition names are also integers).
                   1580: .TP
                   1581: .B void yy_pop_state()
                   1582: pops the top of the stack and switches to it via
                   1583: .B BEGIN.
                   1584: .TP
                   1585: .B int yy_top_state()
                   1586: returns the top of the stack without altering the stack's contents.
                   1587: .PP
                   1588: The start condition stack grows dynamically and so has no built-in
                   1589: size limitation.  If memory is exhausted, program execution aborts.
                   1590: .PP
                   1591: To use start condition stacks, your scanner must include a
                   1592: .B %option stack
                   1593: directive (see Options below).
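.PP
The behaviour of the three stack routines can be pictured with a small
model in plain C.
This is a sketch only, not the code
.I flex
generates: the names
.I current, sc_push, sc_pop,
and
.I sc_top
are hypothetical stand-ins for the active start condition and for
.B yy_push_state(), yy_pop_state(),
and
.B yy_top_state().
Aborting on an empty stack is an assumption made by the model.
.nf

```c
#include <stdlib.h>

/* Toy model of the start condition stack; all names are
 * hypothetical and this is not flex's own implementation.
 */
static int current = 0;        /* models the active start condition */
static int *stack = NULL;      /* saved start conditions */
static size_t depth = 0, capacity = 0;

/* Like yy_push_state(): save the current condition, then switch. */
static void sc_push(int new_state)
{
    if (depth == capacity) {
        size_t new_capacity = capacity ? capacity * 2 : 8;
        int *grown = realloc(stack, new_capacity * sizeof *grown);

        if (grown == NULL)
            abort();           /* out of memory: execution aborts */
        stack = grown;
        capacity = new_capacity;
    }
    stack[depth++] = current;
    current = new_state;
}

/* Like yy_pop_state(): return to the most recently saved condition. */
static void sc_pop(void)
{
    if (depth == 0)
        abort();               /* model's choice for an empty stack */
    current = stack[--depth];
}

/* Like yy_top_state(): peek at the saved condition without popping. */
static int sc_top(void)
{
    if (depth == 0)
        abort();               /* model's choice for an empty stack */
    return stack[depth - 1];
}
```

.fi
In the model, as in the routines it imitates, the value on top of the
stack is the condition that was active before the most recent push,
not the condition that is active now.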
                   1594: .SH MULTIPLE INPUT BUFFERS
                   1595: Some scanners (such as those which support "include" files)
                   1596: require reading from several input streams.  As
                   1597: .I flex
                   1598: scanners do a large amount of buffering, one cannot control
                   1599: where the next input will be read from by simply writing a
                   1600: .B YY_INPUT
                   1601: which is sensitive to the scanning context.
                   1602: .B YY_INPUT
                   1603: is only called when the scanner reaches the end of its buffer, which
                   1604: may be a long time after scanning a statement such as an "include"
                   1605: which requires switching the input source.
                   1606: .PP
                   1607: To negotiate these sorts of problems,
                   1608: .I flex
                   1609: provides a mechanism for creating and switching between multiple
                   1610: input buffers.  An input buffer is created by using:
                   1611: .nf
                   1612:
                   1613:     YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
                   1614:
                   1615: .fi
                   1616: which takes a
                   1617: .I FILE
                   1618: pointer and a size and creates a buffer associated with the given
                   1619: file and large enough to hold
                   1620: .I size
                   1621: characters (when in doubt, use
                   1622: .B YY_BUF_SIZE
                   1623: for the size).  It returns a
                   1624: .B YY_BUFFER_STATE
                   1625: handle, which may then be passed to other routines (see below).  The
                   1626: .B YY_BUFFER_STATE
                   1627: type is a pointer to an opaque
                   1628: .B struct yy_buffer_state
                   1629: structure, so you may safely initialize YY_BUFFER_STATE variables to
                   1630: .B ((YY_BUFFER_STATE) 0)
                   1631: if you wish, and also refer to the opaque structure in order to
                   1632: correctly declare input buffers in source files other than that
                   1633: of your scanner.  Note that the
                   1634: .I FILE
                   1635: pointer in the call to
                   1636: .B yy_create_buffer
                   1637: is only used as the value of
                   1638: .I yyin
                   1639: seen by
                   1640: .B YY_INPUT;
                   1641: if you redefine
                   1642: .B YY_INPUT
                   1643: so it no longer uses
                   1644: .I yyin,
                   1645: then you can safely pass a nil
                   1646: .I FILE
                   1647: pointer to
                   1648: .B yy_create_buffer.
                   1649: You select a particular buffer to scan from using:
                   1650: .nf
                   1651:
                   1652:     void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
                   1653:
                   1654: .fi
                   1655: which switches the scanner's input buffer so subsequent tokens will
                   1656: come from
                   1657: .I new_buffer.
                   1658: Note that
                   1659: .B yy_switch_to_buffer()
                   1660: may be used by yywrap() to set things up for continued scanning, instead
                   1661: of opening a new file and pointing
                   1662: .I yyin
                   1663: at it.  Note also that switching input sources via either
                   1664: .B yy_switch_to_buffer()
                   1665: or
                   1666: .B yywrap()
                   1667: does
                   1668: .I not
                   1669: change the start condition.
                   1670: .nf
                   1671:
                   1672:     void yy_delete_buffer( YY_BUFFER_STATE buffer )
                   1673:
                   1674: .fi
                   1675: is used to reclaim the storage associated with a buffer.
                   1676: .RB ( buffer
                   1677: can be nil, in which case the routine does nothing.)
                   1678: You can also clear the current contents of a buffer using:
                   1679: .nf
                   1680:
                   1681:     void yy_flush_buffer( YY_BUFFER_STATE buffer )
                   1682:
                   1683: .fi
                   1684: This function discards the buffer's contents,
                   1685: so the next time the scanner attempts to match a token from the
                   1686: buffer, it will first fill the buffer anew using
                   1687: .B YY_INPUT.
                   1688: .PP
                   1689: .B yy_new_buffer()
                   1690: is an alias for
                   1691: .B yy_create_buffer(),
                   1692: provided for compatibility with the C++ use of
                   1693: .I new
                   1694: and
                   1695: .I delete
                   1696: for creating and destroying dynamic objects.
                   1697: .PP
                   1698: Finally, the
                   1699: .B YY_CURRENT_BUFFER
                   1700: macro returns a
                   1701: .B YY_BUFFER_STATE
                   1702: handle to the current buffer.
                   1703: .PP
                   1704: Here is an example of using these features for writing a scanner
                   1705: which expands include files (the
                   1706: .B <<EOF>>
                   1707: feature is discussed below):
                   1708: .nf
                   1709:
                   1710:     /* the "incl" state is used for picking up the name
                   1711:      * of an include file
                   1712:      */
                   1713:     %x incl
                   1714:
                   1715:     %{
                   1716:     #define MAX_INCLUDE_DEPTH 10
                   1717:     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
                   1718:     int include_stack_ptr = 0;
                   1719:     %}
                   1720:
                   1721:     %%
                   1722:     include             BEGIN(incl);
                   1723:
                   1724:     [a-z]+              ECHO;
                   1725:     [^a-z\\n]*\\n?        ECHO;
                   1726:
                   1727:     <incl>[ \\t]*      /* eat the whitespace */
                   1728:     <incl>[^ \\t\\n]+   { /* got the include file name */
                   1729:             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                   1730:                 {
                   1731:                 fprintf( stderr, "Includes nested too deeply\\n" );
                   1732:                 exit( 1 );
                   1733:                 }
                   1734:
                   1735:             include_stack[include_stack_ptr++] =
                   1736:                 YY_CURRENT_BUFFER;
                   1737:
                   1738:             yyin = fopen( yytext, "r" );
                   1739:
                   1740:             if ( ! yyin )
                   1741:                 error( ... );
                   1742:
                   1743:             yy_switch_to_buffer(
                   1744:                 yy_create_buffer( yyin, YY_BUF_SIZE ) );
                   1745:
                   1746:             BEGIN(INITIAL);
                   1747:             }
                   1748:
                   1749:     <<EOF>> {
                   1750:             if ( --include_stack_ptr < 0 )
                   1751:                 {
                   1752:                 yyterminate();
                   1753:                 }
                   1754:
                   1755:             else
                   1756:                 {
                   1757:                 yy_delete_buffer( YY_CURRENT_BUFFER );
                   1758:                 yy_switch_to_buffer(
                   1759:                      include_stack[include_stack_ptr] );
                   1760:                 }
                   1761:             }
                   1762:
                   1763: .fi
                   1764: Three routines are available for setting up input buffers for
                   1765: scanning in-memory strings instead of files.  All of them create
                   1766: a new input buffer for scanning the string, and return a corresponding
                   1767: .B YY_BUFFER_STATE
                   1768: handle (which you should delete with
                   1769: .B yy_delete_buffer()
                   1770: when done with it).  They also switch to the new buffer using
                   1771: .B yy_switch_to_buffer(),
                   1772: so the next call to
                   1773: .B yylex()
                   1774: will start scanning the string.
                   1775: .TP
                   1776: .B yy_scan_string(const char *str)
                   1777: scans a NUL-terminated string.
                   1778: .TP
                   1779: .B yy_scan_bytes(const char *bytes, int len)
                   1780: scans
                   1781: .I len
                   1782: bytes (including possibly NUL's)
                   1783: starting at location
                   1784: .I bytes.
                   1785: .PP
                   1786: Note that both of these functions create and scan a
                   1787: .I copy
                   1788: of the string or bytes.  (This may be desirable, since
                   1789: .B yylex()
                   1790: modifies the contents of the buffer it is scanning.)  You can avoid the
                   1791: copy by using:
                   1792: .TP
                   1793: .B yy_scan_buffer(char *base, yy_size_t size)
                   1794: which scans in place the buffer starting at
                   1795: .I base,
                   1796: consisting of
                   1797: .I size
                   1798: bytes, the last two bytes of which
                   1799: .I must
                   1800: be
                   1801: .B YY_END_OF_BUFFER_CHAR
                   1802: (ASCII NUL).
                   1803: These last two bytes are not scanned; thus, scanning
                   1804: consists of
                   1805: .B base[0]
                   1806: through
                   1807: .B base[size-3],
                   1808: inclusive.
                   1809: .IP
                   1810: If you fail to set up
                   1811: .I base
                   1812: in this manner (i.e., forget the final two
                   1813: .B YY_END_OF_BUFFER_CHAR
                   1814: bytes), then
                   1815: .B yy_scan_buffer()
                   1816: returns a nil pointer instead of creating a new input buffer.
                   1817: .IP
                   1818: The type
                   1819: .B yy_size_t
                   1820: is an integral type to which you can cast an integer expression
                   1821: reflecting the size of the buffer.
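.PP
Preparing a buffer in the layout required by
.B yy_scan_buffer()
might look like the following sketch.
The helper
.I make_scan_buffer
is hypothetical and not part of
.I flex;
it assumes that
.B YY_END_OF_BUFFER_CHAR
is the zero byte, as stated above.
.nf

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of flex: copy len bytes of src
 * into a freshly allocated buffer of len + 2 bytes whose final
 * two bytes are zero, the layout yy_scan_buffer() requires.
 * The total size is stored in *size_out; the caller frees the
 * returned buffer once the scanner is done with it.
 */
static char *make_scan_buffer(const char *src, size_t len,
                              size_t *size_out)
{
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;

    memcpy(base, src, len);
    base[len] = 0;       /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = 0;   /* second YY_END_OF_BUFFER_CHAR */
    *size_out = len + 2;
    return base;
}
```

.fi
Inside a scanner, the result would then be passed on as
yy_scan_buffer( base, (yy_size_t) size ), checking for a nil return
value.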
                   1822: .SH END-OF-FILE RULES
                   1823: The special rule "<<EOF>>" indicates
                   1824: actions which are to be taken when an end-of-file is
                   1825: encountered and yywrap() returns non-zero (i.e., indicates
                   1826: no further files to process).  The action must finish
                   1827: by doing one of four things:
                   1828: .IP -
                   1829: assigning
                   1830: .I yyin
                   1831: to a new input file (in previous versions of flex, after doing the
                   1832: assignment you had to call the special action
                   1833: .B YY_NEW_FILE;
                   1834: this is no longer necessary);
                   1835: .IP -
                   1836: executing a
                   1837: .I return
                   1838: statement;
                   1839: .IP -
                   1840: executing the special
                   1841: .B yyterminate()
                   1842: action;
                   1843: .IP -
                   1844: or, switching to a new buffer using
                   1845: .B yy_switch_to_buffer()
                   1846: as shown in the example above.
                   1847: .PP
                   1848: <<EOF>> rules may not be used with other
                   1849: patterns; they may only be qualified with a list of start
                   1850: conditions.  If an unqualified <<EOF>> rule is given, it
                   1851: applies to
                   1852: .I all
                   1853: start conditions which do not already have <<EOF>> actions.  To
                   1854: specify an <<EOF>> rule for only the initial start condition, use
                   1855: .nf
                   1856:
                   1857:     <INITIAL><<EOF>>
                   1858:
                   1859: .fi
                   1860: .PP
                   1861: These rules are useful for catching things like unclosed comments.
                   1862: An example:
                   1863: .nf
                   1864:
                   1865:     %x quote
                   1866:     %%
                   1867:
                   1868:     ...other rules for dealing with quotes...
                   1869:
                   1870:     <quote><<EOF>>   {
                   1871:              error( "unterminated quote" );
                   1872:              yyterminate();
                   1873:              }
                   1874:     <<EOF>>  {
                   1875:              if ( *++filelist )
                   1876:                  yyin = fopen( *filelist, "r" );
                   1877:              else
                   1878:                  yyterminate();
                   1879:              }
                   1880:
                   1881: .fi
                   1882: .SH MISCELLANEOUS MACROS
                   1883: The macro
                   1884: .B YY_USER_ACTION
                   1885: can be defined to provide an action
                   1886: which is always executed prior to the matched rule's action.  For example,
                   1887: it could be #define'd to call a routine to convert yytext to lower-case.
                   1888: When
                   1889: .B YY_USER_ACTION
                   1890: is invoked, the variable
                   1891: .I yy_act
                   1892: gives the number of the matched rule (rules are numbered starting with 1).
                   1893: Suppose you want to profile how often each of your rules is matched.  The
                   1894: following would do the trick:
                   1895: .nf
                   1896:
                   1897:     #define YY_USER_ACTION ++ctr[yy_act]
                   1898:
                   1899: .fi
                   1900: where
                   1901: .I ctr
                   1902: is an array to hold the counts for the different rules.  Note that
                   1903: the macro
                   1904: .B YY_NUM_RULES
                   1905: gives the total number of rules (including the default rule, even if
                   1906: you use
                   1907: .B \-s),
                   1908: so a correct declaration for
                   1909: .I ctr
                   1910: is:
                   1911: .nf
                   1912:
                   1913:     int ctr[YY_NUM_RULES];
                   1914:
                   1915: .fi
                   1916: .PP
                   1917: The macro
                   1918: .B YY_USER_INIT
                   1919: may be defined to provide an action which is always executed before
                   1920: the first scan (and before the scanner's internal initializations are done).
                   1921: For example, it could be used to call a routine to read
                   1922: in a data table or open a logging file.
                   1923: .PP
                   1924: The macro
                   1925: .B yy_set_interactive(is_interactive)
                   1926: can be used to control whether the current buffer is considered
                   1927: .I interactive.
                   1928: An interactive buffer is processed more slowly,
                   1929: but must be used when the scanner's input source is indeed
                   1930: interactive to avoid problems due to waiting to fill buffers
                   1931: (see the discussion of the
                   1932: .B \-I
                   1933: flag below).  A non-zero value
1.7       aaron    1934: in the macro invocation marks the buffer as interactive, a zero
1.1       deraadt  1935: value as non-interactive.  Note that use of this macro overrides
                   1936: .B %option always-interactive
                   1937: or
                   1938: .B %option never-interactive
                   1939: (see Options below).
                   1940: .B yy_set_interactive()
                   1941: must be invoked prior to beginning to scan the buffer that is
                   1942: (or is not) to be considered interactive.
                   1943: .PP
                   1944: The macro
                   1945: .B yy_set_bol(at_bol)
                   1946: can be used to control whether the current buffer's scanning
                   1947: context for the next token match is done as though at the
                   1948: beginning of a line.  A non-zero macro argument makes rules anchored with
1.10      deraadt  1949: \'^' active, while a zero argument makes '^' rules inactive.
1.1       deraadt  1950: .PP
                   1951: The macro
                   1952: .B YY_AT_BOL()
                   1953: returns true if the next token scanned from the current buffer
                   1954: will have '^' rules active, false otherwise.
                   1955: .PP
                   1956: In the generated scanner, the actions are all gathered in one large
                   1957: switch statement and separated using
                   1958: .B YY_BREAK,
                   1959: which may be redefined.  By default, it is simply a "break", to separate
1.10      deraadt  1960: each rule's action from the following rules.
1.1       deraadt  1961: Redefining
                   1962: .B YY_BREAK
                   1963: allows, for example, C++ users to
                   1964: #define YY_BREAK to do nothing (while being very careful that every
                   1965: rule ends with a "break" or a "return"!) to avoid suffering from
                   1966: unreachable statement warnings in cases where a rule's action ends with
                   1967: "return", so that the
                   1968: .B YY_BREAK
                   1969: is inaccessible.
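.PP
The trade-off can be sketched in plain C, outside of any generated scanner.
The switch below is only a hypothetical stand-in for the scanner's action
switch: ACTION_BREAK, run_action(), the rule numbers, and the token value 258
are illustrative names and values, not part of flex.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for YY_BREAK.  Like the flex default, it
 * expands to "break;".  Building with -DACTION_BREAK= makes it empty.
 */
#ifndef ACTION_BREAK
#define ACTION_BREAK break;
#endif

/* Mimics a switch over matched-rule numbers; returns a token or 0. */
static int
run_action(int act, int *words)
{
	assert(words != NULL);
	switch (act) {
	case 1:		/* action falls off its end: relies on the break */
		++*words;
		ACTION_BREAK
	case 2:		/* action ends in return: the break is dead code */
		return 258;
		ACTION_BREAK
	}
	return 0;
}
```

With the separator left at its default, the statement after the return in
case 2 can never execute, which is what some compilers warn about.  With the
separator defined to nothing, the warning goes away, but case 1 would fall
through into case 2, which is why every action must then end with its own
"break" or "return".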
                   1970: .SH VALUES AVAILABLE TO THE USER
                   1971: This section summarizes the various values available to the user
                   1972: in the rule actions.
                   1973: .IP -
                   1974: .B char *yytext
                   1975: holds the text of the current token.  It may be modified but not lengthened
                   1976: (you cannot append characters to the end).
                   1977: .IP
                   1978: If the special directive
                   1979: .B %array
                   1980: appears in the first section of the scanner description, then
                   1981: .B yytext
                   1982: is instead declared
                   1983: .B char yytext[YYLMAX],
                   1984: where
                   1985: .B YYLMAX
                   1986: is a macro definition that you can redefine in the first section
                   1987: if you don't like the default value (generally 8KB).  Using
                   1988: .B %array
                   1989: results in somewhat slower scanners, but the value of
                   1990: .B yytext
                   1991: becomes immune to calls to
                   1992: .I input()
                   1993: and
                   1994: .I unput(),
                   1995: which potentially destroy its value when
                   1996: .B yytext
                   1997: is a character pointer.  The opposite of
                   1998: .B %array
                   1999: is
                   2000: .B %pointer,
                   2001: which is the default.
                   2002: .IP
                   2003: You cannot use
                   2004: .B %array
                   2005: when generating C++ scanner classes
                   2006: (the
                   2007: .B \-+
                   2008: flag).
                   2009: .IP -
                   2010: .B int yyleng
                   2011: holds the length of the current token.
                   2012: .IP -
                   2013: .B FILE *yyin
                   2014: is the file which by default
                   2015: .I flex
                   2016: reads from.  It may be redefined but doing so only makes sense before
                   2017: scanning begins or after an EOF has been encountered.  Changing it in
                   2018: the midst of scanning will have unexpected results since
                   2019: .I flex
                   2020: buffers its input; use
                   2021: .B yyrestart()
                   2022: instead.
                   2023: Once scanning terminates because an end-of-file
                   2024: has been seen, you can assign
                   2025: .I yyin
                   2026: to the new input file and then call the scanner again to continue scanning.
                   2027: .IP -
                   2028: .B void yyrestart( FILE *new_file )
                   2029: may be called to point
                   2030: .I yyin
                   2031: at the new input file.  The switch-over to the new file is immediate
                   2032: (any previously buffered-up input is lost).  Note that calling
                   2033: .B yyrestart()
                   2034: with
                   2035: .I yyin
                   2036: as an argument thus throws away the current input buffer and continues
                   2037: scanning the same input file.
                   2038: .IP -
                   2039: .B FILE *yyout
                   2040: is the file to which
                   2041: .B ECHO
                   2042: actions are done.  It can be reassigned by the user.
                   2043: .IP -
                   2044: .B YY_CURRENT_BUFFER
                   2045: returns a
                   2046: .B YY_BUFFER_STATE
                   2047: handle to the current buffer.
                   2048: .IP -
                   2049: .B YY_START
                   2050: returns an integer value corresponding to the current start
                   2051: condition.  You can subsequently use this value with
                   2052: .B BEGIN
                   2053: to return to that start condition.
                   2054: .SH INTERFACING WITH YACC
                   2055: One of the main uses of
                   2056: .I flex
                   2057: is as a companion to the
                   2058: .I yacc
                   2059: parser-generator.
                   2060: .I yacc
                   2061: parsers expect to call a routine named
                   2062: .B yylex()
                   2063: to find the next input token.  The routine is supposed to
                   2064: return the type of the next token as well as putting any associated
                   2065: value in the global
                   2066: .B yylval.
                   2067: To use
                   2068: .I flex
                   2069: with
                   2070: .I yacc,
                   2071: one specifies the
                   2072: .B \-d
                   2073: option to
                   2074: .I yacc
                   2075: to instruct it to generate the file
                   2076: .B y.tab.h
                   2077: containing definitions of all the
                   2078: .B %tokens
                   2079: appearing in the
                   2080: .I yacc
                   2081: input.  This file is then included in the
                   2082: .I flex
                   2083: scanner.  For example, if one of the tokens is "TOK_NUMBER",
                   2084: part of the scanner might look like:
                   2085: .nf
                   2086:
                   2087:     %{
                   2088:     #include "y.tab.h"
                   2089:     %}
                   2090:
                   2091:     %%
                   2092:
                   2093:     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
                   2094:
                   2095: .fi
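.PP
The contract the parser relies on can be sketched without either tool.
The hand-written routine below is only a stand-in for a generated yylex();
the value given to TOK_NUMBER, the input pointer, and the helper set_input()
are assumptions made for illustration (in a real build the token definitions
come from y.tab.h and the routine is produced by flex).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define TOK_NUMBER 258		/* assumed value; normally from y.tab.h */

int yylval;			/* semantic value shared with the parser */
static const char *input = "";	/* hypothetical input source */

/* Hypothetical helper: point the stand-in scanner at a string. */
static void
set_input(const char *s)
{
	assert(s != NULL);
	input = s;
}

/* Returns the token type, or 0 at end of input; the value goes in yylval. */
int
yylex(void)
{
	while (*input == ' ')
		input++;
	if (*input == '\0')
		return 0;
	if (isdigit((unsigned char)*input)) {
		char *end;

		yylval = (int)strtol(input, &end, 10);
		input = end;
		return TOK_NUMBER;
	}
	return (unsigned char)*input++;	/* single-character token */
}
```

The parser only ever sees the return value and yylval, so anything that
honors this interface, generated or hand-written, can serve as its scanner.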
                   2096: .SH OPTIONS
                   2097: .I flex
                   2098: has the following options:
                   2099: .TP
                   2100: .B \-b
                   2101: Generate backing-up information to
                   2102: .I lex.backup.
                   2103: This is a list of scanner states which require backing up
                   2104: and the input characters on which they do so.  By adding rules one
                   2105: can remove backing-up states.  If
                   2106: .I all
                   2107: backing-up states are eliminated and
                   2108: .B \-Cf
                   2109: or
                   2110: .B \-CF
                   2111: is used, the generated scanner will run faster (see the
                   2112: .B \-p
                   2113: flag).  Only users who wish to squeeze every last cycle out of their
                   2114: scanners need worry about this option.  (See the section on Performance
                   2115: Considerations below.)
                   2116: .TP
                   2117: .B \-c
                   2118: is a do-nothing, deprecated option included for POSIX compliance.
                   2119: .TP
                   2120: .B \-d
                   2121: makes the generated scanner run in
                   2122: .I debug
                   2123: mode.  Whenever a pattern is recognized and the global
                   2124: .B yy_flex_debug
                   2125: is non-zero (which is the default),
                   2126: the scanner will write to
                   2127: .I stderr
                   2128: a line of the form:
                   2129: .nf
                   2130:
                   2131:     --accepting rule at line 53 ("the matched text")
                   2132:
                   2133: .fi
                   2134: The line number refers to the location of the rule in the file
                   2135: defining the scanner (i.e., the file that was fed to flex).  Messages
                   2136: are also generated when the scanner backs up, accepts the
                   2137: default rule, reaches the end of its input buffer (or encounters
                   2138: a NUL; at this point, the two look the same as far as the scanner's concerned),
                   2139: or reaches an end-of-file.
                   2140: .TP
                   2141: .B \-f
                   2142: specifies
                   2143: .I fast scanner.
                   2144: No table compression is done and stdio is bypassed.
                   2145: The result is large but fast.  This option is equivalent to
                   2146: .B \-Cfr
                   2147: (see below).
                   2148: .TP
                   2149: .B \-h
                   2150: generates a "help" summary of
                   2151: .I flex's
                   2152: options to
1.7       aaron    2153: .I stdout
1.1       deraadt  2154: and then exits.
                   2155: .B \-?
                   2156: and
                   2157: .B \-\-help
                   2158: are synonyms for
                   2159: .B \-h.
                   2160: .TP
                   2161: .B \-i
                   2162: instructs
                   2163: .I flex
                   2164: to generate a
                   2165: .I case-insensitive
                   2166: scanner.  The case of letters given in the
                   2167: .I flex
                   2168: input patterns will
                   2169: be ignored, and tokens in the input will be matched regardless of case.  The
                   2170: matched text given in
                   2171: .I yytext
                   2172: will have the preserved case (i.e., it will not be folded).
                   2173: .TP
                   2174: .B \-l
                   2175: turns on maximum compatibility with the original AT&T
                   2176: .I lex
                   2177: implementation.  Note that this does not mean
                   2178: .I full
                   2179: compatibility.  Use of this option costs a considerable amount of
                   2180: performance, and it cannot be used with the
                   2181: .B \-+, \-f, \-F, \-Cf,
                   2182: or
                   2183: .B \-CF
                   2184: options.  For details on the compatibilities it provides, see the section
                   2185: "Incompatibilities With Lex And POSIX" below.  This option also results
                   2186: in the name
                   2187: .B YY_FLEX_LEX_COMPAT
                   2188: being #define'd in the generated scanner.
                   2189: .TP
                   2190: .B \-n
                   2191: is another do-nothing, deprecated option included only for
                   2192: POSIX compliance.
                   2193: .TP
                   2194: .B \-p
                   2195: generates a performance report to stderr.  The report
                   2196: consists of comments regarding features of the
                   2197: .I flex
                   2198: input file which will cause a serious loss of performance in the resulting
                   2199: scanner.  If you give the flag twice, you will also get comments regarding
                   2200: features that lead to minor performance losses.
                   2201: .IP
                   2202: Note that the use of
                   2203: .B REJECT,
                   2204: .B %option yylineno,
                   2205: and variable trailing context (see the Deficiencies / Bugs section below)
                   2206: entails a substantial performance penalty; use of
                   2207: .I yymore(),
                   2208: the
                   2209: .B ^
                   2210: operator,
                   2211: and the
                   2212: .B \-I
                   2213: flag entail minor performance penalties.
                   2214: .TP
                   2215: .B \-s
                   2216: causes the
                   2217: .I default rule
                   2218: (that unmatched scanner input is echoed to
                   2219: .I stdout)
                   2220: to be suppressed.  If the scanner encounters input that does not
                   2221: match any of its rules, it aborts with an error.  This option is
                   2222: useful for finding holes in a scanner's rule set.
                   2223: .TP
                   2224: .B \-t
                   2225: instructs
                   2226: .I flex
                   2227: to write the scanner it generates to standard output instead
                   2228: of
                   2229: .B lex.yy.c.
                   2230: .TP
                   2231: .B \-v
                   2232: specifies that
                   2233: .I flex
                   2234: should write to
                   2235: .I stderr
                   2236: a summary of statistics regarding the scanner it generates.
                   2237: Most of the statistics are meaningless to the casual
                   2238: .I flex
                   2239: user, but the first line identifies the version of
                   2240: .I flex
                   2241: (same as reported by
                   2242: .B \-V),
                   2243: and the next line the flags used when generating the scanner, including
                   2244: those that are on by default.
                   2245: .TP
                   2246: .B \-w
                   2247: suppresses warning messages.
                   2248: .TP
                   2249: .B \-B
                   2250: instructs
                   2251: .I flex
                   2252: to generate a
                   2253: .I batch
                   2254: scanner, the opposite of
                   2255: .I interactive
                   2256: scanners generated by
                   2257: .B \-I
                   2258: (see below).  In general, you use
                   2259: .B \-B
                   2260: when you are
                   2261: .I certain
                   2262: that your scanner will never be used interactively, and you want to
                   2263: squeeze a
                   2264: .I little
                   2265: more performance out of it.  If your goal is instead to squeeze out a
                   2266: .I lot
                   2267: more performance, you should be using the
                   2268: .B \-Cf
                   2269: or
                   2270: .B \-CF
                   2271: options (discussed below), which turn on
                   2272: .B \-B
                   2273: automatically anyway.
                   2274: .TP
                   2275: .B \-F
                   2276: specifies that the
                   2277: .ul
                   2278: fast
                   2279: scanner table representation should be used (and stdio
                   2280: bypassed).  This representation is
                   2281: about as fast as the full table representation
                   2282: .B (\-f),
                   2283: and for some sets of patterns will be considerably smaller (and for
                   2284: others, larger).  In general, if the pattern set contains both "keywords"
                   2285: and a catch-all, "identifier" rule, such as in the set:
                   2286: .nf
                   2287:
                   2288:     "case"    return TOK_CASE;
                   2289:     "switch"  return TOK_SWITCH;
                   2290:     ...
                   2291:     "default" return TOK_DEFAULT;
                   2292:     [a-z]+    return TOK_ID;
                   2293:
                   2294: .fi
                   2295: then you're better off using the full table representation.  If only
                   2296: the "identifier" rule is present and you then use a hash table or some such
                   2297: to detect the keywords, you're better off using
                   2298: .B \-F.
                   2299: .IP
                   2300: This option is equivalent to
                   2301: .B \-CFr
                   2302: (see below).  It cannot be used with
                   2303: .B \-+.
                   2304: .TP
                   2305: .B \-I
                   2306: instructs
                   2307: .I flex
                   2308: to generate an
                   2309: .I interactive
                   2310: scanner.  An interactive scanner is one that only looks ahead to decide
                   2311: what token has been matched if it absolutely must.  It turns out that
                   2312: always looking one extra character ahead, even if the scanner has already
                   2313: seen enough text to disambiguate the current token, is a bit faster than
                   2314: only looking ahead when necessary.  But scanners that always look ahead
                   2315: give dreadful interactive performance; for example, when a user types
                   2316: a newline, it is not recognized as a newline token until they enter
                   2317: .I another
                   2318: token, which often means typing in another whole line.
                   2319: .IP
                   2320: .I Flex
                   2321: scanners default to
                   2322: .I interactive
                   2323: unless you use the
                   2324: .B \-Cf
                   2325: or
                   2326: .B \-CF
                   2327: table-compression options (see below).  That's because if you're looking
                   2328: for high-performance you should be using one of these options, so if you
                   2329: didn't,
                   2330: .I flex
                   2331: assumes you'd rather trade off a bit of run-time performance for intuitive
                   2332: interactive behavior.  Note also that you
                   2333: .I cannot
                   2334: use
                   2335: .B \-I
                   2336: in conjunction with
                   2337: .B \-Cf
                   2338: or
                   2339: .B \-CF.
                   2340: Thus, this option is not really needed; it is on by default for all those
                   2341: cases in which it is allowed.
                   2342: .IP
                   2343: You can force a scanner to
                   2344: .I not
                   2345: be interactive by using
                   2346: .B \-B
                   2347: (see above).
                   2348: .TP
                   2349: .B \-L
                   2350: instructs
                   2351: .I flex
                   2352: not to generate
                   2353: .B #line
                   2354: directives.  Without this option,
                   2355: .I flex
                   2356: peppers the generated scanner
                   2357: with #line directives so error messages in the actions will be correctly
                   2358: located with respect to either the original
                   2359: .I flex
                   2360: input file (if the errors are due to code in the input file), or
                   2361: .B lex.yy.c
                   2362: (if the errors are
                   2363: .I flex's
                   2364: fault -- you should report these sorts of errors to the email address
                   2365: given below).
                   2366: .TP
                   2367: .B \-T
                   2368: makes
                   2369: .I flex
                   2370: run in
                   2371: .I trace
                   2372: mode.  It will generate a lot of messages to
                   2373: .I stderr
                   2374: concerning
                   2375: the form of the input and the resultant non-deterministic and deterministic
                   2376: finite automata.  This option is mostly for use in maintaining
                   2377: .I flex.
                   2378: .TP
                   2379: .B \-V
                   2380: prints the version number to
                   2381: .I stdout
                   2382: and exits.
                   2383: .B \-\-version
                   2384: is a synonym for
                   2385: .B \-V.
                   2386: .TP
                   2387: .B \-7
                   2388: instructs
                   2389: .I flex
                   2390: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
                   2391: characters in its input.  The advantage of using
                   2392: .B \-7
                   2393: is that the scanner's tables can be up to half the size of those generated
                   2394: using the
                   2395: .B \-8
                   2396: option (see below).  The disadvantage is that such scanners often hang
                   2397: or crash if their input contains an 8-bit character.
                   2398: .IP
                   2399: Note, however, that unless you generate your scanner using the
                   2400: .B \-Cf
                   2401: or
                   2402: .B \-CF
                   2403: table compression options, use of
                   2404: .B \-7
                   2405: will save only a small amount of table space, and make your scanner
                   2406: considerably less portable.
                   2407: .I Flex's
                   2408: default behavior is to generate an 8-bit scanner unless you use the
                   2409: .B \-Cf
                   2410: or
                   2411: .B \-CF
                   2412: options, in which case
                   2413: .I flex
                   2414: defaults to generating 7-bit scanners unless your site was always
                   2415: configured to generate 8-bit scanners (as will often be the case
                   2416: with non-USA sites).  You can tell whether flex generated a 7-bit
                   2417: or an 8-bit scanner by inspecting the flag summary in the
                   2418: .B \-v
                   2419: output as described above.
                   2420: .IP
                   2421: Note that if you use
                   2422: .B \-Cfe
                   2423: or
                   2424: .B \-CFe
                   2425: (those table compression options, but also using equivalence classes as
                   2426: discussed below), flex still defaults to generating an 8-bit
                   2427: scanner, since usually with these compression options full 8-bit tables
                   2428: are not much more expensive than 7-bit tables.
                   2429: .TP
                   2430: .B \-8
                   2431: instructs
                   2432: .I flex
                   2433: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
                   2434: characters.  This flag is only needed for scanners generated using
                   2435: .B \-Cf
                   2436: or
                   2437: .B \-CF,
                   2438: as otherwise flex defaults to generating an 8-bit scanner anyway.
                   2439: .IP
                   2440: See the discussion of
                   2441: .B \-7
                   2442: above for flex's default behavior and the tradeoffs between 7-bit
                   2443: and 8-bit scanners.
                   2444: .TP
                   2445: .B \-+
                   2446: specifies that you want flex to generate a C++
                   2447: scanner class.  See the section on Generating C++ Scanners below for
                   2448: details.
1.7       aaron    2449: .TP
1.1       deraadt  2450: .B \-C[aefFmr]
                   2451: controls the degree of table compression and, more generally, trade-offs
                   2452: between small scanners and fast scanners.
                   2453: .IP
                   2454: .B \-Ca
                   2455: ("align") instructs flex to trade off larger tables in the
                   2456: generated scanner for faster performance because the elements of
                   2457: the tables are better aligned for memory access and computation.  On some
                   2458: RISC architectures, fetching and manipulating longwords is more efficient
                   2459: than with smaller-sized units such as shortwords.  This option can
                   2460: double the size of the tables used by your scanner.
                   2461: .IP
                   2462: .B \-Ce
                   2463: directs
                   2464: .I flex
                   2465: to construct
                   2466: .I equivalence classes,
                   2467: i.e., sets of characters
                   2468: which have identical lexical properties (for example, if the only
                   2469: appearance of digits in the
                   2470: .I flex
                   2471: input is in the character class
                   2472: "[0-9]" then the digits '0', '1', ..., '9' will all be put
                   2473: in the same equivalence class).  Equivalence classes usually give
                   2474: dramatic reductions in the final table/object file sizes (typically
                   2475: a factor of 2-5) and are pretty cheap performance-wise (one array
                   2476: look-up per character scanned).
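.IP
The idea behind equivalence classes can be sketched in C.
This is a simplified illustration, not the algorithm flex itself uses:
it assumes the patterns have already been reduced to a list of character
classes, and the names build_ec() and demo_classes() are hypothetical.
Two characters share an equivalence class exactly when they belong to the
same character classes.

```c
#include <assert.h>
#include <string.h>

#define NCHARS 256

/*
 * sets[i][c] is non-zero if character c belongs to the i-th character
 * class.  On return ec[c] holds the equivalence class of c, numbered
 * from 1; the function returns the number of classes found.
 */
static int
build_ec(unsigned char sets[][NCHARS], int nsets, int ec[NCHARS])
{
	int c, d, i, nclasses = 0;

	assert(nsets >= 0);
	for (c = 0; c < NCHARS; c++) {
		ec[c] = 0;
		/* look for an earlier character with identical membership */
		for (d = 0; d < c && ec[c] == 0; d++) {
			for (i = 0; i < nsets; i++)
				if (!sets[i][c] != !sets[i][d])
					break;
			if (i == nsets)
				ec[c] = ec[d];
		}
		if (ec[c] == 0)
			ec[c] = ++nclasses;
	}
	return nclasses;
}

/* Usage: a scanner whose only character classes are [0-9] and [a-z]. */
static int
demo_classes(int ec[NCHARS])
{
	static unsigned char sets[2][NCHARS];
	int c;

	memset(sets, 0, sizeof(sets));
	for (c = '0'; c <= '9'; c++)
		sets[0][c] = 1;
	for (c = 'a'; c <= 'z'; c++)	/* assumes a contiguous a-z, as in ASCII */
		sets[1][c] = 1;
	return build_ec(sets, 2, ec);
}
```

A scanner built this way indexes its transition tables by ec[c] rather than
by c, which is the one extra array look-up per character mentioned above.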
                   2477: .IP
                   2478: .B \-Cf
                   2479: specifies that the
                   2480: .I full
                   2481: scanner tables should be generated -
                   2482: .I flex
                   2483: should not compress the
1.10      deraadt  2484: tables by taking advantage of similar transition functions for
1.1       deraadt  2485: different states.
                   2486: .IP
                   2487: .B \-CF
                   2488: specifies that the alternate fast scanner representation (described
                   2489: above under the
                   2490: .B \-F
                   2491: flag)
                   2492: should be used.  This option cannot be used with
                   2493: .B \-+.
                   2494: .IP
                   2495: .B \-Cm
                   2496: directs
                   2497: .I flex
                   2498: to construct
                   2499: .I meta-equivalence classes,
                   2500: which are sets of equivalence classes (or characters, if equivalence
                   2501: classes are not being used) that are commonly used together.  Meta-equivalence
                   2502: classes are often a big win when using compressed tables, but they
                   2503: have a moderate performance impact (one or two "if" tests and one
                   2504: array look-up per character scanned).
                   2505: .IP
                   2506: .B \-Cr
                   2507: causes the generated scanner to
                   2508: .I bypass
                   2509: use of the standard I/O library (stdio) for input.  Instead of calling
                   2510: .B fread()
                   2511: or
                   2512: .B getc(),
                   2513: the scanner will use the
                   2514: .B read()
                   2515: system call, resulting in a performance gain which varies from system
                   2516: to system, but in general is probably negligible unless you are also using
                   2517: .B \-Cf
                   2518: or
                   2519: .B \-CF.
                   2520: Using
                   2521: .B \-Cr
                   2522: can cause strange behavior if, for example, you read from
                   2523: .I yyin
                   2524: using stdio prior to calling the scanner (because the scanner will miss
                   2525: whatever text your previous reads left in the stdio input buffer).
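.IP
The pitfall can be reproduced outside of any scanner.
The following C sketch assumes a POSIX system and an ordinary buffered
stdio stream; the function name bytes_left_after_getc is made up for the
example.
It consumes one character through stdio and then asks the underlying file
descriptor how much input is left, which is what a scanner built with
read() would see.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical demonstration, not flex code: consume one character from
 * fp through stdio, then call read() directly on the same descriptor.
 * Returns the number of bytes read() obtained, or -1 on error or EOF.
 * Whatever stdio has already pulled into its own buffer is invisible to
 * the read() call. */
long bytes_left_after_getc(FILE *fp)
{
    char buf[64];

    if (getc(fp) == EOF)
        return -1;
    return (long)read(fileno(fp), buf, sizeof buf);
}
```

On a short file, stdio typically buffers more than the single character
requested, so the direct read() obtains fewer bytes than remain
unconsumed from the program's point of view.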
                   2526: .IP
                   2527: .B \-Cr
                   2528: has no effect if you define
                   2529: .B YY_INPUT
                   2530: (see The Generated Scanner above).
                   2531: .IP
                   2532: A lone
                   2533: .B \-C
                   2534: specifies that the scanner tables should be compressed but neither
                   2535: equivalence classes nor meta-equivalence classes should be used.
                   2536: .IP
                   2537: The options
                   2538: .B \-Cf
                   2539: or
                   2540: .B \-CF
                   2541: and
                   2542: .B \-Cm
                   2543: do not make sense together - there is no opportunity for meta-equivalence
                   2544: classes if the table is not being compressed.  Otherwise the options
                   2545: may be freely mixed, and are cumulative.
                   2546: .IP
                   2547: The default setting is
                   2548: .B \-Cem,
                   2549: which specifies that
                   2550: .I flex
                   2551: should generate equivalence classes
                   2552: and meta-equivalence classes.  This setting provides the highest
                   2553: degree of table compression.  You can trade table size
                   2554: for scanner speed, with
                   2555: the following generally being true:
                   2556: .nf
                   2557:
                   2558:     slowest & smallest
                   2559:           -Cem
                   2560:           -Cm
                   2561:           -Ce
                   2562:           -C
                   2563:           -C{f,F}e
                   2564:           -C{f,F}
                   2565:           -C{f,F}a
                   2566:     fastest & largest
                   2567:
                   2568: .fi
                   2569: Note that scanners with the smallest tables are usually generated and
                   2570: compiled the quickest, so
                   2571: during development you will usually want to use the default, maximal
                   2572: compression.
                   2573: .IP
                   2574: .B \-Cfe
                   2575: is often a good compromise between speed and size for production
                   2576: scanners.
                   2577: .TP
                   2578: .B \-ooutput
                   2579: directs flex to write the scanner to the file
                   2580: .B output
                   2581: instead of
                   2582: .B lex.yy.c.
                   2583: If you combine
                   2584: .B \-o
                   2585: with the
                   2586: .B \-t
                   2587: option, then the scanner is written to
                   2588: .I stdout
                   2589: but its
                   2590: .B #line
                   2591: directives (see the
                   2592: .B \-L
                   2593: option above) refer to the file
                   2594: .B output.
                   2595: .TP
                   2596: .B \-Pprefix
                   2597: changes the default
                   2598: .I "yy"
                   2599: prefix used by
                   2600: .I flex
1.6       aaron    2601: for all globally visible variable and function names to instead be
1.1       deraadt  2602: .I prefix.
                   2603: For example,
                   2604: .B \-Pfoo
                   2605: changes the name of
                   2606: .B yytext
                   2607: to
                   2608: .B footext.
                   2609: It also changes the name of the default output file from
                   2610: .B lex.yy.c
                   2611: to
                   2612: .B lex.foo.c.
                   2613: Here are all of the names affected:
                   2614: .nf
                   2615:
                   2616:     yy_create_buffer
                   2617:     yy_delete_buffer
                   2618:     yy_flex_debug
                   2619:     yy_init_buffer
                   2620:     yy_flush_buffer
                   2621:     yy_load_buffer_state
                   2622:     yy_switch_to_buffer
                   2623:     yyin
                   2624:     yyleng
                   2625:     yylex
                   2626:     yylineno
                   2627:     yyout
                   2628:     yyrestart
                   2629:     yytext
                   2630:     yywrap
                   2631:
                   2632: .fi
                   2633: (If you are using a C++ scanner, then only
                   2634: .B yywrap
                   2635: and
                   2636: .B yyFlexLexer
                   2637: are affected.)
                   2638: Within your scanner itself, you can still refer to the global variables
                   2639: and functions using either version of their name; but externally, they
                   2640: have the modified name.
                   2641: .IP
                   2642: This option lets you easily link together multiple
                   2643: .I flex
                   2644: programs into the same executable.  Note, though, that using this
                   2645: option also renames
                   2646: .B yywrap(),
                   2647: so you now
                   2648: .I must
                   2649: either
1.6       aaron    2650: provide your own (appropriately named) version of the routine for your
1.1       deraadt  2651: scanner, or use
                   2652: .B %option noyywrap,
                   2653: as linking with
                   2654: .B \-lfl
                   2655: no longer provides one for you by default.
                   2656: .TP
                   2657: .B \-Sskeleton_file
                   2658: overrides the default skeleton file from which
                   2659: .I flex
                   2660: constructs its scanners.  You'll never need this option unless you are doing
                   2661: .I flex
                   2662: maintenance or development.
                   2663: .PP
                   2664: .I flex
                   2665: also provides a mechanism for controlling options within the
                   2666: scanner specification itself, rather than from the flex command-line.
                   2667: This is done by including
                   2668: .B %option
                   2669: directives in the first section of the scanner specification.
                   2670: You can specify multiple options with a single
                   2671: .B %option
                   2672: directive, and multiple directives in the first section of your flex input
                   2673: file.
                   2674: .PP
                   2675: Most options are given simply as names, optionally preceded by the
                   2676: word "no" (with no intervening whitespace) to negate their meaning.
                   2677: A number are equivalent to flex flags or their negation:
                   2678: .nf
                   2679:
                   2680:     7bit            -7 option
                   2681:     8bit            -8 option
                   2682:     align           -Ca option
                   2683:     backup          -b option
                   2684:     batch           -B option
                   2685:     c++             -+ option
                   2686:
                   2687:     caseful or
                   2688:     case-sensitive  opposite of -i (default)
                   2689:
                   2690:     case-insensitive or
                   2691:     caseless        -i option
                   2692:
                   2693:     debug           -d option
                   2694:     default         opposite of -s option
                   2695:     ecs             -Ce option
                   2696:     fast            -F option
                   2697:     full            -f option
                   2698:     interactive     -I option
                   2699:     lex-compat      -l option
                   2700:     meta-ecs        -Cm option
                   2701:     perf-report     -p option
                   2702:     read            -Cr option
                   2703:     stdout          -t option
                   2704:     verbose         -v option
                   2705:     warn            opposite of -w option
                   2706:                     (use "%option nowarn" for -w)
                   2707:
                   2708:     array           equivalent to "%array"
                   2709:     pointer         equivalent to "%pointer" (default)
                   2710:
                   2711: .fi
                   2712: Some
                   2713: .B %option's
                   2714: provide features otherwise not available:
                   2715: .TP
                   2716: .B always-interactive
                   2717: instructs flex to generate a scanner which always considers its input
                   2718: "interactive".  Normally, on each new input file the scanner calls
                   2719: .B isatty()
                   2720: in an attempt to determine whether
                   2721: the scanner's input source is interactive and thus should be read a
                   2722: character at a time.  When this option is used, however, then no
                   2723: such call is made.
                   2724: .TP
                   2725: .B main
                   2726: directs flex to provide a default
                   2727: .B main()
                   2728: program for the scanner, which simply calls
                   2729: .B yylex().
                   2730: This option implies
                   2731: .B noyywrap
                   2732: (see below).
                   2733: .TP
                   2734: .B never-interactive
                   2735: instructs flex to generate a scanner which never considers its input
                   2736: "interactive" (again, no call made to
                   2737: .B isatty()).
                   2738: This is the opposite of
                   2739: .B always-interactive.
                   2740: .TP
                   2741: .B stack
                   2742: enables the use of start condition stacks (see Start Conditions above).
                   2743: .TP
                   2744: .B stdinit
                   2745: if set (i.e.,
                   2746: .B %option stdinit)
                   2747: initializes
                   2748: .I yyin
                   2749: and
                   2750: .I yyout
                   2751: to
                   2752: .I stdin
                   2753: and
                   2754: .I stdout,
                   2755: instead of the default of
                   2756: .I nil.
                   2757: Some existing
                   2758: .I lex
                   2759: programs depend on this behavior, even though it is not compliant with
                   2760: ANSI C, which does not require
                   2761: .I stdin
                   2762: and
                   2763: .I stdout
                   2764: to be compile-time constant.
                   2765: .TP
                   2766: .B yylineno
                   2767: directs
                   2768: .I flex
                   2769: to generate a scanner that maintains the number of the current line
                   2770: read from its input in the global variable
                   2771: .B yylineno.
                   2772: This option is implied by
                   2773: .B %option lex-compat.
                   2774: .TP
                   2775: .B yywrap
                   2776: if unset (i.e.,
                   2777: .B %option noyywrap),
                   2778: makes the scanner not call
                   2779: .B yywrap()
                   2780: upon an end-of-file, but simply assume that there are no more
                   2781: files to scan (until the user points
                   2782: .I yyin
                   2783: at a new file and calls
                   2784: .B yylex()
                   2785: again).
                   2786: .PP
                   2787: .I flex
                   2788: scans your rule actions to determine whether you use the
                   2789: .B REJECT
                   2790: or
                   2791: .B yymore()
                   2792: features.  The
                   2793: .B reject
                   2794: and
                   2795: .B yymore
                   2796: options are available to override its decision as to whether you use the
                   2797: options, either by setting them (e.g.,
                   2798: .B %option reject)
                   2799: to indicate the feature is indeed used, or
                   2800: unsetting them to indicate it actually is not used
                   2801: (e.g.,
                   2802: .B %option noyymore).
                   2803: .PP
                   2804: Three options take string-delimited values, offset with '=':
                   2805: .nf
                   2806:
                   2807:     %option outfile="ABC"
                   2808:
                   2809: .fi
                   2810: is equivalent to
                   2811: .B -oABC,
                   2812: and
                   2813: .nf
                   2814:
                   2815:     %option prefix="XYZ"
                   2816:
                   2817: .fi
                   2818: is equivalent to
                   2819: .B -PXYZ.
                   2820: Finally,
                   2821: .nf
                   2822:
                   2823:     %option yyclass="foo"
                   2824:
                   2825: .fi
                   2826: only applies when generating a C++ scanner (
                   2827: .B \-+
                   2828: option).  It informs
                   2829: .I flex
                   2830: that you have derived
                   2831: .B foo
                   2832: as a subclass of
                   2833: .B yyFlexLexer,
                   2834: so
                   2835: .I flex
                   2836: will place your actions in the member function
                   2837: .B foo::yylex()
                   2838: instead of
                   2839: .B yyFlexLexer::yylex().
                   2840: It also generates a
                   2841: .B yyFlexLexer::yylex()
                   2842: member function that emits a run-time error (by invoking
                   2843: .B yyFlexLexer::LexerError())
                   2844: if called.
                   2845: See Generating C++ Scanners, below, for additional information.
                   2846: .PP
                   2847: A number of options are available for lint purists who want to suppress
                   2848: the appearance of unneeded routines in the generated scanner.  Each of the
                   2849: following, if unset
                   2850: (e.g.,
                   2851: .B %option nounput
                   2852: ), results in the corresponding routine not appearing in
                   2853: the generated scanner:
                   2854: .nf
                   2855:
                   2856:     input, unput
                   2857:     yy_push_state, yy_pop_state, yy_top_state
                   2858:     yy_scan_buffer, yy_scan_bytes, yy_scan_string
                   2859:
                   2860: .fi
                   2861: (though
                   2862: .B yy_push_state()
                   2863: and friends won't appear anyway unless you use
                   2864: .B %option stack).
                   2865: .SH PERFORMANCE CONSIDERATIONS
                   2866: The main design goal of
                   2867: .I flex
                   2868: is that it generate high-performance scanners.  It has been optimized
                   2869: for dealing well with large sets of rules.  Aside from the effects on
                   2870: scanner speed of the table compression
                   2871: .B \-C
                   2872: options outlined above,
                   2873: there are a number of options/actions which degrade performance.  These
                   2874: are, from most expensive to least:
                   2875: .nf
                   2876:
                   2877:     REJECT
                   2878:     %option yylineno
                   2879:     arbitrary trailing context
                   2880:
                   2881:     pattern sets that require backing up
                   2882:     %array
                   2883:     %option interactive
                   2884:     %option always-interactive
                   2885:
                   2886:     '^' beginning-of-line operator
                   2887:     yymore()
                   2888:
                   2889: .fi
                   2890: with the first three all being quite expensive and the last two
                   2891: being quite cheap.  Note also that
                   2892: .B unput()
                   2893: is implemented as a routine call that potentially does quite a bit of
                   2894: work, while
                   2895: .B yyless()
                   2896: is a quite-cheap macro; so if you are just putting back some excess text you
                   2897: scanned, use
                   2898: .B yyless().
                   2899: .PP
                   2900: .B REJECT
                   2901: should be avoided at all costs when performance is important.
                   2902: It is a particularly expensive option.
                   2903: .PP
                   2904: Getting rid of backing up is messy and often may be an enormous
                   2905: amount of work for a complicated scanner.  In principle, one begins
                   2906: by using the
1.7       aaron    2907: .B \-b
1.1       deraadt  2908: flag to generate a
                   2909: .I lex.backup
                   2910: file.  For example, on the input
                   2911: .nf
                   2912:
                   2913:     %%
                   2914:     foo        return TOK_KEYWORD;
                   2915:     foobar     return TOK_KEYWORD;
                   2916:
                   2917: .fi
                   2918: the file looks like:
                   2919: .nf
                   2920:
                   2921:     State #6 is non-accepting -
                   2922:      associated rule line numbers:
                   2923:            2       3
                   2924:      out-transitions: [ o ]
                   2925:      jam-transitions: EOF [ \\001-n  p-\\177 ]
                   2926:
                   2927:     State #8 is non-accepting -
                   2928:      associated rule line numbers:
                   2929:            3
                   2930:      out-transitions: [ a ]
                   2931:      jam-transitions: EOF [ \\001-`  b-\\177 ]
                   2932:
                   2933:     State #9 is non-accepting -
                   2934:      associated rule line numbers:
                   2935:            3
                   2936:      out-transitions: [ r ]
                   2937:      jam-transitions: EOF [ \\001-q  s-\\177 ]
                   2938:
                   2939:     Compressed tables always back up.
                   2940:
                   2941: .fi
                   2942: The first few lines tell us that there's a scanner state in
                   2943: which it can make a transition on an 'o' but not on any other
                   2944: character, and that in that state the currently scanned text does not match
                   2945: any rule.  The state occurs when trying to match the rules found
                   2946: at lines 2 and 3 in the input file.
                   2947: If the scanner is in that state and then reads
                   2948: something other than an 'o', it will have to back up to find
                   2949: a rule which is matched.  With
                   2950: a bit of headscratching one can see that this must be the
                   2951: state it's in when it has seen "fo".  When this has happened,
                   2952: if anything other than another 'o' is seen, the scanner will
                   2953: have to back up to simply match the 'f' (by the default rule).
                   2954: .PP
                   2955: The comment regarding State #8 indicates there's a problem
                   2956: when "foob" has been scanned.  Indeed, on any character other
                   2957: than an 'a', the scanner will have to back up to accept "foo".
                   2958: Similarly, the comment for State #9 concerns when "fooba" has
                   2959: been scanned and an 'r' does not follow.
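.PP
The three states can be traced with a hand-written stand-in for the
matching loop.
The C sketch below is not
flex
output; the function name match_token and its interface are assumptions
made for illustration.
It handles only the rules "foo" and "foobar" plus the default
single-character rule, and reports how far the scan ran past the last
accepting position.

```c
#include <stddef.h>

/* Hypothetical stand-in for the scanner's inner loop, not flex output.
 * Returns the length of the token found at s and stores in *backed_up
 * the number of characters scanned beyond the last accepting position
 * (0 means no backing up was needed). */
size_t match_token(const char *s, size_t *backed_up)
{
    static const char longest[] = "foobar";
    size_t pos = 0;          /* characters scanned so far */
    size_t last_accept;      /* length of the longest match seen so far */

    *backed_up = 0;
    if (s[0] == '\0')
        return 0;            /* end of input */
    last_accept = 1;         /* default rule accepts any single character */

    while (longest[pos] != '\0' && s[pos] == longest[pos]) {
        pos++;
        if (pos == 3 || pos == 6)   /* "foo" and "foobar" are accepting */
            last_accept = pos;
    }
    if (pos > last_accept)
        *backed_up = pos - last_accept;
    return last_accept;
}
```

With the input "fox" the loop stops after "fo" and gives one character
back, so only the 'f' is matched; with "fooba" it gives back two
characters and accepts "foo".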
                   2960: .PP
                   2961: The final comment reminds us that there's no point going to
                   2962: all the trouble of removing backing up from the rules unless
                   2963: we're using
                   2964: .B \-Cf
                   2965: or
                   2966: .B \-CF,
                   2967: since there's no performance gain doing so with compressed scanners.
                   2968: .PP
                   2969: The way to remove the backing up is to add "error" rules:
                   2970: .nf
                   2971:
                   2972:     %%
                   2973:     foo         return TOK_KEYWORD;
                   2974:     foobar      return TOK_KEYWORD;
                   2975:
                   2976:     fooba       |
                   2977:     foob        |
                   2978:     fo          {
                   2979:                 /* false alarm, not really a keyword */
                   2980:                 return TOK_ID;
                   2981:                 }
                   2982:
                   2983: .fi
                   2984: .PP
                   2985: Eliminating backing up among a list of keywords can also be
                   2986: done using a "catch-all" rule:
                   2987: .nf
                   2988:
                   2989:     %%
                   2990:     foo         return TOK_KEYWORD;
                   2991:     foobar      return TOK_KEYWORD;
                   2992:
                   2993:     [a-z]+      return TOK_ID;
                   2994:
                   2995: .fi
                   2996: This is usually the best solution when appropriate.
                   2997: .PP
                   2998: Backing up messages tend to cascade.
                   2999: With a complicated set of rules it's not uncommon to get hundreds
                   3000: of messages.  If one can decipher them, though, it often
                   3001: only takes a dozen or so rules to eliminate the backing up, though
                   3002: it's easy to make a mistake and have an error rule accidentally match
                   3003: a valid token.  A possible future
                   3004: .I flex
                   3005: feature will be to automatically add rules to eliminate backing up.
                   3006: .PP
                   3007: It's important to keep in mind that you gain the benefits of eliminating
                   3008: backing up only if you eliminate
                   3009: .I every
                   3010: instance of backing up.  Leaving just one means you gain nothing.
                   3011: .PP
                   3012: .I Variable
                   3013: trailing context (where both the leading and trailing parts do not have
                   3014: a fixed length) entails almost the same performance loss as
                   3015: .B REJECT
                   3016: (i.e., substantial).  So when possible a rule like:
                   3017: .nf
                   3018:
                   3019:     %%
                   3020:     mouse|rat/(cat|dog)   run();
                   3021:
                   3022: .fi
                   3023: is better written:
                   3024: .nf
                   3025:
                   3026:     %%
                   3027:     mouse/cat|dog         run();
                   3028:     rat/cat|dog           run();
                   3029:
                   3030: .fi
                   3031: or as
                   3032: .nf
                   3033:
                   3034:     %%
                   3035:     mouse|rat/cat         run();
                   3036:     mouse|rat/dog         run();
                   3037:
                   3038: .fi
                   3039: Note that here the special '|' action does
                   3040: .I not
                   3041: provide any savings, and can even make things worse (see
                   3042: Deficiencies / Bugs below).
                   3043: .LP
                   3044: Another area where the user can increase a scanner's performance
                   3045: (and one that's easier to implement) arises from the fact that
                   3046: the longer the tokens matched, the faster the scanner will run.
                   3047: This is because with long tokens the processing of most input
                   3048: characters takes place in the (short) inner scanning loop, and
                   3049: does not often have to go through the additional work of setting up
                   3050: the scanning environment (e.g.,
                   3051: .B yytext)
                   3052: for the action.  Recall the scanner for C comments:
                   3053: .nf
                   3054:
                   3055:     %x comment
                   3056:     %%
                   3057:             int line_num = 1;
                   3058:
                   3059:     "/*"         BEGIN(comment);
                   3060:
                   3061:     <comment>[^*\\n]*
                   3062:     <comment>"*"+[^*/\\n]*
                   3063:     <comment>\\n             ++line_num;
                   3064:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3065:
                   3066: .fi
                   3067: This could be sped up by writing it as:
                   3068: .nf
                   3069:
                   3070:     %x comment
                   3071:     %%
                   3072:             int line_num = 1;
                   3073:
                   3074:     "/*"         BEGIN(comment);
                   3075:
                   3076:     <comment>[^*\\n]*
                   3077:     <comment>[^*\\n]*\\n      ++line_num;
                   3078:     <comment>"*"+[^*/\\n]*
                   3079:     <comment>"*"+[^*/\\n]*\\n ++line_num;
                   3080:     <comment>"*"+"/"        BEGIN(INITIAL);
                   3081:
                   3082: .fi
                   3083: Now instead of each newline requiring the processing of another
                   3084: action, recognizing the newlines is "distributed" over the other rules
                   3085: to keep the matched text as long as possible.  Note that
                   3086: .I adding
                   3087: rules does
                   3088: .I not
                   3089: slow down the scanner!  The speed of the scanner is independent
                   3090: of the number of rules or (modulo the considerations given at the
                   3091: beginning of this section) how complicated the rules are with
                   3092: regard to operators such as '*' and '|'.
                   3093: .PP
                   3094: A final example in speeding up a scanner: suppose you want to scan
                   3095: through a file containing identifiers and keywords, one per line
                   3096: and with no other extraneous characters, and recognize all the
                   3097: keywords.  A natural first approach is:
                   3098: .nf
                   3099:
                   3100:     %%
                   3101:     asm      |
                   3102:     auto     |
                   3103:     break    |
                   3104:     ... etc ...
                   3105:     volatile |
                   3106:     while    /* it's a keyword */
                   3107:
                   3108:     .|\\n     /* it's not a keyword */
                   3109:
                   3110: .fi
                   3111: To eliminate the backing up, introduce a catch-all rule:
                   3112: .nf
                   3113:
                   3114:     %%
                   3115:     asm      |
                   3116:     auto     |
                   3117:     break    |
                   3118:     ... etc ...
                   3119:     volatile |
                   3120:     while    /* it's a keyword */
                   3121:
                   3122:     [a-z]+   |
                   3123:     .|\\n     /* it's not a keyword */
                   3124:
                   3125: .fi
                   3126: Now, if it's guaranteed that there's exactly one word per line,
                   3127: then we can reduce the total number of matches by half by
                   3128: merging in the recognition of newlines with that of the other
                   3129: tokens:
                   3130: .nf
                   3131:
                   3132:     %%
                   3133:     asm\\n    |
                   3134:     auto\\n   |
                   3135:     break\\n  |
                   3136:     ... etc ...
                   3137:     volatile\\n |
                   3138:     while\\n  /* it's a keyword */
                   3139:
                   3140:     [a-z]+\\n |
                   3141:     .|\\n     /* it's not a keyword */
                   3142:
                   3143: .fi
                   3144: One has to be careful here, as we have now reintroduced backing up
                   3145: into the scanner.  In particular, while
                   3146: .I we
                   3147: know that there will never be any characters in the input stream
                   3148: other than letters or newlines,
                   3149: .I flex
                   3150: can't figure this out, and it will plan for possibly needing to back up
                   3151: when it has scanned a token like "auto" and then the next character
                   3152: is something other than a newline or a letter.  Previously it would
                   3153: then just match the "auto" rule and be done, but now it has no "auto"
1.10      deraadt  3154: rule, only an "auto\\n" rule.  To eliminate the possibility of backing up,
1.1       deraadt  3155: we could either duplicate all rules but without final newlines, or,
                   3156: since we never expect to encounter such an input and therefore don't
                   3157: care how it's classified, we can introduce one more catch-all rule, this
                   3158: one which doesn't include a newline:
                   3159: .nf
                   3160:
                   3161:     %%
                   3162:     asm\\n    |
                   3163:     auto\\n   |
                   3164:     break\\n  |
                   3165:     ... etc ...
                   3166:     volatile\\n |
                   3167:     while\\n  /* it's a keyword */
                   3168:
                   3169:     [a-z]+\\n |
                   3170:     [a-z]+   |
                   3171:     .|\\n     /* it's not a keyword */
                   3172:
                   3173: .fi
                   3174: Compiled with
                   3175: .B \-Cf,
                   3176: this is about as fast as one can get a
1.7       aaron    3177: .I flex
1.1       deraadt  3178: scanner to go for this particular problem.
                   3179: .PP
                   3180: A final note:
                   3181: .I flex
                   3182: is slow when matching NUL's, particularly when a token contains
                   3183: multiple NUL's.
                   3184: It's best to write rules which match
                   3185: .I short
                   3186: amounts of text if it's anticipated that the text will often include NUL's.
                   3187: .PP
                   3188: Another final note regarding performance: as mentioned above in the section
                   3189: How the Input is Matched, dynamically resizing
                   3190: .B yytext
                   3191: to accommodate huge tokens is a slow process because it presently requires that
                   3192: the (huge) token be rescanned from the beginning.  Thus if performance is
                   3193: vital, you should attempt to match "large" quantities of text but not
                   3194: "huge" quantities, where the cutoff between the two is at about 8K
                   3195: characters/token.
                   3196: .SH GENERATING C++ SCANNERS
                   3197: .I flex
                   3198: provides two different ways to generate scanners for use with C++.  The
                   3199: first way is to simply compile a scanner generated by
                   3200: .I flex
                   3201: using a C++ compiler instead of a C compiler.  You should not encounter
1.10      deraadt  3202: any compilation errors (please report any you find to the email address
1.1       deraadt  3203: given in the Author section below).  You can then use C++ code in your
                   3204: rule actions instead of C code.  Note that the default input source for
                   3205: your scanner remains
                   3206: .I yyin,
                   3207: and default echoing is still done to
                   3208: .I yyout.
                   3209: Both of these remain
                   3210: .I FILE *
                   3211: variables and not C++
                   3212: .I streams.
                   3213: .PP
                   3214: You can also use
                   3215: .I flex
                   3216: to generate a C++ scanner class, using the
                   3217: .B \-+
                   3218: option (or, equivalently,
                   3219: .B %option c++),
                   3220: which is automatically specified if the name of the flex
                   3221: executable ends in a '+', such as
                   3222: .I flex++.
                   3223: When using this option, flex defaults to generating the scanner to the file
                   3224: .B lex.yy.cc
                   3225: instead of
                   3226: .B lex.yy.c.
                   3227: The generated scanner includes the header file
1.5       deraadt  3228: .I g++/FlexLexer.h,
1.1       deraadt  3229: which defines the interface to two C++ classes.
                   3230: .PP
                   3231: The first class,
                   3232: .B FlexLexer,
                   3233: provides an abstract base class defining the general scanner class
                   3234: interface.  It provides the following member functions:
                   3235: .TP
                   3236: .B const char* YYText()
                   3237: returns the text of the most recently matched token, the equivalent of
                   3238: .B yytext.
                   3239: .TP
                   3240: .B int YYLeng()
                   3241: returns the length of the most recently matched token, the equivalent of
                   3242: .B yyleng.
                   3243: .TP
                   3244: .B int lineno() const
                   3245: returns the current input line number
                   3246: (see
                   3247: .B %option yylineno),
                   3248: or
                   3249: .B 1
                   3250: if
                   3251: .B %option yylineno
                   3252: was not used.
                   3253: .TP
                   3254: .B void set_debug( int flag )
                   3255: sets the debugging flag for the scanner, equivalent to assigning to
                   3256: .B yy_flex_debug
                   3257: (see the Options section above).  Note that you must build the scanner
                   3258: using
                   3259: .B %option debug
                   3260: to include debugging information in it.
                   3261: .TP
                   3262: .B int debug() const
                   3263: returns the current setting of the debugging flag.
                   3264: .PP
                   3265: Also provided are member functions equivalent to
                   3266: .B yy_switch_to_buffer(),
                   3267: .B yy_create_buffer()
                   3268: (though the first argument is an
                   3269: .B istream*
                   3270: object pointer and not a
                   3271: .B FILE*),
                   3272: .B yy_flush_buffer(),
                   3273: .B yy_delete_buffer(),
                   3274: and
                   3275: .B yyrestart()
1.10      deraadt  3276: (again, the first argument is an
1.1       deraadt  3277: .B istream*
                   3278: object pointer).
                   3279: .PP
                   3280: The second class defined in
1.5       deraadt  3281: .I g++/FlexLexer.h
1.1       deraadt  3282: is
                   3283: .B yyFlexLexer,
                   3284: which is derived from
                   3285: .B FlexLexer.
                   3286: It defines the following additional member functions:
                   3287: .TP
                   3288: .B
                   3289: yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
                   3290: constructs a
                   3291: .B yyFlexLexer
                   3292: object using the given streams for input and output.  If not specified,
                   3293: the streams default to
                   3294: .B cin
                   3295: and
                   3296: .B cout,
                   3297: respectively.
                   3298: .TP
                   3299: .B virtual int yylex()
1.10      deraadt  3300: performs the same role as
1.1       deraadt  3301: .B yylex()
                   3302: does for ordinary flex scanners: it scans the input stream, consuming
                   3303: tokens, until a rule's action returns a value.  If you derive a subclass
                   3304: .B S
                   3305: from
                   3306: .B yyFlexLexer
                   3307: and want to access the member functions and variables of
                   3308: .B S
                   3309: inside
                   3310: .B yylex(),
                   3311: then you need to use
                   3312: .B %option yyclass="S"
                   3313: to inform
                   3314: .I flex
                   3315: that you will be using that subclass instead of
                   3316: .B yyFlexLexer.
                   3317: In this case, rather than generating
                   3318: .B yyFlexLexer::yylex(),
                   3319: .I flex
                   3320: generates
                   3321: .B S::yylex()
                   3322: (and also generates a dummy
                   3323: .B yyFlexLexer::yylex()
                   3324: that calls
                   3325: .B yyFlexLexer::LexerError()
                   3326: if called).
                   3327: .TP
                   3328: .B
                   3329: virtual void switch_streams(istream* new_in = 0,
                   3330: .B
                   3331: ostream* new_out = 0)
                   3332: reassigns
                   3333: .B yyin
                   3334: to
                   3335: .B new_in
                   3336: (if non-null)
                   3337: and
                   3338: .B yyout
                   3339: to
                   3340: .B new_out
                   3341: (ditto), deleting the previous input buffer if
                   3342: .B yyin
                   3343: is reassigned.
                   3344: .TP
                   3345: .B
                   3346: int yylex( istream* new_in, ostream* new_out = 0 )
                   3347: first switches the input streams via
                   3348: .B switch_streams( new_in, new_out )
                   3349: and then returns the value of
                   3350: .B yylex().
                   3351: .PP
                   3352: In addition,
                   3353: .B yyFlexLexer
                   3354: defines the following protected virtual functions which you can redefine
                   3355: in derived classes to tailor the scanner:
                   3356: .TP
                   3357: .B
                   3358: virtual int LexerInput( char* buf, int max_size )
                   3359: reads up to
                   3360: .B max_size
                   3361: characters into
                   3362: .B buf
                   3363: and returns the number of characters read.  To indicate end-of-input,
                   3364: return 0 characters.  Note that "interactive" scanners (see the
                   3365: .B \-B
                   3366: and
                   3367: .B \-I
                   3368: flags) define the macro
                   3369: .B YY_INTERACTIVE.
                   3370: If you redefine
                   3371: .B LexerInput()
                   3372: and need to take different actions depending on whether or not
                   3373: the scanner might be scanning an interactive input source, you can
                   3374: test for the presence of this name via
                   3375: .B #ifdef.
                   3376: .TP
                   3377: .B
                   3378: virtual void LexerOutput( const char* buf, int size )
                   3379: writes out
                   3380: .B size
                   3381: characters from the buffer
                   3382: .B buf,
                   3383: which, while NUL-terminated, may also contain "internal" NUL's if
                   3384: the scanner's rules can match text with NUL's in them.
                   3385: .TP
                   3386: .B
                   3387: virtual void LexerError( const char* msg )
                   3388: reports a fatal error message.  The default version of this function
                   3389: writes the message to the stream
                   3390: .B cerr
                   3391: and exits.
                   3392: .PP
                   3393: Note that a
                   3394: .B yyFlexLexer
                   3395: object contains its
                   3396: .I entire
                   3397: scanning state.  Thus you can use such objects to create reentrant
                   3398: scanners.  You can instantiate multiple instances of the same
                   3399: .B yyFlexLexer
                   3400: class, and you can also combine multiple C++ scanner classes together
                   3401: in the same program using the
                   3402: .B \-P
                   3403: option discussed above.
                   3404: .PP
                   3405: Finally, note that the
                   3406: .B %array
                   3407: feature is not available to C++ scanner classes; you must use
                   3408: .B %pointer
                   3409: (the default).
                   3410: .PP
                   3411: Here is an example of a simple C++ scanner:
                   3412: .nf
                   3413:
                   3414:         // An example of using the flex C++ scanner class.
                   3415:
                   3416:     %{
                   3417:     int mylineno = 0;
                   3418:     %}
                   3419:
                   3420:     string  \\"[^\\n"]+\\"
                   3421:
                   3422:     ws      [ \\t]+
                   3423:
                   3424:     alpha   [A-Za-z]
                   3425:     dig     [0-9]
                   3426:     name    ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
                   3427:     num1    [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
                   3428:     num2    [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
                   3429:     number  {num1}|{num2}
                   3430:
                   3431:     %%
                   3432:
                   3433:     {ws}    /* skip blanks and tabs */
                   3434:
                   3435:     "/*"    {
                   3436:             int c;
                   3437:
                   3438:             while((c = yyinput()) != 0)
                   3439:                 {
                   3440:                 if(c == '\\n')
                   3441:                     ++mylineno;
                   3442:
                   3443:                 else if(c == '*')
                   3444:                     {
                   3445:                     if((c = yyinput()) == '/')
                   3446:                         break;
                   3447:                     else
                   3448:                         unput(c);
                   3449:                     }
                   3450:                 }
                   3451:             }
                   3452:
                   3453:     {number}  cout << "number " << YYText() << '\\n';
                   3454:
                   3455:     \\n        mylineno++;
                   3456:
                   3457:     {name}    cout << "name " << YYText() << '\\n';
                   3458:
                   3459:     {string}  cout << "string " << YYText() << '\\n';
                   3460:
                   3461:     %%
                   3462:
                   3463:     int main( int /* argc */, char** /* argv */ )
                   3464:         {
                   3465:         FlexLexer* lexer = new yyFlexLexer;
                   3466:         while(lexer->yylex() != 0)
                   3467:             ;
                   3468:         return 0;
                   3469:         }
                   3470: .fi
                   3471: If you want to create multiple (different) lexer classes, you use the
                   3472: .B \-P
                   3473: flag (or the
                   3474: .B prefix=
                   3475: option) to rename each
                   3476: .B yyFlexLexer
                   3477: to some other
                   3478: .B xxFlexLexer.
                   3479: You then can include
1.5       deraadt  3480: .B <g++/FlexLexer.h>
1.1       deraadt  3481: in your other sources once per lexer class, first renaming
                   3482: .B yyFlexLexer
                   3483: as follows:
                   3484: .nf
                   3485:
                   3486:     #undef yyFlexLexer
                   3487:     #define yyFlexLexer xxFlexLexer
1.5       deraadt  3488:     #include <g++/FlexLexer.h>
1.1       deraadt  3489:
                   3490:     #undef yyFlexLexer
                   3491:     #define yyFlexLexer zzFlexLexer
1.5       deraadt  3492:     #include <g++/FlexLexer.h>
1.1       deraadt  3493:
                   3494: .fi
                   3495: if, for example, you used
                   3496: .B %option prefix="xx"
                   3497: for one of your scanners and
                   3498: .B %option prefix="zz"
                   3499: for the other.
                   3500: .PP
                   3501: IMPORTANT: the present form of the scanning class is
                   3502: .I experimental
1.7       aaron    3503: and may change considerably between major releases.
1.1       deraadt  3504: .SH INCOMPATIBILITIES WITH LEX AND POSIX
                   3505: .I flex
                   3506: is a rewrite of the AT&T Unix
                   3507: .I lex
                   3508: tool (the two implementations do not share any code, though),
                   3509: with some extensions and incompatibilities, both of which
                   3510: are of concern to those who wish to write scanners acceptable
                   3511: to either implementation.  Flex is fully compliant with the POSIX
                   3512: .I lex
                   3513: specification, except that when using
                   3514: .B %pointer
                   3515: (the default), a call to
                   3516: .B unput()
                   3517: destroys the contents of
                   3518: .B yytext,
                   3519: which is counter to the POSIX specification.
                   3520: .PP
                   3521: In this section we discuss all of the known areas of incompatibility
                   3522: between flex, AT&T lex, and the POSIX specification.
                   3523: .PP
                   3524: .I flex's
                   3525: .B \-l
                   3526: option turns on maximum compatibility with the original AT&T
                   3527: .I lex
                   3528: implementation, at the cost of a major loss in the generated scanner's
                   3529: performance.  We note below which incompatibilities can be overcome
                   3530: using the
                   3531: .B \-l
                   3532: option.
                   3533: .PP
                   3534: .I flex
                   3535: is fully compatible with
                   3536: .I lex
                   3537: with the following exceptions:
                   3538: .IP -
                   3539: The undocumented
                   3540: .I lex
                   3541: scanner internal variable
                   3542: .B yylineno
                   3543: is not supported unless
                   3544: .B \-l
                   3545: or
                   3546: .B %option yylineno
                   3547: is used.
                   3548: .IP
                   3549: .B yylineno
                   3550: should be maintained on a per-buffer basis, rather than a per-scanner
                   3551: (single global variable) basis.
                   3552: .IP
                   3553: .B yylineno
                   3554: is not part of the POSIX specification.
                   3555: .IP -
                   3556: The
                   3557: .B input()
                   3558: routine is not redefinable, though it may be called to read characters
                   3559: following whatever has been matched by a rule.  If
                   3560: .B input()
                   3561: encounters an end-of-file the normal
                   3562: .B yywrap()
                   3563: processing is done.  A ``real'' end-of-file is returned by
                   3564: .B input()
                   3565: as
                   3566: .I EOF.
                   3567: .IP
                   3568: Input is instead controlled by defining the
                   3569: .B YY_INPUT
                   3570: macro.
                   3571: .IP
                   3572: The
                   3573: .I flex
                   3574: restriction that
                   3575: .B input()
                   3576: cannot be redefined is in accordance with the POSIX specification,
                   3577: which simply does not specify any way of controlling the
                   3578: scanner's input other than by making an initial assignment to
                   3579: .I yyin.
                   3580: .IP -
                   3581: The
                   3582: .B unput()
                   3583: routine is not redefinable.  This restriction is in accordance with POSIX.
                   3584: .IP -
                   3585: .I flex
                   3586: scanners are not as reentrant as
                   3587: .I lex
                   3588: scanners.  In particular, if you have an interactive scanner and
                   3589: an interrupt handler which long-jumps out of the scanner, and
                   3590: the scanner is subsequently called again, you may get the following
                   3591: message:
                   3592: .nf
                   3593:
                   3594:     fatal flex scanner internal error--end of buffer missed
                   3595:
                   3596: .fi
                   3597: To reenter the scanner, first use
                   3598: .nf
                   3599:
                   3600:     yyrestart( yyin );
                   3601:
                   3602: .fi
                   3603: Note that this call will throw away any buffered input; usually this
                   3604: isn't a problem with an interactive scanner.
                   3605: .IP
                   3606: Also note that flex C++ scanner classes
                   3607: .I are
                   3608: reentrant, so if using C++ is an option for you, you should use
                   3609: them instead.  See "Generating C++ Scanners" above for details.
                   3610: .IP -
                   3611: .B output()
                   3612: is not supported.
                   3613: Output from the
                   3614: .B ECHO
                   3615: macro is done to the file-pointer
                   3616: .I yyout
                   3617: (default
                   3618: .I stdout).
                   3619: .IP
                   3620: .B output()
                   3621: is not part of the POSIX specification.
                   3622: .IP -
                   3623: .I lex
                   3624: does not support exclusive start conditions (%x), though they
                   3625: are in the POSIX specification.
                   3626: .IP -
                   3627: When definitions are expanded,
                   3628: .I flex
                   3629: encloses them in parentheses.
                   3630: With lex, the following:
                   3631: .nf
                   3632:
                   3633:     NAME    [A-Z][A-Z0-9]*
                   3634:     %%
                   3635:     foo{NAME}?      printf( "Found it\\n" );
                   3636:     %%
                   3637:
                   3638: .fi
                   3639: will not match the string "foo" because when the macro
                   3640: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
                   3641: and the precedence is such that the '?' is associated with
                   3642: "[A-Z0-9]*".  With
                   3643: .I flex,
                   3644: the rule will be expanded to
                   3645: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
                   3646: .IP
                   3647: Note that if the definition begins with
                   3648: .B ^
                   3649: or ends with
                   3650: .B $
                   3651: then it is
                   3652: .I not
                   3653: expanded with parentheses, to allow these operators to appear in
                   3654: definitions without losing their special meanings.  But the
                   3655: .B <s>, /,
                   3656: and
                   3657: .B <<EOF>>
                   3658: operators cannot be used in a
                   3659: .I flex
                   3660: definition.
                   3661: .IP
                   3662: Using
                   3663: .B \-l
                   3664: results in the
                   3665: .I lex
                   3666: behavior of no parentheses around the definition.
                   3667: .IP
                   3668: The POSIX specification is that the definition be enclosed in parentheses.
                   3669: .IP -
                   3670: Some implementations of
                   3671: .I lex
                   3672: allow a rule's action to begin on a separate line, if the rule's pattern
                   3673: has trailing whitespace:
                   3674: .nf
                   3675:
                   3676:     %%
                   3677:     foo|bar<space here>
                   3678:       { foobar_action(); }
                   3679:
                   3680: .fi
                   3681: .I flex
                   3682: does not support this feature.
                   3683: .IP -
                   3684: The
                   3685: .I lex
                   3686: .B %r
                   3687: (generate a Ratfor scanner) option is not supported.  It is not part
                   3688: of the POSIX specification.
                   3689: .IP -
                   3690: After a call to
                   3691: .B unput(),
                   3692: .I yytext
                   3693: is undefined until the next token is matched, unless the scanner
                   3694: was built using
                   3695: .B %array.
                   3696: This is not the case with
                   3697: .I lex
                   3698: or the POSIX specification.  The
                   3699: .B \-l
                   3700: option does away with this incompatibility.
                   3701: .IP -
                   3702: The precedence of the
                   3703: .B {}
                   3704: (numeric range) operator is different.
                   3705: .I lex
                   3706: interprets "abc{1,3}" as "match one, two, or
                   3707: three occurrences of 'abc'", whereas
                   3708: .I flex
                   3709: interprets it as "match 'ab'
                   3710: followed by one, two, or three occurrences of 'c'".  The latter is
                   3711: in agreement with the POSIX specification.
                   3712: .IP -
                   3713: The precedence of the
                   3714: .B ^
                   3715: operator is different.
                   3716: .I lex
                   3717: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
                   3718: or 'bar' anywhere", whereas
                   3719: .I flex
                   3720: interprets it as "match either 'foo' or 'bar' if they come at the beginning
                   3721: of a line".  The latter is in agreement with the POSIX specification.
                   3722: .IP -
                   3723: The special table-size declarations such as
                   3724: .B %a
                   3725: supported by
                   3726: .I lex
                   3727: are not required by
                   3728: .I flex
                   3729: scanners;
                   3730: .I flex
                   3731: ignores them.
                   3732: .IP -
                   3733: The name
                   3734: .B
                   3735: FLEX_SCANNER
                   3736: is #define'd so scanners may be written for use with either
                   3737: .I flex
                   3738: or
                   3739: .I lex.
                   3740: Scanners also include
                   3741: .B YY_FLEX_MAJOR_VERSION
                   3742: and
                   3743: .B YY_FLEX_MINOR_VERSION
                   3744: indicating which version of
                   3745: .I flex
                   3746: generated the scanner
                   3747: (for example, for the 2.5 release, these defines would be 2 and 5
                   3748: respectively).
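.PP
The three grouping differences listed above (definition expansion, the {}
operator, and the ^ operator) can be checked outside of lex.
The following is only a sketch: it uses the C++ standard library's std::regex
(ECMAScript syntax) as a stand-in for a lex pattern matcher, and it spells out
each interpretation with explicit parentheses.
The helper names matches_whole and matches_somewhere are hypothetical and are
not part of flex.
.nf

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
bool matches_whole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true if `pattern` matches somewhere in `text`.
// With std::regex defaults, ^ anchors at the start of the string only,
// so each test string stands for a single input line.
bool matches_somewhere(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}

// foo{NAME}? with NAME defined as [A-Z][A-Z0-9]*
const char* const name_flex = "foo([A-Z][A-Z0-9]*)?";  // definition parenthesized
const char* const name_lex  = "foo[A-Z]([A-Z0-9]*)?";  // ? applies to [A-Z0-9]* only

// abc{1,3}
const char* const range_flex = "ab(c){1,3}";  // range applies to 'c'
const char* const range_lex  = "(abc){1,3}";  // range applies to 'abc'

// ^foo|bar
const char* const bol_flex = "^(foo|bar)";    // both alternatives anchored
const char* const bol_lex  = "(^foo)|bar";    // only 'foo' anchored
```

.fi
Under these assumptions, matches_whole(name_flex, "foo") is true while
matches_whole(name_lex, "foo") is false, which mirrors the "foo" example
given for definition expansion.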
                   3749: .PP
                   3750: The following
                   3751: .I flex
                   3752: features are not included in
                   3753: .I lex
                   3754: or the POSIX specification:
                   3755: .nf
                   3756:
                   3757:     C++ scanners
                   3758:     %option
                   3759:     start condition scopes
                   3760:     start condition stacks
                   3761:     interactive/non-interactive scanners
                   3762:     yy_scan_string() and friends
                   3763:     yyterminate()
                   3764:     yy_set_interactive()
                   3765:     yy_set_bol()
                   3766:     YY_AT_BOL()
                   3767:     <<EOF>>
                   3768:     <*>
                   3769:     YY_DECL
                   3770:     YY_START
                   3771:     YY_USER_ACTION
                   3772:     YY_USER_INIT
                   3773:     #line directives
                   3774:     %{}'s around actions
                   3775:     multiple actions on a line
                   3776:
                   3777: .fi
                   3778: plus almost all of the flex flags.
                   3779: The last feature in the list refers to the fact that with
                   3780: .I flex
                   3781: you can put multiple actions on the same line, separated with
                   3782: semi-colons, while with
                   3783: .I lex,
                   3784: the following
                   3785: .nf
                   3786:
                   3787:     foo    handle_foo(); ++num_foos_seen;
                   3788:
                   3789: .fi
                   3790: is (rather surprisingly) truncated to
                   3791: .nf
                   3792:
                   3793:     foo    handle_foo();
                   3794:
                   3795: .fi
                   3796: .I flex
                   3797: does not truncate the action.  Actions that are not enclosed in
                   3798: braces are simply terminated at the end of the line.
                   3799: .SH DIAGNOSTICS
                   3800: .PP
                   3801: .I warning, rule cannot be matched
                   3802: indicates that the given rule
                   3803: cannot be matched because it follows other rules that will
                   3804: always match the same text as it.  For
                   3805: example, in the following "foo" cannot be matched because it comes after
                   3806: an identifier "catch-all" rule:
                   3807: .nf
                   3808:
                   3809:     [a-z]+    got_identifier();
                   3810:     foo       got_foo();
                   3811:
                   3812: .fi
                   3813: Using
                   3814: .B REJECT
                   3815: in a scanner suppresses this warning.
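.PP
The warning follows from how a match is chosen: the longest match wins,
and when two rules match the same amount of text the rule listed first wins.
The stand-alone C model below is only a sketch of that selection, not
.I flex's
own code; the names
.B ident_len(),
.B foo_len()
and
.B pick_rule()
are hypothetical and exist only for this illustration.
.nf

```c
    #include <ctype.h>
    #include <string.h>

    /* Length matched by the model rule [a-z]+ at the start of s. */
    size_t ident_len(const char *s)
    {
        size_t n = 0;

        while (islower((unsigned char)s[n]))
            n++;
        return n;
    }

    /* Length matched by the model rule foo at the start of s (0 or 3). */
    size_t foo_len(const char *s)
    {
        return strncmp(s, "foo", 3) == 0 ? 3 : 0;
    }

    /*
     * Return the index of the winning rule (0 or 1), or -1 if neither
     * rule matches.  The longest match wins; on a tie the lower index
     * wins, which models "the rule listed first".  If foo_first is
     * nonzero, rule 0 is foo and rule 1 is [a-z]+; otherwise the two
     * rules are in the order shown in the example above.
     */
    int pick_rule(const char *s, int foo_first)
    {
        size_t len[2];

        len[0] = foo_first ? foo_len(s) : ident_len(s);
        len[1] = foo_first ? ident_len(s) : foo_len(s);
        if (len[0] == 0 && len[1] == 0)
            return -1;
        return len[1] > len[0] ? 1 : 0;
    }
```

.fi
With the identifier rule listed first (foo_first set to 0), the model never
returns rule 1 for any input, because "foo" can never match more text than
the identifier rule does.  Listing "foo" first lets it win the tie on the
input "foo", while longer identifiers such as "foobar" still go to the
identifier rule.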
                   3816: .PP
                   3817: .I warning,
                   3818: .B \-s
                   3819: .I
                   3820: option given but default rule can be matched
                   3821: means that it is possible (perhaps only in a particular start condition)
                   3822: that the default rule (match any single character) is the only one
                   3823: that will match a particular input.  Since
                   3824: .B \-s
                   3825: was given, presumably this is not intended.
                   3826: .PP
                   3827: .I reject_used_but_not_detected undefined
                   3828: or
                   3829: .I yymore_used_but_not_detected undefined -
                   3830: These errors can occur at compile time.  They indicate that the
                   3831: scanner uses
                   3832: .B REJECT
                   3833: or
                   3834: .B yymore()
                   3835: but that
                   3836: .I flex
                   3837: failed to notice the fact, meaning that
                   3838: .I flex
                   3839: scanned the first two sections looking for occurrences of these actions
1.10      deraadt  3840: and failed to find any, but somehow you snuck some in (via an #include
1.1       deraadt  3841: file, for example).  Use
                   3842: .B %option reject
                   3843: or
                   3844: .B %option yymore
                   3845: to indicate to flex that you really do use these features.
                   3846: .PP
                   3847: .I flex scanner jammed -
                   3848: a scanner compiled with
                   3849: .B \-s
                   3850: has encountered an input string which wasn't matched by
                   3851: any of its rules.  This error can also occur due to internal problems.
                   3852: .PP
                   3853: .I token too large, exceeds YYLMAX -
                   3854: your scanner uses
                   3855: .B %array
                   3856: and one of its rules matched a string longer than the
                   3857: .B YYLMAX
                   3858: constant (8K bytes by default).  You can increase the value by
                   3859: #define'ing
                   3860: .B YYLMAX
                   3861: in the definitions section of your
                   3862: .I flex
                   3863: input.
                   3864: .PP
                   3865: .I scanner requires \-8 flag to
                   3866: .I use the character 'x' -
                   3867: Your scanner specification includes recognizing the 8-bit character
                   3868: .I 'x'
                   3869: and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
                   3870: because you used the
                   3871: .B \-Cf
                   3872: or
                   3873: .B \-CF
                   3874: table compression options.  See the discussion of the
                   3875: .B \-7
                   3876: flag for details.
                   3877: .PP
                   3878: .I flex scanner push-back overflow -
                   3879: you used
                   3880: .B unput()
                   3881: to push back so much text that the scanner's buffer could not hold
                   3882: both the pushed-back text and the current token in
                   3883: .B yytext.
                   3884: Ideally the scanner should dynamically resize the buffer in this case, but at
                   3885: present it does not.
                   3886: .PP
                   3887: .I
                   3888: input buffer overflow, can't enlarge buffer because scanner uses REJECT -
                   3889: the scanner was working on matching an extremely large token and needed
                   3890: to expand the input buffer.  This doesn't work with scanners that use
                   3891: .B
                   3892: REJECT.
                   3893: .PP
                   3894: .I
                   3895: fatal flex scanner internal error--end of buffer missed -
                   3896: This can occur in a scanner which is reentered after a long-jump
                   3897: has jumped out (or over) the scanner's activation frame.  Before
                   3898: reentering the scanner, use:
                   3899: .nf
                   3900:
                   3901:     yyrestart( yyin );
                   3902:
                   3903: .fi
                   3904: or, as noted above, switch to using the C++ scanner class.
                   3905: .PP
                   3906: .I too many start conditions in <> construct! -
                   3907: you listed more start conditions in a <> construct than exist (so
                   3908: you must have listed at least one of them twice).
                   3909: .SH FILES
                   3910: .TP
                   3911: .B \-lfl
                   3912: library with which scanners must be linked.
                   3913: .TP
                   3914: .I lex.yy.c
                   3915: generated scanner (called
                   3916: .I lexyy.c
                   3917: on some systems).
                   3918: .TP
                   3919: .I lex.yy.cc
                   3920: generated C++ scanner class, when using
                   3921: .B -+.
                   3922: .TP
1.5       deraadt  3923: .I <g++/FlexLexer.h>
1.1       deraadt  3924: header file defining the C++ scanner base class,
                   3925: .B FlexLexer,
                   3926: and its derived class,
                   3927: .B yyFlexLexer.
                   3928: .TP
                   3929: .I flex.skl
                   3930: skeleton scanner.  This file is only used when building flex, not when
                   3931: flex executes.
                   3932: .TP
                   3933: .I lex.backup
                   3934: backing-up information for
                   3935: .B \-b
                   3936: flag (called
                   3937: .I lex.bck
                   3938: on some systems).
                   3939: .SH DEFICIENCIES / BUGS
                   3940: .PP
                   3941: Some trailing context
                   3942: patterns cannot be properly matched and generate
                   3943: warning messages ("dangerous trailing context").  These are
                   3944: patterns where the ending of the
                   3945: first part of the rule matches the beginning of the second
                   3946: part, such as "zx*/xy*", where the 'x*' matches the 'x' at
                   3947: the beginning of the trailing context.  (Note that the POSIX draft
                   3948: states that the text matched by such patterns is undefined.)
                   3949: .PP
                   3950: For some trailing context rules, parts which are actually fixed-length are
1.3       deraadt  3951: not recognized as such, leading to the above-mentioned performance loss.
1.1       deraadt  3952: In particular, parts using '|' or {n} (such as "foo{3}") are always
                   3953: considered variable-length.
                   3954: .PP
                   3955: Combining trailing context with the special '|' action can result in
                   3956: .I fixed
                   3957: trailing context being turned into the more expensive
                   3958: .I variable
                   3959: trailing context.  This happens, for example, in the following:
                   3960: .nf
                   3961:
                   3962:     %%
                   3963:     abc      |
                   3964:     xyz/def
                   3965:
                   3966: .fi
                   3967: .PP
                   3968: Use of
                   3969: .B unput()
                   3970: invalidates yytext and yyleng, unless the
                   3971: .B %array
                   3972: directive
                   3973: or the
                   3974: .B \-l
                   3975: option has been used.
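.PP
The effect can be pictured with a stand-alone C model, offered only as a
sketch.  It assumes the default
.B %pointer
behaviour, in which yytext points into the scanner's input buffer and each
.B unput()
stores its character just before the current scan position, so that
pushed-back text overwrites the token from its right-hand end.
The names
.B model_buf, model_text, model_pos, model_unput()
and
.B demo()
are hypothetical and are not part of
.I flex.
.nf

```c
    #include <string.h>

    char model_buf[16];     /* stands in for the scanner's input buffer */
    char *model_text;       /* stands in for yytext (points into model_buf) */
    char *model_pos;        /* next character the scanner would read */

    /* Push one character back; the model's stand-in for unput(). */
    void model_unput(int c)
    {
        *--model_pos = (char)c;
    }

    /*
     * Match the token "abc", save a copy of it, then push back "XY"
     * back-to-front.  If save_copy is nonzero the saved copy is
     * inspected, otherwise the in-buffer text is.  Returns 1 if the
     * inspected text still equals "abc", 0 otherwise.
     */
    int demo(int save_copy)
    {
        char saved[16];
        const char *seen;

        strcpy(model_buf, "abc");
        model_text = model_buf;
        model_pos = model_buf + 3;      /* just past the matched token */

        strcpy(saved, model_text);      /* copy taken before any unput */
        model_unput('Y');
        model_unput('X');

        seen = save_copy ? saved : model_text;
        return strcmp(seen, "abc") == 0;
    }
```

.fi
In the model, the in-buffer text reads "aXY" after the two calls, while the
copy taken beforehand is unchanged.  In a real action the same idea applies:
copy yytext before the first
.B unput()
if its value is needed afterwards.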
                   3976: .PP
                   3977: Pattern-matching of NUL's is substantially slower than matching other
                   3978: characters.
                   3979: .PP
                   3980: Dynamic resizing of the input buffer is slow, as it entails rescanning
                   3981: all the text matched so far by the current (generally huge) token.
                   3982: .PP
                   3983: Due to both buffering of input and read-ahead, you cannot intermix
                   3984: calls to <stdio.h> routines, such as, for example,
                   3985: .B getchar(),
                   3986: with
                   3987: .I flex
                   3988: rules and expect it to work.  Call
                   3989: .B input()
                   3990: instead.
                   3991: .PP
                   3992: The total number of table entries listed by the
                   3993: .B \-v
                   3994: flag excludes the number of table entries needed to determine
                   3995: what rule has been matched.  The number of entries is equal
                   3996: to the number of DFA states if the scanner does not use
                   3997: .B REJECT,
                   3998: and somewhat greater than the number of states if it does.
                   3999: .PP
                   4000: .B REJECT
                   4001: cannot be used with the
                   4002: .B \-f
                   4003: or
                   4004: .B \-F
                   4005: options.
                   4006: .PP
                   4007: The
                   4008: .I flex
                   4009: internal algorithms need documentation.
                   4010: .SH SEE ALSO
                   4011: .PP
                   4012: lex(1), yacc(1), sed(1), awk(1).
                   4013: .PP
                   4014: John Levine, Tony Mason, and Doug Brown,
                   4015: .I Lex & Yacc,
                   4016: O'Reilly and Associates.  Be sure to get the 2nd edition.
                   4017: .PP
                   4018: M. E. Lesk and E. Schmidt,
                   4019: .I LEX \- Lexical Analyzer Generator
                   4020: .PP
                   4021: Alfred Aho, Ravi Sethi and Jeffrey Ullman,
                   4022: .I Compilers: Principles, Techniques and Tools,
                   4023: Addison-Wesley (1986).  Describes the pattern-matching techniques used by
                   4024: .I flex
                   4025: (deterministic finite automata).
                   4026: .SH AUTHOR
                   4027: Vern Paxson, with the help of many ideas and much inspiration from
                   4028: Van Jacobson.  Original version by Jef Poskanzer.  The fast table
                   4029: representation is a partial implementation of a design done by Van
                   4030: Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
                   4031: .PP
                   4032: Thanks to the many
                   4033: .I flex
                   4034: beta-testers, feedbackers, and contributors, especially Francois Pinard,
                   4035: Casey Leedom,
                   4036: Robert Abramovitz,
                   4037: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
                   4038: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
                   4039: Karl Berry, Peter A. Bigot, Simon Blanchard,
                   4040: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
                   4041: Brian Clapper, J.T. Conklin,
                   4042: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11      deraadt  4043: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1       deraadt  4044: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
                   4045: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
                   4046: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
                   4047: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
                   4048: Jan Hajic, Charles Hemphill, NORO Hideo,
                   4049: Jarkko Hietaniemi, Scott Hofmann,
                   4050: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
                   4051: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
                   4052: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
                   4053: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
                   4054: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
                   4055: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
                   4056: David Loffredo, Mike Long,
                   4057: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
                   4058: Bengt Martensson, Chris Metcalf,
                   4059: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
                   4060: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
                   4061: Richard Ohnemus, Karsten Pahnke,
                   4062: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
                   4063: Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
                   4064: Frederic Raimbault, Pat Rankin, Rick Richardson,
                   4065: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
                   4066: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
                   4067: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
                   4068: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strömquist,
                   4069: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
                   4070: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
                   4071: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
                   4072: Yap, Ron Zellar, Nathan Zelle, David Zuhn,
                   4073: and those whose names have slipped my marginal
                   4074: mail-archiving skills but whose contributions are appreciated all the
                   4075: same.
                   4076: .PP
                   4077: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
                   4078: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
                   4079: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
                   4080: distribution headaches.
                   4081: .PP
                   4082: Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
                   4083: Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
                   4084: Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
                   4085: Eric Hughes for support of multiple buffers.
                   4086: .PP
                   4087: This work was primarily done when I was with the Real Time Systems Group
                   4088: at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks to all there
                   4089: for the support I received.
                   4090: .PP
                   4091: Send comments to vern@ee.lbl.gov.