Annotation of src/usr.bin/lex/flex.1, Revision 1.42

1.42    ! nicm        1: .\"    $OpenBSD: flex.1,v 1.41 2015/09/07 15:28:06 sobrado Exp $
1.16      jmc         2: .\"
1.12      jmc         3: .\" Copyright (c) 1990 The Regents of the University of California.
                      4: .\" All rights reserved.
1.2       deraadt     5: .\"
1.12      jmc         6: .\" This code is derived from software contributed to Berkeley by
                      7: .\" Vern Paxson.
                      8: .\"
                      9: .\" The United States Government has rights in this work pursuant
                     10: .\" to contract no. DE-AC03-76SF00098 between the United States
                     11: .\" Department of Energy and the University of California.
                     12: .\"
                     13: .\" Redistribution and use in source and binary forms, with or without
1.13      millert    14: .\" modification, are permitted provided that the following conditions
                     15: .\" are met:
                     16: .\"
                     17: .\" 1. Redistributions of source code must retain the above copyright
                     18: .\"    notice, this list of conditions and the following disclaimer.
                     19: .\" 2. Redistributions in binary form must reproduce the above copyright
                     20: .\"    notice, this list of conditions and the following disclaimer in the
                     21: .\"    documentation and/or other materials provided with the distribution.
                     22: .\"
                     23: .\" Neither the name of the University nor the names of its contributors
                     24: .\" may be used to endorse or promote products derived from this software
                     25: .\" without specific prior written permission.
                     26: .\"
                     27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
                     28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
                     29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
                     30: .\" PURPOSE.
1.16      jmc        31: .\"
1.42    ! nicm       32: .Dd $Mdocdate: September 7 2015 $
1.16      jmc        33: .Dt FLEX 1
                     34: .Os
                     35: .Sh NAME
1.42    ! nicm       36: .Nm flex ,
        !            37: .Nm flex++ ,
        !            38: .Nm lex
1.16      jmc        39: .Nd fast lexical analyzer generator
                     40: .Sh SYNOPSIS
                     41: .Nm
1.28      jmc        42: .Bk -words
1.31      jmc        43: .Op Fl 78BbdFfhIiLlnpsTtVvw+?
1.16      jmc        44: .Op Fl C Ns Op Cm aeFfmr
                     45: .Op Fl Fl help
                     46: .Op Fl Fl version
1.28      jmc        47: .Op Fl o Ns Ar output
                     48: .Op Fl P Ns Ar prefix
                     49: .Op Fl S Ns Ar skeleton
                     50: .Op Ar
                     51: .Ek
1.21      jmc        52: .Sh DESCRIPTION
                     53: .Nm
                     54: is a tool for generating
                     55: .Em scanners :
                     56: programs which recognize lexical patterns in text.
                     57: .Nm
                     58: reads the given input files, or its standard input if no file names are given,
                     59: for a description of a scanner to generate.
                     60: The description is in the form of pairs of regular expressions and C code,
                     61: called
                     62: .Em rules .
                     63: .Nm
                     64: generates as output a C source file,
                     65: .Pa lex.yy.c ,
                     66: which defines a routine
                     67: .Fn yylex .
                     68: This file is compiled and linked with the
                     69: .Fl lfl
                     70: library to produce an executable.
                     71: When the executable is run, it analyzes its input for occurrences
                     72: of the regular expressions.
                     73: Whenever it finds one, it executes the corresponding C code.
1.42    ! nicm       74: .Pp
        !            75: .Nm lex
        !            76: is a synonym for
        !            77: .Nm flex .
        !            78: .Pp
        !            79: .Nm flex++
        !            80: is a synonym for
        !            81: .Nm
        !            82: .Fl + .
1.21      jmc        83: .Pp
1.16      jmc        84: The manual includes both tutorial and reference sections:
                     85: .Bl -ohang
                     86: .It Sy Some Simple Examples
                     87: .It Sy Format of the Input File
                     88: .It Sy Patterns
                     89: The extended regular expressions used by
                     90: .Nm .
                     91: .It Sy How the Input is Matched
                     92: The rules for determining what has been matched.
                     93: .It Sy Actions
                     94: How to specify what to do when a pattern is matched.
                     95: .It Sy The Generated Scanner
                     96: Details regarding the scanner that
                     97: .Nm
                     98: produces;
                     99: how to control the input source.
                    100: .It Sy Start Conditions
                    101: Introducing context into scanners, and managing
                    102: .Qq mini-scanners .
                    103: .It Sy Multiple Input Buffers
                    104: How to manipulate multiple input sources;
                    105: how to scan from strings instead of files.
                    106: .It Sy End-of-File Rules
                    107: Special rules for matching the end of the input.
                    108: .It Sy Miscellaneous Macros
                    109: A summary of macros available to the actions.
                    110: .It Sy Values Available to the User
                    111: A summary of values available to the actions.
                    112: .It Sy Interfacing with Yacc
                    113: Connecting flex scanners together with
                    114: .Xr yacc 1
                    115: parsers.
                    116: .It Sy Options
                    117: .Nm
                    118: command-line options, and the
                    119: .Dq %option
                    120: directive.
                    121: .It Sy Performance Considerations
                    122: How to make scanners go as fast as possible.
                    123: .It Sy Generating C++ Scanners
                    124: The
                    125: .Pq experimental
                    126: facility for generating C++ scanner classes.
                    127: .It Sy Incompatibilities with Lex and POSIX
                    128: How
                    129: .Nm
1.36      schwarze  130: differs from
                    131: .At
                    132: .Nm lex
                    133: and the
1.16      jmc       134: .Tn POSIX
1.36      schwarze  135: .Nm lex
                    136: standard.
1.16      jmc       137: .It Sy Files
                    138: Files used by
                    139: .Nm .
                    140: .It Sy Diagnostics
                    141: Those error messages produced by
                    142: .Nm
                    143: .Pq or scanners it generates
                    144: whose meanings might not be apparent.
                    145: .It Sy See Also
                    146: Other documentation, related tools.
                    147: .It Sy Authors
                    148: Includes contact information.
                    149: .It Sy Bugs
                    150: Known problems with
                    151: .Nm .
                    152: .El
                    153: .Sh SOME SIMPLE EXAMPLES
1.1       deraadt   154: First some simple examples to get the flavor of how one uses
1.16      jmc       155: .Nm .
1.1       deraadt   156: The following
1.16      jmc       157: .Nm
1.1       deraadt   158: input specifies a scanner which, whenever it encounters the string
1.16      jmc       159: .Qq username ,
                     160: replaces it with the user's login name:
                    161: .Bd -literal -offset indent
                    162: %%
                    163: username    printf("%s", getlogin());
                    164: .Ed
                    165: .Pp
1.1       deraadt   166: By default, any text not matched by a
1.16      jmc       167: .Nm
                    168: scanner is copied to the output, so the net effect of this scanner is
                    169: to copy its input file to its output with each occurrence of
                    170: .Qq username
                    171: expanded.
                    172: In this input, there is just one rule.
                    173: .Qq username
                    174: is the
                    175: .Em pattern
                    176: and the
                    177: .Qq printf
                    178: is the
                    179: .Em action .
                    180: The
                    181: .Qq %%
                    182: marks the beginning of the rules.
                    183: .Pp
1.1       deraadt   184: Here's another simple example:
1.16      jmc       185: .Bd -literal -offset indent
1.20      pvalchev  186: %{
1.16      jmc       187: int num_lines = 0, num_chars = 0;
1.20      pvalchev  188: %}
1.1       deraadt   189:
1.16      jmc       190: %%
                    191: \en      ++num_lines; ++num_chars;
                    192: \&.       ++num_chars;
                    193:
                    194: %%
                     195: int main(void)
                    196: {
                    197:        yylex();
                    198:        printf("# of lines = %d, # of chars = %d\en",
                    199:             num_lines, num_chars);
                    200: }
                    201: .Ed
                    202: .Pp
1.1       deraadt   203: This scanner counts the number of characters and the number
1.16      jmc       204: of lines in its input
                    205: (it produces no output other than the final report on the counts).
                     206: The definitions section declares two globals,
                    207: .Qq num_lines
                    208: and
                    209: .Qq num_chars ,
                    210: which are accessible both inside
                    211: .Fn yylex
1.1       deraadt   212: and in the
1.16      jmc       213: .Fn main
                    214: routine declared after the second
                    215: .Qq %% .
                    216: There are two rules, one which matches a newline
                    217: .Pq \&"\en\&"
                    218: and increments both the line count and the character count,
                    219: and one which matches any character other than a newline
                    220: (indicated by the
                    221: .Qq \&.
                    222: regular expression).
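The effect of these two rules can be sketched in plain C.
This is a hand-written illustration, not the code that flex generates;
the names count_buffer and struct counts are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: the totals the two rules keep. */
struct counts {
	int num_lines;
	int num_chars;
};

/*
 * Walk a buffer the way the two rules see it: a newline bumps both
 * counters, any other byte bumps only the character counter.
 */
struct counts
count_buffer(const char *buf, size_t len)
{
	struct counts c = { 0, 0 };
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (buf[i] == '\n')
			++c.num_lines;	/* the "\n" rule */
		++c.num_chars;		/* both rules count the byte */
	}
	return c;
}
```

Every input byte is matched by exactly one of the two rules, which is why the character counter advances on each iteration.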
                    223: .Pp
1.1       deraadt   224: A somewhat more complicated example:
1.16      jmc       225: .Bd -literal -offset indent
                    226: /* scanner for a toy Pascal-like language */
1.1       deraadt   227:
1.16      jmc       228: %{
                    229: /* need this for the call to atof() below */
                     230: #include <stdlib.h>
                    231: %}
1.1       deraadt   232:
1.16      jmc       233: DIGIT    [0-9]
                    234: ID       [a-z][a-z0-9]*
1.1       deraadt   235:
1.16      jmc       236: %%
1.1       deraadt   237:
1.16      jmc       238: {DIGIT}+ {
                    239:         printf("An integer: %s (%d)\en", yytext,
                    240:             atoi(yytext));
                    241: }
1.1       deraadt   242:
1.16      jmc       243: {DIGIT}+"."{DIGIT}* {
                    244:         printf("A float: %s (%g)\en", yytext,
                    245:             atof(yytext));
                    246: }
1.1       deraadt   247:
1.16      jmc       248: if|then|begin|end|procedure|function {
                    249:         printf("A keyword: %s\en", yytext);
                    250: }
1.1       deraadt   251:
1.16      jmc       252: {ID}    printf("An identifier: %s\en", yytext);
1.1       deraadt   253:
1.16      jmc       254: "+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);
1.1       deraadt   255:
1.16      jmc       256: "{"[^}\en]*"}"     /* eat up one-line comments */
1.1       deraadt   257:
1.16      jmc       258: [ \et\en]+          /* eat up whitespace */
1.1       deraadt   259:
1.16      jmc       260: \&.       printf("Unrecognized character: %s\en", yytext);
1.1       deraadt   261:
1.16      jmc       262: %%
1.1       deraadt   263:
1.16      jmc       264: int main(int argc, char *argv[])
                    265: {
                    266:         ++argv; --argc;  /* skip over program name */
                    267:         if (argc > 0)
                    268:                 yyin = fopen(argv[0], "r");
1.1       deraadt   269:         else
                    270:                 yyin = stdin;
1.7       aaron     271:
1.1       deraadt   272:         yylex();
1.16      jmc       273: }
                    274: .Ed
                    275: .Pp
                     276: This is the beginning of a simple scanner for a language like Pascal.
                    277: It identifies different types of
                    278: .Em tokens
1.1       deraadt   279: and reports on what it has seen.
1.16      jmc       280: .Pp
                    281: The details of this example will be explained in the following sections.
                    282: .Sh FORMAT OF THE INPUT FILE
1.1       deraadt   283: The
1.16      jmc       284: .Nm
1.1       deraadt   285: input file consists of three sections, separated by a line with just
1.16      jmc       286: .Qq %%
1.1       deraadt   287: in it:
1.16      jmc       288: .Bd -unfilled -offset indent
                    289: definitions
                    290: %%
                    291: rules
                    292: %%
                    293: user code
                    294: .Ed
                    295: .Pp
1.1       deraadt   296: The
1.16      jmc       297: .Em definitions
1.1       deraadt   298: section contains declarations of simple
1.16      jmc       299: .Em name
1.1       deraadt   300: definitions to simplify the scanner specification, and declarations of
1.16      jmc       301: .Em start conditions ,
1.1       deraadt   302: which are explained in a later section.
1.16      jmc       303: .Pp
1.1       deraadt   304: Name definitions have the form:
1.16      jmc       305: .Pp
                    306: .D1 name definition
                    307: .Pp
                    308: The
                    309: .Qq name
                    310: is a word beginning with a letter or an underscore
                    311: .Pq Sq _
                    312: followed by zero or more letters, digits,
                    313: .Sq _ ,
                    314: or
                    315: .Sq -
                    316: .Pq dash .
1.8       aaron     317: The definition is taken to begin at the first non-whitespace character
1.1       deraadt   318: following the name and continuing to the end of the line.
1.16      jmc       319: The definition can subsequently be referred to using
                    320: .Qq {name} ,
                    321: which will expand to
                    322: .Qq (definition) .
                    323: For example:
                    324: .Bd -literal -offset indent
                    325: DIGIT    [0-9]
                    326: ID       [a-z][a-z0-9]*
                    327: .Ed
                    328: .Pp
                    329: This defines
                    330: .Qq DIGIT
                    331: to be a regular expression which matches a single digit, and
                    332: .Qq ID
                    333: to be a regular expression which matches a letter
1.1       deraadt   334: followed by zero-or-more letters-or-digits.
                    335: A subsequent reference to
1.16      jmc       336: .Pp
                    337: .Dl {DIGIT}+"."{DIGIT}*
                    338: .Pp
1.1       deraadt   339: is identical to
1.16      jmc       340: .Pp
                    341: .Dl ([0-9])+"."([0-9])*
                    342: .Pp
                    343: and matches one-or-more digits followed by a
                    344: .Sq .\&
                    345: followed by zero-or-more digits.
                    346: .Pp
1.1       deraadt   347: The
1.16      jmc       348: .Em rules
1.1       deraadt   349: section of the
1.16      jmc       350: .Nm
1.1       deraadt   351: input contains a series of rules of the form:
1.16      jmc       352: .Pp
1.35      schwarze  353: .Dl pattern    action
1.16      jmc       354: .Pp
                    355: The pattern must be unindented and the action must begin
1.1       deraadt   356: on the same line.
1.16      jmc       357: .Pp
1.1       deraadt   358: See below for a further description of patterns and actions.
1.16      jmc       359: .Pp
1.1       deraadt   360: Finally, the user code section is simply copied to
1.16      jmc       361: .Pa lex.yy.c
1.1       deraadt   362: verbatim.
1.16      jmc       363: It is used for companion routines which call or are called by the scanner.
                    364: The presence of this section is optional;
1.1       deraadt   365: if it is missing, the second
1.16      jmc       366: .Qq %%
                    367: in the input file may be skipped too.
                    368: .Pp
                    369: In the definitions and rules sections, any indented text or text enclosed in
                    370: .Sq %{
1.1       deraadt   371: and
1.16      jmc       372: .Sq %}
                    373: is copied verbatim to the output
                    374: .Pq with the %{}'s removed .
1.1       deraadt   375: The %{}'s must appear unindented on lines by themselves.
1.16      jmc       376: .Pp
1.1       deraadt   377: In the rules section,
1.16      jmc       378: any indented or %{} text appearing before the first rule may be used to
                    379: declare variables which are local to the scanning routine and
                    380: .Pq after the declarations
1.1       deraadt   381: code which is to be executed whenever the scanning routine is entered.
                    382: Other indented or %{} text in the rule section is still copied to the output,
                    383: but its meaning is not well-defined and it may well cause compile-time
                    384: errors (this feature is present for
1.16      jmc       385: .Tn POSIX
1.1       deraadt   386: compliance; see below for other such features).
1.16      jmc       387: .Pp
                    388: In the definitions section
                    389: .Pq but not in the rules section ,
                    390: an unindented comment
                    391: (i.e., a line beginning with
                    392: .Qq /* )
                    393: is also copied verbatim to the output up to the next
                    394: .Qq */ .
                    395: .Sh PATTERNS
1.1       deraadt   396: The patterns in the input are written using an extended set of regular
1.16      jmc       397: expressions.
                    398: These are:
                    399: .Bl -tag -width "XXXXXXXX"
                    400: .It x
                    401: Match the character
                    402: .Sq x .
                    403: .It .\&
                    404: Any character
                    405: .Pq byte
                    406: except newline.
                    407: .It [xyz]
                    408: A
                    409: .Qq character class ;
                    410: in this case, the pattern matches either an
                    411: .Sq x ,
                    412: a
                    413: .Sq y ,
                    414: or a
                    415: .Sq z .
                    416: .It [abj-oZ]
                    417: A
                    418: .Qq character class
                    419: with a range in it; matches an
                    420: .Sq a ,
                    421: a
                    422: .Sq b ,
                    423: any letter from
                    424: .Sq j
                    425: through
                    426: .Sq o ,
                    427: or a
                    428: .Sq Z .
                    429: .It [^A-Z]
                    430: A
                    431: .Qq negated character class ,
                    432: i.e., any character but those in the class.
                    433: In this case, any character EXCEPT an uppercase letter.
                    434: .It [^A-Z\en]
                    435: Any character EXCEPT an uppercase letter or a newline.
                    436: .It r*
                    437: Zero or more r's, where
                    438: .Sq r
                    439: is any regular expression.
                    440: .It r+
                    441: One or more r's.
                    442: .It r?
                    443: Zero or one r's (that is,
                    444: .Qq an optional r ) .
                    445: .It r{2,5}
                    446: Anywhere from two to five r's.
                    447: .It r{2,}
                    448: Two or more r's.
                    449: .It r{4}
                    450: Exactly 4 r's.
                    451: .It {name}
                    452: The expansion of the
                    453: .Qq name
                    454: definition
                    455: .Pq see above .
                    456: .It \&"[xyz]\e\&"foo\&"
                    457: The literal string: [xyz]"foo.
                    458: .It \eX
                    459: If
                    460: .Sq X
                    461: is an
                    462: .Sq a ,
                    463: .Sq b ,
                    464: .Sq f ,
                    465: .Sq n ,
                    466: .Sq r ,
                    467: .Sq t ,
                    468: or
                    469: .Sq v ,
                    470: then the ANSI-C interpretation of
                    471: .Sq \eX .
                    472: Otherwise, a literal
                    473: .Sq X
                    474: (used to escape operators such as
                    475: .Sq * ) .
                    476: .It \e0
                    477: A NUL character
                    478: .Pq ASCII code 0 .
                    479: .It \e123
                    480: The character with octal value 123.
                    481: .It \ex2a
                    482: The character with hexadecimal value 2a.
                    483: .It (r)
                    484: Match an
                    485: .Sq r ;
                    486: parentheses are used to override precedence
                    487: .Pq see below .
                    488: .It rs
                    489: The regular expression
                    490: .Sq r
                    491: followed by the regular expression
                    492: .Sq s ;
                    493: called
                    494: .Qq concatenation .
                    495: .It r|s
                    496: Either an
                    497: .Sq r
                    498: or an
                    499: .Sq s .
                    500: .It r/s
                    501: An
                    502: .Sq r ,
                    503: but only if it is followed by an
                    504: .Sq s .
                    505: The text matched by
                    506: .Sq s
                    507: is included when determining whether this rule is the
                    508: .Qq longest match ,
                    509: but is then returned to the input before the action is executed.
                    510: So the action only sees the text matched by
                    511: .Sq r .
                    512: This type of pattern is called
                    513: .Qq trailing context .
                    514: (There are some combinations of r/s that
                    515: .Nm
                    516: cannot match correctly; see notes in the
                    517: .Sx BUGS
                    518: section below regarding
                    519: .Qq dangerous trailing context . )
                    520: .It ^r
                    521: An
                    522: .Sq r ,
                    523: but only at the beginning of a line
                    524: (i.e., just starting to scan, or right after a newline has been scanned).
                    525: .It r$
                    526: An
                    527: .Sq r ,
                    528: but only at the end of a line
                    529: .Pq i.e., just before a newline .
                    530: Equivalent to
                    531: .Qq r/\en .
                    532: .Pp
                    533: Note that
                    534: .Nm flex Ns 's
                    535: notion of
                    536: .Qq newline
                    537: is exactly whatever the C compiler used to compile
                    538: .Nm
                    539: interprets
                    540: .Sq \en
                    541: as.
                    542: .\" In particular, on some DOS systems you must either filter out \er's in the
                    543: .\" input yourself, or explicitly use r/\er\en for
                    544: .\" .Qq r$ .
                    545: .It <s>r
                    546: An
                    547: .Sq r ,
                    548: but only in start condition
                    549: .Sq s
                    550: .Pq see below for discussion of start conditions .
                    551: .It <s1,s2,s3>r
                    552: The same, but in any of start conditions s1, s2, or s3.
                    553: .It <*>r
                    554: An
                    555: .Sq r
                    556: in any start condition, even an exclusive one.
                    557: .It <<EOF>>
                    558: An end-of-file.
                    559: .It <s1,s2><<EOF>>
                    560: An end-of-file when in start condition s1 or s2.
                    561: .El
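The escape forms in the list above (\eX, \e0, \e123, \ex2a) can be illustrated with a short C decoder.
This is a simplified sketch, not flex's own implementation: escape_value is a hypothetical name, the argument is the text following the backslash, and malformed input such as an "x" with no hexadecimal digits is not diagnosed.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the character value denoted by the
 * escape whose text (after the backslash) is s.
 */
int
escape_value(const char *s)
{
	assert(s != NULL && *s != '\0');
	switch (*s) {
	case 'a': return '\a';
	case 'b': return '\b';
	case 'f': return '\f';
	case 'n': return '\n';
	case 'r': return '\r';
	case 't': return '\t';
	case 'v': return '\v';
	case 'x':
		/* hexadecimal value, e.g. "x2a" */
		return (int)strtol(s + 1, NULL, 16);
	default:
		/* octal value, e.g. "0" or "123" */
		if (*s >= '0' && *s <= '7')
			return (int)strtol(s, NULL, 8);
		/* anything else is the literal character, e.g. "*" */
		return (unsigned char)*s;
	}
}
```

With this sketch, "123" yields 83 (octal 123) and "x2a" yields 42 (hexadecimal 2a).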
                    562: .Pp
1.1       deraadt   563: Note that inside of a character class, all regular expression operators
1.16      jmc       564: lose their special meaning except escape
                    565: .Pq Sq \e
                    566: and the character class operators,
                    567: .Sq - ,
                    568: .Sq ]\& ,
                    569: and, at the beginning of the class,
                    570: .Sq ^ .
                    571: .Pp
1.1       deraadt   572: The regular expressions listed above are grouped according to
                    573: precedence, from highest precedence at the top to lowest at the bottom.
1.16      jmc       574: Those grouped together have equal precedence.
                    575: For example,
                    576: .Pp
                    577: .D1 foo|bar*
                    578: .Pp
1.1       deraadt   579: is the same as
1.16      jmc       580: .Pp
                    581: .D1 (foo)|(ba(r*))
                    582: .Pp
                    583: since the
                    584: .Sq *
                    585: operator has higher precedence than concatenation,
                    586: and concatenation higher than alternation
                    587: .Pq Sq |\& .
                    588: This pattern therefore matches
                    589: .Em either
                    590: the string
                    591: .Qq foo
                    592: .Em or
                    593: the string
                    594: .Qq ba
                    595: followed by zero-or-more r's.
                    596: To match
                    597: .Qq foo
                    598: or zero-or-more "bar"'s,
                    599: use:
                    600: .Pp
                    601: .D1 foo|(bar)*
                    602: .Pp
1.1       deraadt   603: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16      jmc       604: .Pp
                    605: .D1 (foo|bar)*
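The three groupings above can be tried out with the POSIX regcomp(3) interface, whose extended regular expressions give these three operators the same relative precedence.
This is a stand-in for experimentation, not flex itself: matches_whole is a hypothetical helper, and it tests whether an entire string matches, whereas a flex scanner looks for the longest match at the current input position.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: return 1 if the whole of text matches the
 * extended regular expression pattern, 0 if it does not, and -1 if
 * the pattern is too long or fails to compile.
 */
int
matches_whole(const char *pattern, const char *text)
{
	char anchored[256];
	regex_t re;
	int n, rc;

	assert(pattern != NULL && text != NULL);
	/* Anchor the pattern so only whole-string matches count. */
	n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return -1;
	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

For instance, matches_whole("foo|bar*", "barbar") is 0 because only the final "r" is repeated, while matches_whole("foo|(bar)*", "barbar") is 1.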
                    606: .Pp
1.1       deraadt   607: In addition to characters and ranges of characters, character classes
                    608: can also contain character class
1.16      jmc       609: .Em expressions .
1.1       deraadt   610: These are expressions enclosed inside
1.16      jmc       611: .Sq [:
                    612: and
                    613: .Sq :]
                    614: delimiters (which themselves must appear between the
1.26      schwarze  615: .Sq \&[
1.1       deraadt   616: and
1.16      jmc       617: .Sq ]\&
                    618: of the
1.1       deraadt   619: character class; other elements may occur inside the character class, too).
                    620: The valid expressions are:
1.16      jmc       621: .Bd -unfilled -offset indent
                    622: [:alnum:] [:alpha:] [:blank:]
                    623: [:cntrl:] [:digit:] [:graph:]
                    624: [:lower:] [:print:] [:punct:]
                    625: [:space:] [:upper:] [:xdigit:]
                    626: .Ed
                    627: .Pp
1.1       deraadt   628: These expressions all designate a set of characters equivalent to
                    629: the corresponding standard C
1.16      jmc       630: .Fn isXXX
                    631: function.
                    632: For example, [:alnum:] designates those characters for which
                    633: .Xr isalnum 3
                     634: returns true \- i.e., any alphabetic or numeric character.
1.1       deraadt   635: Some systems don't provide
1.16      jmc       636: .Xr isblank 3 ,
                    637: so
                    638: .Nm
                    639: defines [:blank:] as a blank or a tab.
                    640: .Pp
1.1       deraadt   641: For example, the following character classes are all equivalent:
1.16      jmc       642: .Bd -unfilled -offset indent
                    643: [[:alnum:]]
                    644: [[:alpha:][:digit:]]
                    645: [[:alpha:]0-9]
                    646: [a-zA-Z0-9]
                    647: .Ed
                    648: .Pp
                    649: If the scanner is case-insensitive (the
                    650: .Fl i
                    651: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
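The equivalence of the four character classes listed above can be checked directly against the ctype functions.
The sketch below assumes the default C locale and an ASCII character set; the function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <ctype.h>

/* [[:alnum:]] */
int
in_alnum(int c)
{
	assert(c >= 0 && c <= 255);
	return isalnum(c) != 0;
}

/* [[:alpha:][:digit:]] */
int
in_alpha_or_digit(int c)
{
	return isalpha(c) != 0 || isdigit(c) != 0;
}

/* [a-zA-Z0-9], written out as explicit ranges (assumes ASCII) */
int
in_explicit_ranges(int c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	    (c >= '0' && c <= '9');
}

/* [:blank:] as the manual defines it: a blank or a tab */
int
in_blank(int c)
{
	return c == ' ' || c == '\t';
}

/* Return 1 if all three alnum formulations agree on every ASCII code. */
int
classes_agree(void)
{
	int c;

	for (c = 0; c < 128; c++) {
		if (in_alnum(c) != in_alpha_or_digit(c) ||
		    in_alnum(c) != in_explicit_ranges(c))
			return 0;
	}
	return 1;
}
```

In other locales isalnum() may accept additional characters, in which case the explicit-range form would no longer agree with the other two.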
                    652: .Pp
1.1       deraadt   653: Some notes on patterns:
1.16      jmc       654: .Bl -dash
                    655: .It
                    656: A negated character class such as the example
                    657: .Qq [^A-Z]
                    658: above will match a newline unless "\en"
                    659: .Pq or an equivalent escape sequence
                    660: is one of the characters explicitly present in the negated character class
                    661: (e.g.,
                    662: .Qq [^A-Z\en] ) .
                    663: This is unlike how many other regular expression tools treat negated character
                    664: classes, but unfortunately the inconsistency is historically entrenched.
                    665: Matching newlines means that a pattern like
                    666: .Qq [^"]*
                    667: can match the entire input unless there's another quote in the input.
                    668: .It
                    669: A rule can have at most one instance of trailing context
                    670: (the
                    671: .Sq /
                    672: operator or the
                    673: .Sq $
                    674: operator).
                    675: The start condition,
                    676: .Sq ^ ,
                    677: and
                    678: .Qq <<EOF>>
1.40      jmc       679: patterns can only occur at the beginning of a pattern and, like
1.16      jmc       680: .Sq /
                    681: and
                    682: .Sq $ ,
                    683: cannot be grouped inside parentheses.
                    684: A
                    685: .Sq ^
                    686: which does not occur at the beginning of a rule or a
                    687: .Sq $
                    688: which does not occur at the end of a rule loses its special properties
                    689: and is treated as a normal character.
                    690: .It
1.1       deraadt   691: The following are illegal:
1.16      jmc       692: .Bd -unfilled -offset indent
                    693: foo/bar$
                    694: <sc1>foo<sc2>bar
                    695: .Ed
                    696: .Pp
                    697: Note that the first of these can be written
                    698: .Qq foo/bar\en .
                    699: .It
                    700: The following will result in
                    701: .Sq $
                    702: or
                    703: .Sq ^
                    704: being treated as a normal character:
                    705: .Bd -unfilled -offset indent
                    706: foo|(bar$)
                    707: foo|^bar
                    708: .Ed
                    709: .Pp
                    710: If what's wanted is a
                    711: .Qq foo
                    712: or a bar-followed-by-a-newline, the following could be used
                    713: (the special
                    714: .Sq |\&
                    715: action is explained below):
                    716: .Bd -unfilled -offset indent
                    717: foo      |
                    718: bar$     /* action goes here */
                    719: .Ed
                    720: .Pp
1.1       deraadt   721: A similar trick will work for matching a foo or a
                    722: bar-at-the-beginning-of-a-line.
1.16      jmc       723: .El
                    724: .Sh HOW THE INPUT IS MATCHED
                    725: When the generated scanner is run,
                    726: it analyzes its input looking for strings which match any of its patterns.
                    727: If it finds more than one match,
                    728: it takes the one matching the most text
                    729: (for trailing context rules, this includes the length of the trailing part,
                    730: even though it will then be returned to the input).
                    731: If it finds two or more matches of the same length,
                    732: the rule listed first in the
                    733: .Nm
1.1       deraadt   734: input file is chosen.
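.Pp
The selection rule can be illustrated with a short C sketch.
It is not how the generated scanner is implemented
(that uses tables built from the patterns);
it only mimics the rule for literal strings.
The function name
.Fn pick_rule
and its arguments are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of flex: choose among literal patterns
 * the way the text describes.  The longest match wins; on a tie the
 * pattern listed first wins.  Returns the index of the chosen pattern
 * and stores the match length in *len, or returns -1 if nothing matches.
 */
int
pick_rule(const char *input, const char *const patterns[], size_t npatterns,
    size_t *len)
{
	int best = -1;
	size_t best_len = 0;
	size_t i;

	assert(input != NULL && len != NULL);

	for (i = 0; i < npatterns; i++) {
		size_t n = strlen(patterns[i]);

		/* Skip empty patterns and patterns that are not a prefix. */
		if (n == 0 || strncmp(input, patterns[i], n) != 0)
			continue;
		/* Strictly longer only, so earlier rules win ties. */
		if (n > best_len) {
			best = (int)i;
			best_len = n;
		}
	}
	*len = best_len;
	return best;
}
```

The strict comparison is what makes the first listed pattern win
when two patterns match the same amount of text.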
1.16      jmc       735: .Pp
1.1       deraadt   736: Once the match is determined, the text corresponding to the match
                    737: (called the
1.16      jmc       738: .Em token )
1.1       deraadt   739: is made available in the global character pointer
1.16      jmc       740: .Fa yytext ,
1.1       deraadt   741: and its length in the global integer
1.16      jmc       742: .Fa yyleng .
1.1       deraadt   743: The
1.16      jmc       744: .Em action
                    745: corresponding to the matched pattern is then executed
                    746: .Pq a more detailed description of actions follows ,
                    747: and then the remaining input is scanned for another match.
                    748: .Pp
                    749: If no match is found, then the default rule is executed:
                    750: the next character in the input is considered matched and
                    751: copied to the standard output.
                    752: Thus, the simplest legal
                    753: .Nm
1.1       deraadt   754: input is:
1.16      jmc       755: .Pp
                    756: .D1 %%
                    757: .Pp
                    758: which generates a scanner that simply copies its input
                    759: .Pq one character at a time
                    760: to its output.
                    761: .Pp
1.1       deraadt   762: Note that
1.16      jmc       763: .Fa yytext
                    764: can be defined in two different ways:
                    765: either as a character pointer or as a character array.
                    766: Which definition
                    767: .Nm
                    768: uses can be controlled by including one of the special directives
                    769: .Dq %pointer
                    770: or
                    771: .Dq %array
                    772: in the first
                    773: .Pq definitions
                    774: section of flex input.
                    775: The default is
                    776: .Dq %pointer ,
                    777: unless the
                    778: .Fl l
1.36      schwarze  779: .Nm lex
                    780: compatibility option is used, in which case
1.16      jmc       781: .Fa yytext
1.1       deraadt   782: will be an array.
                    783: The advantage of using
1.16      jmc       784: .Dq %pointer
1.1       deraadt   785: is substantially faster scanning and no buffer overflow when matching
1.16      jmc       786: very large tokens
                    787: .Pq unless not enough dynamic memory is available .
                    788: The disadvantage is that actions are restricted in how they can modify
                    789: .Fa yytext
                    790: .Pq see the next section ,
                    791: and calls to the
                    792: .Fn unput
1.10      deraadt   793: function destroy the present contents of
1.16      jmc       794: .Fa yytext ,
1.1       deraadt   795: which can be a considerable porting headache when moving between different
1.16      jmc       796: .Nm lex
1.1       deraadt   797: versions.
1.16      jmc       798: .Pp
1.1       deraadt   799: The advantage of
1.16      jmc       800: .Dq %array
                    801: is that
                    802: .Fa yytext
                    803: can be modified as much as wanted, and calls to
                    804: .Fn unput
1.1       deraadt   805: do not destroy
1.16      jmc       806: .Fa yytext
                    807: .Pq see below .
                    808: Furthermore, existing
                    809: .Nm lex
1.1       deraadt   810: programs sometimes access
1.16      jmc       811: .Fa yytext
1.1       deraadt   812: externally using declarations of the form:
1.16      jmc       813: .Pp
                    814: .D1 extern char yytext[];
                    815: .Pp
1.1       deraadt   816: This definition is erroneous when used with
1.16      jmc       817: .Dq %pointer ,
1.1       deraadt   818: but correct for
1.16      jmc       819: .Dq %array .
                    820: .Pp
                    821: .Dq %array
1.1       deraadt   822: defines
1.16      jmc       823: .Fa yytext
1.1       deraadt   824: to be an array of
1.16      jmc       825: .Dv YYLMAX
                    826: characters, which defaults to a fairly large value.
                    827: The size can be changed by simply #define'ing
                    828: .Dv YYLMAX
                    829: to a different value in the first section of
                    830: .Nm
                    831: input.
                    832: As mentioned above, with
                    833: .Dq %pointer
                    834: yytext grows dynamically to accommodate large tokens.
                    835: While this means a
                    836: .Dq %pointer
                    837: scanner can accommodate very large tokens
                    838: .Pq such as matching entire blocks of comments ,
                    839: bear in mind that each time the scanner must resize
                    840: .Fa yytext
1.1       deraadt   841: it also must rescan the entire token from the beginning, so matching such
                    842: tokens can prove slow.
1.16      jmc       843: .Fa yytext
                    844: presently does not dynamically grow if a call to
                    845: .Fn unput
1.1       deraadt   846: results in too much text being pushed back; instead, a run-time error results.
1.16      jmc       847: .Pp
                    848: Also note that
                    849: .Dq %array
                    850: cannot be used with C++ scanner classes
                    851: .Pq the c++ option; see below .
                    852: .Sh ACTIONS
                    853: Each pattern in a rule has a corresponding action,
                    854: which can be any arbitrary C statement.
                    855: The pattern ends at the first non-escaped whitespace character;
                    856: the remainder of the line is its action.
                    857: If the action is empty,
                    858: then when the pattern is matched the input token is simply discarded.
                    859: For example, here is the specification for a program
                    860: which deletes all occurrences of
                    861: .Qq zap me
                    862: from its input:
                    863: .Bd -literal -offset indent
                    864: %%
                    865: "zap me"
                    866: .Ed
                    867: .Pp
1.1       deraadt   868: (It will copy all other characters in the input to the output since
                    869: they will be matched by the default rule.)
1.16      jmc       870: .Pp
1.1       deraadt   871: Here is a program which compresses multiple blanks and tabs down to
                    872: a single blank, and throws away whitespace found at the end of a line:
1.16      jmc       873: .Bd -literal -offset indent
                    874: %%
                    875: [ \et]+        putchar(' ');
                    876: [ \et]+$       /* ignore this token */
                    877: .Ed
                    878: .Pp
                    879: If the action contains a
                    880: .Sq { ,
                    881: then the action spans till the balancing
                    882: .Sq }
1.1       deraadt   883: is found, and the action may cross multiple lines.
1.16      jmc       884: .Nm
1.1       deraadt   885: knows about C strings and comments and won't be fooled by braces found
                    886: within them, but also allows actions to begin with
1.16      jmc       887: .Sq %{
1.1       deraadt   888: and will consider the action to be all the text up to the next
1.16      jmc       889: .Sq %}
                    890: .Pq regardless of ordinary braces inside the action .
                    891: .Pp
                    892: An action consisting solely of a vertical bar
                    893: .Pq Sq |\&
                    894: means
                    895: .Qq same as the action for the next rule .
                    896: See below for an illustration.
                    897: .Pp
                    898: Actions can include arbitrary C code,
                    899: including return statements to return a value to whatever routine called
                    900: .Fn yylex .
1.1       deraadt   901: Each time
1.16      jmc       902: .Fn yylex
                    903: is called, it continues processing tokens from where it last left off
                    904: until it either reaches the end of the file or executes a return.
                    905: .Pp
1.1       deraadt   906: Actions are free to modify
1.16      jmc       907: .Fa yytext
                    908: except for lengthening it
                    909: (adding characters to its end \- these will overwrite later characters in the
                    910: input stream).
                    911: This, however, does not apply when using
                    912: .Dq %array
                    913: .Pq see above ;
                    914: in that case,
                    915: .Fa yytext
1.1       deraadt   916: may be freely modified in any way.
1.16      jmc       917: .Pp
1.1       deraadt   918: Actions are free to modify
1.16      jmc       919: .Fa yyleng
1.1       deraadt   920: except they should not do so if the action also includes use of
1.16      jmc       921: .Fn yymore
                    922: .Pq see below .
                    923: .Pp
1.1       deraadt   924: There are a number of special directives which can be included within
                    925: an action:
1.16      jmc       926: .Bl -tag -width Ds
                    927: .It ECHO
                    928: Copies
                    929: .Fa yytext
                    930: to the scanner's output.
                    931: .It BEGIN
                    932: Followed by the name of a start condition, places the scanner in the
                    933: corresponding start condition
                    934: .Pq see below .
                    935: .It REJECT
                    936: Directs the scanner to proceed on to the
                    937: .Qq second best
                    938: rule which matched the input
                    939: .Pq or a prefix of the input .
                    940: The rule is chosen as described above in
                    941: .Sx HOW THE INPUT IS MATCHED ,
                    942: and
                    943: .Fa yytext
1.1       deraadt   944: and
1.16      jmc       945: .Fa yyleng
1.1       deraadt   946: set up appropriately.
                    947: It may either be one which matched as much text
                    948: as the originally chosen rule but came later in the
1.16      jmc       949: .Nm
1.1       deraadt   950: input file, or one which matched less text.
                    951: For example, the following will both count the
1.16      jmc       952: words in the input and call the routine
                    953: .Fn special
                    954: whenever
                    955: .Qq frob
                    956: is seen:
                    957: .Bd -literal -offset indent
                    958: int word_count = 0;
                    959: %%
                    960:
                    961: frob        special(); REJECT;
                    962: [^ \et\en]+   ++word_count;
                    963: .Ed
                    964: .Pp
1.1       deraadt   965: Without the
1.16      jmc       966: .Em REJECT ,
                    967: any occurrences of "frob" in the input would not be counted as words,
                    968: since the scanner normally executes only one action per token.
1.1       deraadt   969: Multiple uses of
1.16      jmc       970: .Em REJECT
                    971: are allowed,
                    972: each one finding the next best choice to the currently active rule.
                    973: For example, when the following scanner scans the token
                    974: .Qq abcd ,
                    975: it will write
                    976: .Qq abcdabcaba
                    977: to the output:
                    978: .Bd -literal -offset indent
                    979: %%
                    980: a        |
                    981: ab       |
                    982: abc      |
                    983: abcd     ECHO; REJECT;
                    984: \&.|\en     /* eat up any unmatched character */
                    985: .Ed
                    986: .Pp
1.1       deraadt   987: (The first three rules share the fourth's action since they use
1.16      jmc       988: the special
                    989: .Sq |\&
                    990: action.)
                    991: .Em REJECT
1.1       deraadt   992: is a particularly expensive feature in terms of scanner performance;
1.16      jmc       993: if it is used in any of the scanner's actions it will slow down
                    994: all of the scanner's matching.
                    995: Furthermore,
                    996: .Em REJECT
1.1       deraadt   997: cannot be used with the
1.16      jmc       998: .Fl Cf
1.1       deraadt   999: or
1.16      jmc      1000: .Fl CF
                   1001: options
                   1002: .Pq see below .
                   1003: .Pp
1.1       deraadt  1004: Note also that unlike the other special actions,
1.16      jmc      1005: .Em REJECT
1.1       deraadt  1006: is a
1.16      jmc      1007: .Em branch ;
                   1008: code immediately following it in the action will not be executed.
                   1009: .It yymore()
                   1010: Tells the scanner that the next time it matches a rule, the corresponding
                   1011: token should be appended onto the current value of
                   1012: .Fa yytext
                   1013: rather than replacing it.
                   1014: For example, given the input
                   1015: .Qq mega-kludge
                   1016: the following will write
                   1017: .Qq mega-mega-kludge
                   1018: to the output:
                   1019: .Bd -literal -offset indent
                   1020: %%
                   1021: mega-    ECHO; yymore();
                   1022: kludge   ECHO;
                   1023: .Ed
                   1024: .Pp
                   1025: First
                   1026: .Qq mega-
                   1027: is matched and echoed to the output.
                   1028: Then
                   1029: .Qq kludge
                   1030: is matched, but the previous
                   1031: .Qq mega-
                   1032: is still hanging around at the beginning of
                   1033: .Fa yytext
1.1       deraadt  1034: so the
1.16      jmc      1035: .Em ECHO
                   1036: for the
                   1037: .Qq kludge
                   1038: rule will actually write
                   1039: .Qq mega-kludge .
                   1040: .Pp
1.1       deraadt  1041: Two notes regarding use of
1.16      jmc      1042: .Fn yymore :
1.1       deraadt  1043: First,
1.16      jmc      1044: .Fn yymore
1.1       deraadt  1045: depends on the value of
1.16      jmc      1046: .Fa yyleng
                   1047: correctly reflecting the size of the current token, so
                   1048: .Fa yyleng
                   1049: must not be modified when using
                   1050: .Fn yymore .
1.1       deraadt  1051: Second, the presence of
1.16      jmc      1052: .Fn yymore
1.1       deraadt  1053: in the scanner's action entails a minor performance penalty in the
                   1054: scanner's matching speed.
1.16      jmc      1055: .It yyless(n)
                   1056: Returns all but the first
                   1057: .Ar n
1.1       deraadt  1058: characters of the current token back to the input stream, where they
                   1059: will be rescanned when the scanner looks for the next match.
1.16      jmc      1060: .Fa yytext
1.1       deraadt  1061: and
1.16      jmc      1062: .Fa yyleng
1.1       deraadt  1063: are adjusted appropriately (e.g.,
1.16      jmc      1064: .Fa yyleng
1.1       deraadt  1065: will now be equal to
1.16      jmc      1066: .Ar n ) .
                   1067: For example, on the input
                   1068: .Qq foobar
                   1069: the following will write out
                   1070: .Qq foobarbar :
                   1071: .Bd -literal -offset indent
                   1072: %%
                   1073: foobar    ECHO; yyless(3);
                   1074: [a-z]+    ECHO;
                   1075: .Ed
                   1076: .Pp
1.1       deraadt  1077: An argument of 0 to
1.16      jmc      1078: .Fn yyless
                   1079: will cause the entire current input string to be scanned again.
                   1080: Unless how the scanner will subsequently process its input has been changed
                   1081: (using
                   1082: .Em BEGIN ,
                   1083: for example),
                   1084: this will result in an endless loop.
                   1085: .Pp
1.1       deraadt  1086: Note that
1.16      jmc      1087: .Fn yyless
                   1088: is a macro and can only be used in the
                   1089: .Nm
                   1090: input file, not from other source files.
                   1091: .It unput(c)
                   1092: Puts the character
                   1093: .Ar c
                   1094: back into the input stream.
                   1095: It will be the next character scanned.
1.1       deraadt  1096: The following action will take the current token and cause it
                   1097: to be rescanned enclosed in parentheses:
1.16      jmc      1098: .Bd -literal -offset indent
                   1099: {
                   1100:         int i;
                   1101:         char *yycopy;
                   1102:
                   1103:         /* Copy yytext because unput() trashes yytext */
                   1104:         if ((yycopy = strdup(yytext)) == NULL)
                   1105:                 err(1, NULL);
                   1106:         unput(')');
                   1107:         for (i = yyleng - 1; i >= 0; --i)
                   1108:                 unput(yycopy[i]);
                   1109:         unput('(');
                   1110:         free(yycopy);
                   1111: }
                   1112: .Ed
                   1113: .Pp
1.1       deraadt  1114: Note that since each
1.16      jmc      1115: .Fn unput
                   1116: puts the given character back at the beginning of the input stream,
                   1117: pushing back strings must be done back-to-front.
                   1118: .Pp
1.1       deraadt  1119: An important potential problem when using
1.16      jmc      1120: .Fn unput
                   1121: is that if using
                   1122: .Dq %pointer
                   1123: .Pq the default ,
                   1124: a call to
                   1125: .Fn unput
                   1126: destroys the contents of
                   1127: .Fa yytext ,
1.1       deraadt  1128: starting with its rightmost character and devouring one character to
1.16      jmc      1129: the left with each call.
                   1130: If the value of
                   1131: .Fa yytext
                   1132: should be preserved after a call to
                   1133: .Fn unput
                   1134: .Pq as in the above example ,
                   1135: it must either first be copied elsewhere, or the scanner must be built using
                   1136: .Dq %array
                   1137: instead (see
                   1138: .Sx HOW THE INPUT IS MATCHED ) .
                   1139: .Pp
                   1140: Finally, note that EOF cannot be put back
1.1       deraadt  1141: in an attempt to mark the input stream with an end-of-file.
1.16      jmc      1142: .It input()
                   1143: Reads the next character from the input stream.
                   1144: For example, the following is one way to eat up C comments:
                   1145: .Bd -literal -offset indent
                   1146: %%
                   1147: "/*" {
                   1148:         int c;
                   1149:
                   1150:         for (;;) {
                   1151:                 while ((c = input()) != '*' && c != EOF)
                   1152:                         ; /* eat up text of comment */
                   1153:
                   1154:                 if (c == '*') {
                   1155:                         while ((c = input()) == '*')
                   1156:                                 ;
                   1157:                         if (c == '/')
                   1158:                                 break; /* found the end */
                   1159:                 }
                   1160:
                   1161:                 if (c == EOF) {
                   1162:                         errx(1, "EOF in comment");
1.1       deraadt  1163:                         break;
                   1164:                 }
1.16      jmc      1165:         }
                   1166: }
                   1167: .Ed
                   1168: .Pp
                   1169: (Note that if the scanner is compiled using C++, then
                   1170: .Fn input
1.1       deraadt  1171: is instead referred to as
1.16      jmc      1172: .Fn yyinput ,
                   1173: in order to avoid a name clash with the C++ stream by the name of input.)
                   1174: .It YY_FLUSH_BUFFER
                   1175: Flushes the scanner's internal buffer
                   1176: so that the next time the scanner attempts to match a token,
                   1177: it will first refill the buffer using
                   1178: .Dv YY_INPUT
                   1179: (see
                   1180: .Sx THE GENERATED SCANNER ,
                   1181: below).
                   1182: This action is a special case of the more general
                   1183: .Fn yy_flush_buffer
                   1184: function, described below in the section
                   1185: .Sx MULTIPLE INPUT BUFFERS .
                   1186: .It yyterminate()
                   1187: Can be used in lieu of a return statement in an action.
                   1188: It terminates the scanner and returns a 0 to the scanner's caller, indicating
                   1189: .Qq all done .
1.1       deraadt  1190: By default,
1.16      jmc      1191: .Fn yyterminate
                   1192: is also called when an end-of-file is encountered.
                   1193: It is a macro and may be redefined.
                   1194: .El
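.Pp
The back-to-front rule described for
.Fn unput
above follows from pushed-back characters behaving like a stack.
The sketch below models this in plain C without the scanner.
The names
.Fn push_char ,
.Fn next_char
and
.Fn push_string
are hypothetical stand-ins and are not part of the generated scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input: a small pushback stack. */
static char pushback[PUSHBACK_MAX];
static size_t pushback_len;

/* Like unput(): the character pushed last is the next one read. */
void
push_char(char c)
{
	assert(pushback_len < PUSHBACK_MAX);
	pushback[pushback_len++] = c;
}

/* Like input(), but returns -1 when nothing has been pushed back. */
int
next_char(void)
{
	if (pushback_len == 0)
		return -1;
	return (unsigned char)pushback[--pushback_len];
}

/* Push a whole string back so that it is read in its original order. */
void
push_string(const char *s)
{
	size_t i;

	/* Walk from the last character to the first. */
	for (i = strlen(s); i > 0; i--)
		push_char(s[i - 1]);
}
```

Because the last character pushed is the first one read,
walking the string from its end leaves its first character on top.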
                   1195: .Sh THE GENERATED SCANNER
1.1       deraadt  1196: The output of
1.16      jmc      1197: .Nm
1.1       deraadt  1198: is the file
1.16      jmc      1199: .Pa lex.yy.c ,
1.1       deraadt  1200: which contains the scanning routine
1.16      jmc      1201: .Fn yylex ,
                   1202: a number of tables used by it for matching tokens,
                   1203: and a number of auxiliary routines and macros.
                   1204: By default,
                   1205: .Fn yylex
1.1       deraadt  1206: is declared as follows:
1.16      jmc      1207: .Bd -unfilled -offset indent
                   1208: int yylex()
                   1209: {
                   1210:     ... various definitions and the actions in here ...
                   1211: }
                   1212: .Ed
                   1213: .Pp
                   1214: (If the environment supports function prototypes, then it will
                   1215: be "int yylex(void)".)
                   1216: This definition may be changed by defining the
                   1217: .Dv YY_DECL
                   1218: macro.
                   1219: For example:
                   1220: .Bd -literal -offset indent
                   1221: #define YY_DECL float lexscan(a, b) float a, b;
                   1222: .Ed
                   1223: .Pp
                   1224: would give the scanning routine the name
                   1225: .Em lexscan ,
                   1226: returning a float, and taking two floats as arguments.
                   1227: Note that if arguments are given to the scanning routine using a
                   1228: K&R-style/non-prototyped function declaration,
                   1229: the definition must be terminated with a semicolon
                   1230: .Pq Sq ;\& .
                   1231: .Pp
1.1       deraadt  1232: Whenever
1.16      jmc      1233: .Fn yylex
1.1       deraadt  1234: is called, it scans tokens from the global input file
1.16      jmc      1235: .Em yyin
                   1236: .Pq which defaults to stdin .
                   1237: It continues until it either reaches an end-of-file
                   1238: .Pq at which point it returns the value 0
                   1239: or one of its actions executes a
                   1240: .Em return
1.1       deraadt  1241: statement.
1.16      jmc      1242: .Pp
1.1       deraadt  1243: If the scanner reaches an end-of-file, subsequent calls are undefined
                   1244: unless either
1.16      jmc      1245: .Em yyin
                   1246: is pointed at a new input file
                   1247: .Pq in which case scanning continues from that file ,
                   1248: or
                   1249: .Fn yyrestart
1.1       deraadt  1250: is called.
1.16      jmc      1251: .Fn yyrestart
1.1       deraadt  1252: takes one argument, a
1.16      jmc      1253: .Fa FILE *
                   1254: pointer (which can be NULL, if
                   1255: .Dv YY_INPUT
                   1256: has been set up to scan from a source other than
                   1257: .Em yyin ) ,
1.1       deraadt  1258: and initializes
1.16      jmc      1259: .Em yyin
                   1260: for scanning from that file.
                   1261: Essentially there is no difference between just assigning
                   1262: .Em yyin
1.1       deraadt  1263: to a new input file or using
1.16      jmc      1264: .Fn yyrestart
                   1265: to do so; the latter is available for compatibility with previous versions of
                   1266: .Nm ,
1.1       deraadt  1267: and because it can be used to switch input files in the middle of scanning.
1.16      jmc      1268: It can also be used to throw away the current input buffer,
                   1269: by calling it with an argument of
                   1270: .Em yyin ;
1.1       deraadt  1271: but it is better to use
1.16      jmc      1272: .Dv YY_FLUSH_BUFFER
                   1273: .Pq see above .
1.1       deraadt  1274: Note that
1.16      jmc      1275: .Fn yyrestart
                   1276: does not reset the start condition to
                   1277: .Em INITIAL
                   1278: (see
                   1279: .Sx START CONDITIONS ,
                   1280: below).
                   1281: .Pp
1.1       deraadt  1282: If
1.16      jmc      1283: .Fn yylex
1.1       deraadt  1284: stops scanning due to executing a
1.16      jmc      1285: .Em return
1.1       deraadt  1286: statement in one of the actions, the scanner may then be called again and it
                   1287: will resume scanning where it left off.
1.16      jmc      1288: .Pp
                   1289: By default
                   1290: .Pq and for purposes of efficiency ,
                   1291: the scanner uses block-reads rather than simple
                   1292: .Xr getc 3
1.1       deraadt  1293: calls to read characters from
1.16      jmc      1294: .Em yyin .
1.1       deraadt  1295: The nature of how it gets its input can be controlled by defining the
1.16      jmc      1296: .Dv YY_INPUT
1.1       deraadt  1297: macro.
1.16      jmc      1298: .Dv YY_INPUT Ns 's
                   1299: calling sequence is
                   1300: .Qq YY_INPUT(buf,result,max_size) .
                   1301: Its action is to place up to
                   1302: .Dv max_size
1.1       deraadt  1303: characters in the character array
1.16      jmc      1304: .Em buf
1.1       deraadt  1305: and return in the integer variable
1.16      jmc      1306: .Em result
                   1307: either the number of characters read or the constant
                   1308: .Dv YY_NULL
                   1309: (0 on
                   1310: .Ux
                   1311: systems)
                   1312: to indicate
                   1313: .Dv EOF .
                   1314: The default
                   1315: .Dv YY_INPUT
                   1316: reads from the global file-pointer
                   1317: .Qq yyin .
                   1318: .Pp
                   1319: A sample definition of
                   1320: .Dv YY_INPUT
                   1321: .Pq in the definitions section of the input file :
                   1322: .Bd -unfilled -offset indent
                   1323: %{
                   1324: #define YY_INPUT(buf,result,max_size) \e
                   1325: { \e
                   1326:         int c = getchar(); \e
                   1327:         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
                   1328: }
                   1329: %}
                   1330: .Ed
                   1331: .Pp
1.1       deraadt  1332: This definition will change the input processing to occur
                   1333: one character at a time.
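.Pp
As a rough illustration of the contract described above,
the following C sketch fills a buffer from an in-memory string
instead of a file.
The names
.Va sketch_src ,
.Va sketch_pos ,
and
.Fn sketch_read
are hypothetical and not part of
.Nm ;
only the
.Dv YY_INPUT
calling sequence is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source standing in for yyin. */
static const char *sketch_src = "abcdefg";
static size_t sketch_pos;

/*
 * Copy up to max_size characters into buf and return how many were
 * copied.  A return of 0 (the value of YY_NULL on UNIX systems)
 * signals end of input.
 */
static int
sketch_read(char *buf, int max_size)
{
        size_t left, n;

        assert(buf != NULL && max_size > 0);
        left = strlen(sketch_src) - sketch_pos;
        n = left < (size_t)max_size ? left : (size_t)max_size;
        memcpy(buf, sketch_src + sketch_pos, n);
        sketch_pos += n;
        return (int)n;
}

/* Same calling sequence as in the text: result receives the count. */
#define YY_INPUT(buf, result, max_size) ((result) = sketch_read((buf), (max_size)))
```

Unlike the one-character definition, this variant hands back as many
characters as fit in each call, which keeps the block-read behaviour.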
1.16      jmc      1334: .Pp
                   1335: When the scanner receives an end-of-file indication from
                   1336: .Dv YY_INPUT ,
1.1       deraadt  1337: it then checks the
1.16      jmc      1338: .Fn yywrap
                   1339: function.
                   1340: If
                   1341: .Fn yywrap
                   1342: returns false
                   1343: .Pq zero ,
                   1344: then it is assumed that the function has gone ahead and set up
                   1345: .Em yyin
                   1346: to point to another input file, and scanning continues.
                   1347: If it returns true
                   1348: .Pq non-zero ,
                   1349: then the scanner terminates, returning 0 to its caller.
                   1350: Note that in either case, the start condition remains unchanged;
                   1351: it does not revert to
                   1352: .Em INITIAL .
                   1353: .Pp
1.1       deraadt  1354: If you do not supply your own version of
1.16      jmc      1355: .Fn yywrap ,
1.1       deraadt  1356: then you must either use
1.16      jmc      1357: .Dq %option noyywrap
1.1       deraadt  1358: (in which case the scanner behaves as though
1.16      jmc      1359: .Fn yywrap
1.1       deraadt  1360: returned 1), or you must link with
1.16      jmc      1361: .Fl lfl
1.1       deraadt  1362: to obtain the default version of the routine, which always returns 1.
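.Pp
A user-supplied
.Fn yywrap
that moves on to the next input file might look like the following sketch.
The variables
.Va sketch_files
and
.Va sketch_nfiles
are assumptions made for illustration, and
.Va yyin
is declared here only so that the fragment compiles on its own;
in a real scanner the declaration comes from the generated code.

```c
#include <assert.h>
#include <stdio.h>

FILE *yyin;     /* stand-in; normally provided by the generated scanner */

/* Hypothetical list of input files still to be scanned. */
static const char **sketch_files;
static int sketch_nfiles;

/*
 * Return 0 after pointing yyin at the next file that can be opened,
 * or 1 when no files remain and the scanner should terminate.
 */
int
yywrap(void)
{
        assert(sketch_nfiles >= 0);
        while (sketch_nfiles > 0) {
                const char *name = *sketch_files++;

                sketch_nfiles--;
                if (yyin != NULL && yyin != stdin)
                        fclose(yyin);
                yyin = fopen(name, "r");
                if (yyin != NULL)
                        return 0;       /* more input: keep scanning */
        }
        return 1;                       /* nothing left */
}
```

Files that cannot be opened are silently skipped in this sketch;
a real program would probably want to report them.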
1.16      jmc      1363: .Pp
1.1       deraadt  1364: Three routines are available for scanning from in-memory buffers rather
                   1365: than files:
1.16      jmc      1366: .Fn yy_scan_string ,
                   1367: .Fn yy_scan_bytes ,
1.1       deraadt  1368: and
1.16      jmc      1369: .Fn yy_scan_buffer .
                   1370: See the discussion of them below in the section
                   1371: .Sx MULTIPLE INPUT BUFFERS .
                   1372: .Pp
1.1       deraadt  1373: The scanner writes its
1.16      jmc      1374: .Em ECHO
1.1       deraadt  1375: output to the
1.16      jmc      1376: .Em yyout
                   1377: global
                   1378: .Pq default, stdout ,
                   1379: which may be redefined by the user simply by assigning it to some other
                   1380: .Va FILE
1.1       deraadt  1381: pointer.
1.16      jmc      1382: .Sh START CONDITIONS
                   1383: .Nm
                   1384: provides a mechanism for conditionally activating rules.
                   1385: Any rule whose pattern is prefixed with
                   1386: .Qq Aq sc
                   1387: will only be active when the scanner is in the start condition named
                   1388: .Qq sc .
                   1389: For example,
                   1390: .Bd -literal -offset indent
                   1391: <STRING>[^"]* { /* eat up the string body ... */
                   1392:         ...
                   1393: }
                   1394: .Ed
                   1395: .Pp
                   1396: will be active only when the scanner is in the
                   1397: .Qq STRING
                   1398: start condition, and
                   1399: .Bd -literal -offset indent
                   1400: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
                   1401:         ...
                   1402: }
                   1403: .Ed
                   1404: .Pp
                   1405: will be active only when the current start condition is either
                   1406: .Qq INITIAL ,
                   1407: .Qq STRING ,
                   1408: or
                   1409: .Qq QUOTE .
                   1410: .Pp
                   1411: Start conditions are declared in the definitions
                   1412: .Pq first
                   1413: section of the input using unindented lines beginning with either
                   1414: .Sq %s
1.1       deraadt  1415: or
1.16      jmc      1416: .Sq %x
1.1       deraadt  1417: followed by a list of names.
                   1418: The former declares
1.16      jmc      1419: .Em inclusive
1.1       deraadt  1420: start conditions, the latter
1.16      jmc      1421: .Em exclusive
                   1422: start conditions.
                   1423: A start condition is activated using the
                   1424: .Em BEGIN
                   1425: action.
                   1426: Until the next
                   1427: .Em BEGIN
                   1428: action is executed, rules with the given start condition will be active and
1.1       deraadt  1429: rules with other start conditions will be inactive.
1.16      jmc      1430: If the start condition is inclusive,
1.1       deraadt  1431: then rules with no start conditions at all will also be active.
1.16      jmc      1432: If it is exclusive,
                   1433: then only rules qualified with the start condition will be active.
1.1       deraadt  1434: A set of rules contingent on the same exclusive start condition
                   1435: describes a scanner which is independent of any of the other rules in the
1.16      jmc      1436: .Nm
                   1437: input.
                   1438: Because of this, exclusive start conditions make it easy to specify
                   1439: .Qq mini-scanners
1.1       deraadt  1440: which scan portions of the input that are syntactically different
1.16      jmc      1441: from the rest
                   1442: .Pq e.g., comments .
                   1443: .Pp
1.1       deraadt  1444: If the distinction between inclusive and exclusive start conditions
                   1445: is still a little vague, here's a simple example illustrating the
1.16      jmc      1446: connection between the two.
                   1447: The set of rules:
                   1448: .Bd -literal -offset indent
                   1449: %s example
                   1450: %%
                   1451:
                   1452: <example>foo   do_something();
                   1453:
                   1454: bar            something_else();
                   1455: .Ed
                   1456: .Pp
1.1       deraadt  1457: is equivalent to
1.16      jmc      1458: .Bd -literal -offset indent
                   1459: %x example
                   1460: %%
                   1461:
                   1462: <example>foo   do_something();
                   1463:
                   1464: <INITIAL,example>bar    something_else();
                   1465: .Ed
                   1466: .Pp
1.1       deraadt  1467: Without the
1.16      jmc      1468: .Aq INITIAL,example
1.1       deraadt  1469: qualifier, the
1.16      jmc      1470: .Dq bar
                   1471: pattern in the second example wouldn't be active
                   1472: .Pq i.e., couldn't match
1.1       deraadt  1473: when in start condition
1.16      jmc      1474: .Dq example .
1.1       deraadt  1475: If we just used
1.16      jmc      1476: .Aq example
1.1       deraadt  1477: to qualify
1.16      jmc      1478: .Dq bar ,
1.1       deraadt  1479: though, then it would only be active in
1.16      jmc      1480: .Dq example
1.1       deraadt  1481: and not in
1.16      jmc      1482: .Em INITIAL ,
                   1483: while in the first example it's active in both,
                   1484: because in the first example the
                   1485: .Dq example
                   1486: start condition is an inclusive
                   1487: .Pq Sq %s
1.1       deraadt  1488: start condition.
1.16      jmc      1489: .Pp
1.1       deraadt  1490: Also note that the special start-condition specifier
1.16      jmc      1491: .Sq Aq *
                   1492: matches every start condition.
                   1493: Thus, the above example could also have been written:
                   1494: .Bd -literal -offset indent
                   1495: %x example
                   1496: %%
                   1497:
                   1498: <example>foo   do_something();
                   1499:
                   1500: <*>bar         something_else();
                   1501: .Ed
                   1502: .Pp
1.1       deraadt  1503: The default rule (to
1.16      jmc      1504: .Em ECHO
                   1505: any unmatched character) remains active in start conditions.
                   1506: It is equivalent to:
                   1507: .Bd -literal -offset indent
                   1508: <*>.|\en     ECHO;
                   1509: .Ed
                   1510: .Pp
                   1511: .Dq BEGIN(0)
1.1       deraadt  1512: returns to the original state where only the rules with
1.16      jmc      1513: no start conditions are active.
                   1514: This state can also be referred to as the start-condition
                   1515: .Em INITIAL ,
                   1516: so
                   1517: .Dq BEGIN(INITIAL)
1.1       deraadt  1518: is equivalent to
1.16      jmc      1519: .Dq BEGIN(0) .
1.1       deraadt  1520: (The parentheses around the start condition name are not required but
                   1521: are considered good style.)
1.16      jmc      1522: .Pp
                   1523: .Em BEGIN
1.1       deraadt  1524: actions can also be given as indented code at the beginning
1.16      jmc      1525: of the rules section.
                   1526: For example, the following will cause the scanner to enter the
                   1527: .Qq SPECIAL
                   1528: start condition whenever
                   1529: .Fn yylex
1.1       deraadt  1530: is called and the global variable
1.16      jmc      1531: .Fa enter_special
1.1       deraadt  1532: is true:
1.16      jmc      1533: .Bd -literal -offset indent
                   1534:         int enter_special;
1.1       deraadt  1535:
1.16      jmc      1536: %x SPECIAL
                   1537: %%
                   1538:         if (enter_special)
1.1       deraadt  1539:                 BEGIN(SPECIAL);
                   1540:
1.16      jmc      1541: <SPECIAL>blahblahblah
                   1542: \&...more rules follow...
                   1543: .Ed
                   1544: .Pp
1.1       deraadt  1545: To illustrate the uses of start conditions,
                   1546: here is a scanner which provides two different interpretations
1.16      jmc      1547: of a string like
                   1548: .Qq 123.456 .
                   1549: By default it will treat it as three tokens: the integer
                   1550: .Qq 123 ,
                   1551: a dot
                   1552: .Pq Sq .\& ,
                   1553: and the integer
                   1554: .Qq 456 .
1.1       deraadt  1555: But if the string is preceded earlier in the line by the string
1.16      jmc      1556: .Qq expect-floats
                   1557: it will treat it as a single token, the floating-point number 123.456:
                   1558: .Bd -literal -offset indent
                   1559: %{
                   1560: #include <stdlib.h>
                   1561: %}
                   1562: %s expect
                   1563:
                   1564: %%
                   1565: expect-floats        BEGIN(expect);
                   1566:
                   1567: <expect>[0-9]+"."[0-9]+ {
                   1568:         printf("found a float, = %f\en",
                   1569:             atof(yytext));
                   1570: }
                   1571: <expect>\en {
                   1572:         /*
                   1573:          * That's the end of the line, so
                   1574:          * we need another "expect-floats"
                   1575:          * before we'll recognize any more
                   1576:          * numbers.
                   1577:          */
                   1578:         BEGIN(INITIAL);
                   1579: }
                   1580:
                   1581: [0-9]+ {
                   1582:         printf("found an integer, = %d\en",
                   1583:             atoi(yytext));
                   1584: }
                   1585:
                   1586: "."     printf("found a dot\en");
                   1587: .Ed
                   1588: .Pp
                   1589: Here is a scanner which recognizes
                   1590: .Pq and discards
                   1591: C comments while maintaining a count of the current input line:
                   1592: .Bd -literal -offset indent
                   1593: %x comment
                   1594: %%
                   1595:         int line_num = 1;
                   1596:
                   1597: "/*"                    BEGIN(comment);
                   1598:
                   1599: <comment>[^*\en]*        /* eat anything that's not a '*' */
                   1600: <comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
                   1601: <comment>\en             ++line_num;
                   1602: <comment>"*"+"/"        BEGIN(INITIAL);
                   1603: .Ed
                   1604: .Pp
1.1       deraadt  1605: This scanner goes to a bit of trouble to match as much
1.16      jmc      1606: text as possible with each rule.
                   1607: In general, when attempting to write a high-speed scanner,
                   1608: try to match as much as possible in each rule, as it's a big win.
                   1609: .Pp
1.10      deraadt  1610: Note that start-condition names are really integer values and
1.16      jmc      1611: can be stored as such.
                   1612: Thus, the above could be extended in the following fashion:
                   1613: .Bd -literal -offset indent
                   1614: %x comment foo
                   1615: %%
                   1616:         int line_num = 1;
                   1617:         int comment_caller;
                   1618:
                   1619: "/*" {
                   1620:         comment_caller = INITIAL;
                   1621:         BEGIN(comment);
                   1622: }
                   1623:
                   1624: \&...
                   1625:
                   1626: <foo>"/*" {
                   1627:         comment_caller = foo;
                   1628:         BEGIN(comment);
                   1629: }
                   1630:
                   1631: <comment>[^*\en]*        /* eat anything that's not a '*' */
                   1632: <comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
                   1633: <comment>\en             ++line_num;
                   1634: <comment>"*"+"/"        BEGIN(comment_caller);
                   1635: .Ed
                   1636: .Pp
                   1637: Furthermore, the current start condition can be accessed by using
1.1       deraadt  1638: the integer-valued
1.16      jmc      1639: .Dv YY_START
                   1640: macro.
                   1641: For example, the above assignments to
                   1642: .Em comment_caller
1.1       deraadt  1643: could instead be written
1.16      jmc      1644: .Pp
                   1645: .Dl comment_caller = YY_START;
                   1646: .Pp
1.1       deraadt  1647: Flex provides
1.16      jmc      1648: .Dv YYSTATE
1.1       deraadt  1649: as an alias for
1.16      jmc      1650: .Dv YY_START
1.36      schwarze 1651: (since that is what's used by
                   1652: .At
1.16      jmc      1653: .Nm lex ) .
                   1654: .Pp
                   1655: Note that start conditions do not have their own name-space;
                   1656: %s's and %x's declare names in the same fashion as #define's.
                   1657: .Pp
1.1       deraadt  1658: Finally, here's an example of how to match C-style quoted strings using
1.16      jmc      1659: exclusive start conditions, including expanded escape sequences
                   1660: (but not including checking for a string that's too long):
                   1661: .Bd -literal -offset indent
                   1662: %x str
                   1663:
                   1664: %%
                   1665:         #define MAX_STR_CONST 1024
                   1666:         char string_buf[MAX_STR_CONST];
                   1667:         char *string_buf_ptr;
                   1668:
                   1669: \e"      string_buf_ptr = string_buf; BEGIN(str);
                   1670:
                   1671: <str>\e" { /* saw closing quote - all done */
                   1672:         BEGIN(INITIAL);
                   1673:         *string_buf_ptr = '\e0';
                   1674:         /*
                   1675:          * return string constant token type and
                   1676:          * value to parser
                   1677:          */
                   1678: }
                   1679:
                   1680: <str>\en {
                   1681:         /* error - unterminated string constant */
                   1682:         /* generate error message */
                   1683: }
                   1684:
                   1685: <str>\e\e[0-7]{1,3} {
                   1686:         /* octal escape sequence */
                   1687:         unsigned int result;
                   1688:
                   1689:         (void) sscanf(yytext + 1, "%o", &result);
                   1690:
                   1691:         if (result > 0xff) {
                   1692:                 /* error, constant is out-of-bounds */
                   1693:         } else
                   1694:                 *string_buf_ptr++ = result;
                   1695: }
                   1696:
                   1697: <str>\e\e[0-9]+ {
                   1698:         /*
                   1699:          * generate error - bad escape sequence; something
                   1700:          * like '\e48' or '\e0777777'
                   1701:          */
                   1702: }
                   1703:
                   1704: <str>\e\en  *string_buf_ptr++ = '\en';
                   1705: <str>\e\et  *string_buf_ptr++ = '\et';
                   1706: <str>\e\er  *string_buf_ptr++ = '\er';
                   1707: <str>\e\eb  *string_buf_ptr++ = '\eb';
                   1708: <str>\e\ef  *string_buf_ptr++ = '\ef';
                   1709:
                   1710: <str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];
                   1711:
                   1712: <str>[^\e\e\en\e"]+ {
                   1713:         char *yptr = yytext;
                   1714:
                   1715:         while (*yptr)
                   1716:                 *string_buf_ptr++ = *yptr++;
                   1717: }
                   1718: .Ed
                   1719: .Pp
                   1720: Often, such as in some of the examples above,
                   1721: a whole bunch of rules are all preceded by the same start condition(s).
                   1722: .Nm
1.1       deraadt  1723: makes this a little easier and cleaner by introducing a notion of
                   1724: start condition
1.16      jmc      1725: .Em scope .
1.1       deraadt  1726: A start condition scope is begun with:
1.16      jmc      1727: .Pp
                   1728: .Dl <SCs>{
                   1729: .Pp
1.1       deraadt  1730: where
1.16      jmc      1731: .Dq SCs
                   1732: is a list of one or more start conditions.
                   1733: Inside the start condition scope, every rule automatically has the prefix
                   1734: .Aq SCs
1.1       deraadt  1735: applied to it, until a
1.16      jmc      1736: .Sq }
1.1       deraadt  1737: which matches the initial
1.16      jmc      1738: .Sq { .
1.1       deraadt  1739: So, for example,
1.16      jmc      1740: .Bd -literal -offset indent
                   1741: <ESC>{
                   1742:     "\e\en"   return '\en';
                   1743:     "\e\er"   return '\er';
                   1744:     "\e\ef"   return '\ef';
                   1745:     "\e\e0"   return '\e0';
                   1746: }
                   1747: .Ed
                   1748: .Pp
1.1       deraadt  1749: is equivalent to:
1.16      jmc      1750: .Bd -literal -offset indent
                   1751: <ESC>"\e\en"  return '\en';
                   1752: <ESC>"\e\er"  return '\er';
                   1753: <ESC>"\e\ef"  return '\ef';
                   1754: <ESC>"\e\e0"  return '\e0';
                   1755: .Ed
                   1756: .Pp
1.1       deraadt  1757: Start condition scopes may be nested.
1.16      jmc      1758: .Pp
1.1       deraadt  1759: Three routines are available for manipulating stacks of start conditions:
1.16      jmc      1760: .Bl -tag -width Ds
                   1761: .It void yy_push_state(int new_state)
                   1762: Pushes the current start condition onto the top of the start condition
1.1       deraadt  1763: stack and switches to
1.16      jmc      1764: .Fa new_state
                   1765: as though
                   1766: .Dq BEGIN new_state
                   1767: had been used
                   1768: .Pq recall that start condition names are also integers .
                   1769: .It void yy_pop_state()
                   1770: Pops the top of the stack and switches to it via
                   1771: .Em BEGIN .
                   1772: .It int yy_top_state()
                   1773: Returns the top of the stack without altering the stack's contents.
                   1774: .El
                   1775: .Pp
1.1       deraadt  1776: The start condition stack grows dynamically and so has no built-in
1.16      jmc      1777: size limitation.
                   1778: If memory is exhausted, program execution aborts.
                   1779: .Pp
                   1780: To use start condition stacks, scanners must include a
                   1781: .Dq %option stack
                   1782: directive (see
                   1783: .Sx OPTIONS
                   1784: below).
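.Pp
The behaviour of the three stack routines can be modelled with a small C sketch.
This is an illustrative model only, not the code that
.Nm
generates; every
.Va sketch_
name is hypothetical, and
.Va sketch_start
stands in for the scanner's current start condition.

```c
#include <assert.h>
#include <stdlib.h>

static int sketch_start;        /* current start condition; 0 is INITIAL */
static int *sketch_stack;       /* saved start conditions */
static size_t sketch_depth, sketch_cap;

/* Save the current start condition, then switch to new_state. */
static void
sketch_push_state(int new_state)
{
        if (sketch_depth == sketch_cap) {
                size_t ncap = sketch_cap ? sketch_cap * 2 : 8;
                int *p = realloc(sketch_stack, ncap * sizeof(*p));

                if (p == NULL)
                        abort();        /* memory exhausted */
                sketch_stack = p;
                sketch_cap = ncap;
        }
        sketch_stack[sketch_depth++] = sketch_start;
        sketch_start = new_state;       /* as though BEGIN new_state */
}

/* Switch back to the most recently saved start condition. */
static void
sketch_pop_state(void)
{
        assert(sketch_depth > 0);
        sketch_start = sketch_stack[--sketch_depth];
}

/* Report the most recently saved start condition without popping it. */
static int
sketch_top_state(void)
{
        assert(sketch_depth > 0);
        return sketch_stack[sketch_depth - 1];
}
```

Growing the array on demand mirrors the statement that the stack has
no built-in size limit and that the program aborts when memory runs out.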
                   1785: .Sh MULTIPLE INPUT BUFFERS
                   1786: Some scanners
                   1787: (such as those which support
                   1788: .Qq include
                   1789: files)
                   1790: require reading from several input streams.
                   1791: As
                   1792: .Nm
1.1       deraadt  1793: scanners do a large amount of buffering, one cannot control
                   1794: where the next input will be read from by simply writing a
1.16      jmc      1795: .Dv YY_INPUT
1.1       deraadt  1796: which is sensitive to the scanning context.
1.16      jmc      1797: .Dv YY_INPUT
1.1       deraadt  1798: is only called when the scanner reaches the end of its buffer, which
1.16      jmc      1799: may be a long time after scanning a statement such as an
                   1800: .Qq include
1.1       deraadt  1801: which requires switching the input source.
1.16      jmc      1802: .Pp
1.1       deraadt  1803: To negotiate these sorts of problems,
1.16      jmc      1804: .Nm
1.1       deraadt  1805: provides a mechanism for creating and switching between multiple
1.16      jmc      1806: input buffers.
                   1807: An input buffer is created by using:
                   1808: .Pp
                   1809: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
                   1810: .Pp
1.1       deraadt  1811: which takes a
1.16      jmc      1812: .Fa FILE
                   1813: pointer and a
                   1814: .Fa size
                   1815: and creates a buffer associated with the given file and large enough to hold
                   1816: .Fa size
1.1       deraadt  1817: characters (when in doubt, use
1.16      jmc      1818: .Dv YY_BUF_SIZE
                   1819: for the size).
                   1820: It returns a
                   1821: .Dv YY_BUFFER_STATE
                   1822: handle, which may then be passed to other routines
                   1823: .Pq see below .
                   1824: The
                   1825: .Dv YY_BUFFER_STATE
1.1       deraadt  1826: type is a pointer to an opaque
1.16      jmc      1827: .Dq struct yy_buffer_state
                   1828: structure, so
                   1829: .Dv YY_BUFFER_STATE
                   1830: variables may be safely initialized to
                   1831: .Dq ((YY_BUFFER_STATE) 0)
                   1832: if desired, and the opaque structure can also be referred to in order to
                   1833: correctly declare input buffers in source files other than that of scanners.
                   1834: Note that the
                   1835: .Fa FILE
1.1       deraadt  1836: pointer in the call to
1.16      jmc      1837: .Fn yy_create_buffer
1.1       deraadt  1838: is only used as the value of
1.16      jmc      1839: .Fa yyin
1.1       deraadt  1840: seen by
1.16      jmc      1841: .Dv YY_INPUT ;
                   1842: if
                   1843: .Dv YY_INPUT
                   1844: is redefined so that it no longer uses
                   1845: .Fa yyin ,
                   1846: then a nil
                   1847: .Fa FILE
                   1848: pointer can safely be passed to
                   1849: .Fn yy_create_buffer .
                   1850: To select a particular buffer to scan:
                   1851: .Pp
                   1852: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
                   1853: .Pp
                   1854: It switches the scanner's input buffer so subsequent tokens will
1.1       deraadt  1855: come from
1.16      jmc      1856: .Fa new_buffer .
1.1       deraadt  1857: Note that
1.16      jmc      1858: .Fn yy_switch_to_buffer
                   1859: may be used by
                   1860: .Fn yywrap
                   1861: to set things up for continued scanning,
                   1862: instead of opening a new file and pointing
                   1863: .Fa yyin
                   1864: at it.
                   1865: Note also that switching input sources via either
                   1866: .Fn yy_switch_to_buffer
                   1867: or
                   1868: .Fn yywrap
                   1869: does not change the start condition.
                   1870: .Pp
                   1871: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
                   1872: .Pp
                   1873: is used to reclaim the storage associated with a buffer.
                   1874: .Pf ( Fa buffer
1.1       deraadt  1875: can be nil, in which case the routine does nothing.)
1.16      jmc      1876: To clear the current contents of a buffer:
                   1877: .Pp
                   1878: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
                   1879: .Pp
1.1       deraadt  1880: This function discards the buffer's contents,
1.16      jmc      1881: so the next time the scanner attempts to match a token from the buffer,
                   1882: it will first fill the buffer anew using
                   1883: .Dv YY_INPUT .
                   1884: .Pp
                   1885: .Fn yy_new_buffer
1.1       deraadt  1886: is an alias for
1.16      jmc      1887: .Fn yy_create_buffer ,
1.1       deraadt  1888: provided for compatibility with the C++ use of
1.16      jmc      1889: .Em new
1.1       deraadt  1890: and
1.16      jmc      1891: .Em delete
1.1       deraadt  1892: for creating and destroying dynamic objects.
1.16      jmc      1893: .Pp
1.1       deraadt  1894: Finally, the
1.16      jmc      1895: .Dv YY_CURRENT_BUFFER
1.1       deraadt  1896: macro returns a
1.16      jmc      1897: .Dv YY_BUFFER_STATE
1.1       deraadt  1898: handle to the current buffer.
1.16      jmc      1899: .Pp
1.1       deraadt  1900: Here is an example of using these features for writing a scanner
                   1901: which expands include files (the
1.16      jmc      1902: .Aq Aq EOF
1.1       deraadt  1903: feature is discussed below):
1.16      jmc      1904: .Bd -literal -offset indent
                   1905: /*
                   1906:  * the "incl" state is used for picking up the name
                   1907:  * of an include file
                   1908:  */
                   1909: %x incl
                   1910:
                   1911: %{
                         #include <err.h>
                   1912: #define MAX_INCLUDE_DEPTH 10
                   1913: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
                   1914: int include_stack_ptr = 0;
                   1915: %}
                   1916:
                   1917: %%
                   1918: include             BEGIN(incl);
                   1919:
                   1920: [a-z]+              ECHO;
                   1921: [^a-z\en]*\en?        ECHO;
                   1922:
                   1923: <incl>[ \et]*        /* eat the whitespace */
                   1924: <incl>[^ \et\en]+ {   /* got the include file name */
                   1925:         if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
                   1926:                 errx(1, "Includes nested too deeply");
                   1927:
                   1928:         include_stack[include_stack_ptr++] =
                   1929:             YY_CURRENT_BUFFER;
                   1930:
                   1931:         yyin = fopen(yytext, "r");
                   1932:
                   1933:         if (yyin == NULL)
                   1934:                 err(1, NULL);
1.1       deraadt  1935:
1.16      jmc      1936:         yy_switch_to_buffer(
                   1937:             yy_create_buffer(yyin, YY_BUF_SIZE));
1.1       deraadt  1938:
1.16      jmc      1939:         BEGIN(INITIAL);
                   1940: }
1.1       deraadt  1941:
1.16      jmc      1942: <<EOF>> {
                   1943:         if (--include_stack_ptr < 0)
1.1       deraadt  1944:                 yyterminate();
1.16      jmc      1945:         else {
                   1946:                 yy_delete_buffer(YY_CURRENT_BUFFER);
1.1       deraadt  1947:                 yy_switch_to_buffer(
1.16      jmc      1948:                     include_stack[include_stack_ptr]);
                   1949:         }
                   1950: }
                   1951: .Ed
                   1952: .Pp
1.1       deraadt  1953: Three routines are available for setting up input buffers for
1.16      jmc      1954: scanning in-memory strings instead of files.
                   1955: All of them create a new input buffer for scanning the string,
                   1956: and return a corresponding
                   1957: .Dv YY_BUFFER_STATE
                   1958: handle (which should be deleted afterwards using
                   1959: .Fn yy_delete_buffer ) .
                   1960: They also switch to the new buffer using
                   1961: .Fn yy_switch_to_buffer ,
1.1       deraadt  1962: so the next call to
1.16      jmc      1963: .Fn yylex
1.1       deraadt  1964: will start scanning the string.
1.16      jmc      1965: .Bl -tag -width Ds
                   1966: .It yy_scan_string(const char *str)
                   1967: Scans a NUL-terminated string.
                   1968: .It yy_scan_bytes(const char *bytes, int len)
                   1969: Scans
                   1970: .Fa len
                   1971: bytes
                   1972: .Pq including possibly NUL's
1.1       deraadt  1973: starting at location
1.16      jmc      1974: .Fa bytes .
                   1975: .El
                   1976: .Pp
                   1977: Note that both of these functions create and scan a copy
                   1978: of the string or bytes.
                   1979: (This may be desirable, since
                   1980: .Fn yylex
                   1981: modifies the contents of the buffer it is scanning.)
                   1982: The copy can be avoided by using:
                   1983: .Bl -tag -width Ds
                   1984: .It yy_scan_buffer(char *base, yy_size_t size)
                   1985: Which scans the buffer starting at
                   1986: .Fa base ,
1.1       deraadt  1987: consisting of
1.16      jmc      1988: .Fa size
                   1989: bytes, the last two bytes of which must be
                   1990: .Dv YY_END_OF_BUFFER_CHAR
                   1991: .Pq ASCII NUL .
                   1992: These last two bytes are not scanned; thus, scanning consists of
                   1993: base[0] through base[size-3], inclusive.
                   1994: .Pp
                   1995: If
                   1996: .Fa base
                   1997: is not set up in this manner
                   1998: (i.e., without the final two
                   1999: .Dv YY_END_OF_BUFFER_CHAR
1.1       deraadt  2000: bytes), then
1.16      jmc      2001: .Fn yy_scan_buffer
1.1       deraadt  2002: returns a NULL pointer instead of creating a new input buffer.
1.16      jmc      2003: .Pp
1.1       deraadt  2004: The type
1.16      jmc      2005: .Fa yy_size_t
                   2006: is an integral type to which an integer expression
1.1       deraadt  2007: reflecting the size of the buffer can be cast.
1.16      jmc      2008: .El
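The buffer layout that yy_scan_buffer() expects can be prepared with a small helper.
The following is a minimal sketch only: make_scan_buffer() is a hypothetical name, not part of flex, and it assumes the input is an ordinary NUL-terminated string.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of flex): copy a NUL-terminated string
 * into a freshly allocated buffer that ends in the two NUL bytes
 * (YY_END_OF_BUFFER_CHAR) required by yy_scan_buffer().
 * On success, *size_out holds the total buffer size, including the
 * two terminating bytes.  Returns NULL if allocation fails.
 */
static char *
make_scan_buffer(const char *str, size_t *size_out)
{
	size_t len = strlen(str);
	char *base = malloc(len + 2);

	if (base == NULL)
		return NULL;
	memcpy(base, str, len);
	base[len] = '\0';	/* first end-of-buffer byte */
	base[len + 1] = '\0';	/* second end-of-buffer byte */
	*size_out = len + 2;
	return base;
}
```

Inside a generated scanner, the result would then be handed over with something like yy_scan_buffer(base, size).
Since no copy is made, the caller presumably remains responsible for releasing base after the buffer handle has been deleted with yy_delete_buffer().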
                   2009: .Sh END-OF-FILE RULES
                   2010: The special rule
                   2011: .Qq Aq Aq EOF
                   2012: indicates actions which are to be taken when an end-of-file is encountered and
                   2013: .Fn yywrap
                   2014: returns non-zero
                   2015: .Pq i.e., indicates no further files to process .
                   2016: The action must finish by doing one of four things:
                   2017: .Bl -dash
                   2018: .It
                   2019: Assigning
                   2020: .Em yyin
                   2021: to a new input file
                   2022: (in previous versions of
                   2023: .Nm ,
                   2024: after doing the assignment, it was necessary to call the special action
                   2025: .Dv YY_NEW_FILE ;
                   2026: this is no longer necessary).
                   2027: .It
                   2028: Executing a
                   2029: .Em return
                   2030: statement.
                   2031: .It
                   2032: Executing the special
                   2033: .Fn yyterminate
                   2034: action.
                   2035: .It
                   2036: Switching to a new buffer using
                   2037: .Fn yy_switch_to_buffer
1.1       deraadt  2038: as shown in the example above.
1.16      jmc      2039: .El
                   2040: .Pp
                   2041: .Aq Aq EOF
                   2042: rules may not be used with other patterns;
                   2043: they may only be qualified with a list of start conditions.
                   2044: If an unqualified
                   2045: .Aq Aq EOF
                   2046: rule is given, it applies to all start conditions which do not already have
                   2047: .Aq Aq EOF
                   2048: actions.
                   2049: To specify an
                   2050: .Aq Aq EOF
                   2051: rule for only the initial start condition, use
                   2052: .Pp
                   2053: .Dl <INITIAL><<EOF>>
                   2054: .Pp
1.1       deraadt  2055: These rules are useful for catching things like unclosed comments.
                   2056: An example:
1.16      jmc      2057: .Bd -literal -offset indent
                   2058: %x quote
                   2059: %%
                   2060:
                   2061: \&...other rules for dealing with quotes...
                   2062:
                   2063: <quote><<EOF>> {
                   2064:          error("unterminated quote");
                   2065:          yyterminate();
                   2066: }
                   2067: <<EOF>> {
                   2068:          if (*++filelist)
                   2069:                  yyin = fopen(*filelist, "r");
                   2070:          else
                   2071:                  yyterminate();
                   2072: }
                   2073: .Ed
                   2074: .Sh MISCELLANEOUS MACROS
1.1       deraadt  2075: The macro
1.16      jmc      2076: .Dv YY_USER_ACTION
1.1       deraadt  2077: can be defined to provide an action
1.16      jmc      2078: which is always executed prior to the matched rule's action.
                   2079: For example,
1.1       deraadt  2080: it could be #define'd to call a routine to convert yytext to lower-case.
                   2081: When
1.16      jmc      2082: .Dv YY_USER_ACTION
1.1       deraadt  2083: is invoked, the variable
1.16      jmc      2084: .Fa yy_act
                   2085: gives the number of the matched rule
                   2086: .Pq rules are numbered starting with 1 .
                   2087: For example, to profile how often each rule is matched,
                   2088: the following would do the trick:
                   2089: .Pp
                   2090: .Dl #define YY_USER_ACTION ++ctr[yy_act]
                   2091: .Pp
1.1       deraadt  2092: where
1.16      jmc      2093: .Fa ctr
                   2094: is an array to hold the counts for the different rules.
                   2095: Note that the macro
                   2096: .Dv YY_NUM_RULES
                   2097: gives the total number of rules
                   2098: (including the default rule, even if
                   2099: .Fl s
                   2100: is used),
1.1       deraadt  2101: so a correct declaration for
1.16      jmc      2102: .Fa ctr
1.1       deraadt  2103: is:
1.16      jmc      2104: .Pp
                   2105: .Dl int ctr[YY_NUM_RULES];
                   2106: .Pp
1.1       deraadt  2107: The macro
1.16      jmc      2108: .Dv YY_USER_INIT
1.1       deraadt  2109: may be defined to provide an action which is always executed before
1.16      jmc      2110: the first scan
                   2111: .Pq and before the scanner's internal initializations are done .
1.1       deraadt  2112: For example, it could be used to call a routine to read
                   2113: in a data table or open a logging file.
1.16      jmc      2114: .Pp
1.1       deraadt  2115: The macro
1.16      jmc      2116: .Dv yy_set_interactive(is_interactive)
1.1       deraadt  2117: can be used to control whether the current buffer is considered
1.16      jmc      2118: .Em interactive .
1.1       deraadt  2119: An interactive buffer is processed more slowly,
                   2120: but must be used when the scanner's input source is indeed
                   2121: interactive to avoid problems due to waiting to fill buffers
                   2122: (see the discussion of the
1.16      jmc      2123: .Fl I
                   2124: flag below).
                   2125: A non-zero value in the macro invocation marks the buffer as interactive,
                   2126: a zero value as non-interactive.
                   2127: Note that use of this macro overrides
                   2128: .Dq %option always-interactive
                   2129: or
                   2130: .Dq %option never-interactive
                   2131: (see
                   2132: .Sx OPTIONS
                   2133: below).
                   2134: .Fn yy_set_interactive
1.1       deraadt  2135: must be invoked prior to beginning to scan the buffer that is
1.16      jmc      2136: .Pq or is not
                   2137: to be considered interactive.
                   2138: .Pp
1.1       deraadt  2139: The macro
1.16      jmc      2140: .Dv yy_set_bol(at_bol)
1.1       deraadt  2141: can be used to control whether the current buffer's scanning
                   2142: context for the next token match is done as though at the
1.16      jmc      2143: beginning of a line.
                   2144: A non-zero macro argument makes rules anchored with
                   2145: .Sq ^
                   2146: active, while a zero argument makes
                   2147: .Sq ^
                   2148: rules inactive.
                   2149: .Pp
1.1       deraadt  2150: The macro
1.16      jmc      2151: .Dv YY_AT_BOL
                   2152: returns true if the next token scanned from the current buffer will have
                   2153: .Sq ^
                   2154: rules active, false otherwise.
                   2155: .Pp
1.1       deraadt  2156: In the generated scanner, the actions are all gathered in one large
                   2157: switch statement and separated using
1.16      jmc      2158: .Dv YY_BREAK ,
                   2159: which may be redefined.
                   2160: By default, it is simply a
                   2161: .Qq break ,
                   2162: to separate each rule's action from the following rules.
1.1       deraadt  2163: Redefining
1.16      jmc      2164: .Dv YY_BREAK
1.1       deraadt  2165: allows, for example, C++ users to
1.16      jmc      2166: .Dq #define YY_BREAK
                   2167: to do nothing
                   2168: (while being very careful that every rule ends with a
                   2169: .Qq break
                   2170: or a
                   2171: .Qq return ! )
                   2172: to avoid suffering from unreachable statement warnings where because a rule's
                   2173: action ends with
                   2174: .Dq return ,
                   2175: the
                   2176: .Dv YY_BREAK
1.1       deraadt  2177: is inaccessible.
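The effect of an empty YY_BREAK can be imitated in plain C.
The sketch below is not generated scanner code: dispatch(), TOK_A and TOK_B are hypothetical names, and the switch only mimics the way actions are laid out, assuming every action ends with its own return.

```c
/*
 * Sketch only: imitates the layout of the action switch in a generated
 * scanner.  YY_BREAK is defined as empty here, so no "break" follows
 * the "return" statements and nothing unreachable is emitted.
 */
#define YY_BREAK

enum { TOK_A = 1, TOK_B = 2 };	/* hypothetical token codes */

static int
dispatch(int yy_act)
{
	switch (yy_act) {
	case 1:
		return TOK_A;	/* action ends with return */
		YY_BREAK
	case 2:
		return TOK_B;	/* action ends with return */
		YY_BREAK
	}
	return 0;		/* no rule matched */
}
```

With the default definition of YY_BREAK, each case would instead carry a break after the return, which is what some compilers report as unreachable.
If any action in such a switch lacked its own break or return, control would fall through into the next action.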
1.16      jmc      2178: .Sh VALUES AVAILABLE TO THE USER
1.1       deraadt  2179: This section summarizes the various values available to the user
                   2180: in the rule actions.
1.16      jmc      2181: .Bl -tag -width Ds
                   2182: .It char *yytext
                   2183: Holds the text of the current token.
                   2184: It may be modified but not lengthened
                   2185: .Pq characters cannot be appended to the end .
                   2186: .Pp
1.1       deraadt  2187: If the special directive
1.16      jmc      2188: .Dq %array
1.1       deraadt  2189: appears in the first section of the scanner description, then
1.16      jmc      2190: .Fa yytext
1.1       deraadt  2191: is instead declared
1.16      jmc      2192: .Dq char yytext[YYLMAX] ,
1.1       deraadt  2193: where
1.16      jmc      2194: .Dv YYLMAX
                   2195: is a macro definition that can be redefined in the first section
                   2196: to change the default value
                   2197: .Pq generally 8KB .
                   2198: Using
                   2199: .Dq %array
1.1       deraadt  2200: results in somewhat slower scanners, but the value of
1.16      jmc      2201: .Fa yytext
1.1       deraadt  2202: becomes immune to calls to
1.16      jmc      2203: .Fn input
1.1       deraadt  2204: and
1.16      jmc      2205: .Fn unput ,
1.1       deraadt  2206: which potentially destroy its value when
1.16      jmc      2207: .Fa yytext
                   2208: is a character pointer.
                   2209: The opposite of
                   2210: .Dq %array
1.1       deraadt  2211: is
1.16      jmc      2212: .Dq %pointer ,
1.1       deraadt  2213: which is the default.
1.16      jmc      2214: .Pp
                   2215: .Dq %array
                   2216: cannot be used when generating C++ scanner classes
1.1       deraadt  2217: (the
1.16      jmc      2218: .Fl +
1.1       deraadt  2219: flag).
1.16      jmc      2220: .It int yyleng
                   2221: Holds the length of the current token.
                   2222: .It FILE *yyin
                   2223: Is the file which by default
                   2224: .Nm
                   2225: reads from.
                   2226: It may be redefined, but doing so only makes sense before
                   2227: scanning begins or after an
                   2228: .Dv EOF
                   2229: has been encountered.
                   2230: Changing it in the midst of scanning will have unexpected results since
                   2231: .Nm
1.1       deraadt  2232: buffers its input; use
1.16      jmc      2233: .Fn yyrestart
1.1       deraadt  2234: instead.
                   2235: Once scanning terminates because an end-of-file
1.16      jmc      2236: has been seen,
                   2237: .Fa yyin
                   2238: can be assigned as the new input file
                   2239: and the scanner can be called again to continue scanning.
                   2240: .It void yyrestart(FILE *new_file)
                   2241: May be called to point
                   2242: .Fa yyin
                   2243: at the new input file.
                   2244: The switch-over to the new file is immediate
                   2245: .Pq any previously buffered-up input is lost .
                   2246: Note that calling
                   2247: .Fn yyrestart
1.1       deraadt  2248: with
1.16      jmc      2249: .Fa yyin
1.1       deraadt  2250: as an argument thus throws away the current input buffer and continues
                   2251: scanning the same input file.
1.16      jmc      2252: .It FILE *yyout
                   2253: Is the file to which
                   2254: .Em ECHO
                   2255: actions are done.
                   2256: It can be reassigned by the user.
                   2257: .It YY_CURRENT_BUFFER
                   2258: Returns a
                   2259: .Dv YY_BUFFER_STATE
1.1       deraadt  2260: handle to the current buffer.
1.16      jmc      2261: .It YY_START
                   2262: Returns an integer value corresponding to the current start condition.
                   2263: This value can subsequently be used with
                   2264: .Em BEGIN
1.1       deraadt  2265: to return to that start condition.
1.16      jmc      2266: .El
                   2267: .Sh INTERFACING WITH YACC
1.1       deraadt  2268: One of the main uses of
1.16      jmc      2269: .Nm
1.1       deraadt  2270: is as a companion to the
1.16      jmc      2271: .Xr yacc 1
1.1       deraadt  2272: parser-generator.
1.16      jmc      2273: yacc parsers expect to call a routine named
                   2274: .Fn yylex
                   2275: to find the next input token.
                   2276: The routine is supposed to return the type of the next token
                   2277: as well as putting any associated value in the global
1.17      jmc      2278: .Fa yylval ,
                   2279: which is defined externally,
                   2280: and can be a union or any other complex data structure.
1.1       deraadt  2281: To use
1.16      jmc      2282: .Nm
                   2283: with yacc, one specifies the
                   2284: .Fl d
                   2285: option to yacc to instruct it to generate the file
                   2286: .Pa y.tab.h
1.1       deraadt  2287: containing definitions of all the
1.16      jmc      2288: .Dq %tokens
                   2289: appearing in the yacc input.
                   2290: This file is then included in the
                   2291: .Nm
                   2292: scanner.
                   2293: For example, if one of the tokens is
                   2294: .Qq TOK_NUMBER ,
1.1       deraadt  2295: part of the scanner might look like:
1.16      jmc      2296: .Bd -literal -offset indent
                   2297: %{
                   2298: #include "y.tab.h"
                   2299: %}
                   2300:
                   2301: %%
                   2302:
                   2303: [0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
                   2304: .Ed
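The calling convention between the parser and the scanner can be illustrated without either tool.
In the sketch below, yylex() is a hand-written stand-in for the generated scanner, set_input() and cursor are hypothetical helpers, and the value chosen for TOK_NUMBER is an assumption standing in for what y.tab.h would define.

```c
#include <ctype.h>
#include <stdlib.h>

/* Stand-ins for what "yacc -d" would place in y.tab.h (value assumed). */
#define TOK_NUMBER 258
int yylval;

/* Hypothetical input cursor; a real flex scanner reads from yyin. */
static const char *cursor = "";

static void
set_input(const char *s)
{
	cursor = s;
}

/*
 * Hand-written stand-in for the generated yylex(): returns the token
 * type and leaves the token's value in yylval.  Returns 0 when the
 * input is exhausted, which a yacc parser treats as end of input.
 */
int
yylex(void)
{
	char *end;

	/* Skip everything that is not a digit. */
	while (*cursor != '\0' && !isdigit((unsigned char)*cursor))
		cursor++;
	if (*cursor == '\0')
		return 0;

	yylval = (int)strtol(cursor, &end, 10);
	cursor = end;
	return TOK_NUMBER;
}
```

A parser would call yylex() repeatedly, reading yylval after each TOK_NUMBER, until 0 is returned.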
                   2305: .Sh OPTIONS
                   2306: .Nm
1.1       deraadt  2307: has the following options:
1.16      jmc      2308: .Bl -tag -width Ds
                   2309: .It Fl 7
                   2310: Instructs
                   2311: .Nm
                   2312: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
                   2313: characters in its input.
                   2314: The advantage of using
                   2315: .Fl 7
1.1       deraadt  2316: is that the scanner's tables can be up to half the size of those generated
                   2317: using the
1.16      jmc      2318: .Fl 8
                   2319: option
                   2320: .Pq see below .
                   2321: The disadvantage is that such scanners often hang
1.1       deraadt  2322: or crash if their input contains an 8-bit character.
1.16      jmc      2323: .Pp
                   2324: Note, however, that unless generating a scanner using the
                   2325: .Fl Cf
1.1       deraadt  2326: or
1.16      jmc      2327: .Fl CF
1.1       deraadt  2328: table compression options, use of
1.16      jmc      2329: .Fl 7
                   2330: will save only a small amount of table space,
                   2331: and make the scanner considerably less portable.
                   2332: .Nm flex Ns 's
                   2333: default behavior is to generate an 8-bit scanner unless
                   2334: .Fl Cf
                   2335: or
                   2336: .Fl CF
                   2337: is specified, in which case
                   2338: .Nm
                   2339: defaults to generating 7-bit scanners unless it was
                   2340: configured to generate 8-bit scanners
                   2341: (as will often be the case with non-USA sites).
                   2342: It is possible to tell whether
                   2343: .Nm
                   2344: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
                   2345: .Fl v
                   2346: output as described below.
                   2347: .Pp
                   2348: Note that if
                   2349: .Fl Cfe
                   2350: or
                   2351: .Fl CFe
                   2352: is used
                   2353: (the table compression options, but also using equivalence classes as
                   2354: discussed below),
                   2355: .Nm
                   2356: still defaults to generating an 8-bit scanner,
                   2357: since usually with these compression options full 8-bit tables
1.1       deraadt  2358: are not much more expensive than 7-bit tables.
1.16      jmc      2359: .It Fl 8
                   2360: Instructs
                   2361: .Nm
1.1       deraadt  2362: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16      jmc      2363: characters.
                   2364: This flag is only needed for scanners generated using
                   2365: .Fl Cf
1.1       deraadt  2366: or
1.16      jmc      2367: .Fl CF ,
                   2368: as otherwise
                   2369: .Nm
                   2370: defaults to generating an 8-bit scanner anyway.
                   2371: .Pp
1.1       deraadt  2372: See the discussion of
1.16      jmc      2373: .Fl 7
                   2374: above for
                   2375: .Nm flex Ns 's
                   2376: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
                   2377: .It Fl B
                   2378: Instructs
                   2379: .Nm
                   2380: to generate a
                   2381: .Em batch
                   2382: scanner, the opposite of
                   2383: .Em interactive
                   2384: scanners generated by
                   2385: .Fl I
                   2386: .Pq see below .
                   2387: In general,
                   2388: .Fl B
                   2389: is used when the scanner will never be used interactively,
                   2390: and you want to squeeze a little more performance out of it.
                   2391: If the aim is instead to squeeze out a lot more performance,
                   2392: use the
                   2393: .Fl Cf
                   2394: or
                   2395: .Fl CF
                   2396: options
                   2397: .Pq discussed below ,
                   2398: which turn on
                   2399: .Fl B
                   2400: automatically anyway.
                   2401: .It Fl b
                   2402: Generate backing-up information to
                   2403: .Pa lex.backup .
                   2404: This is a list of scanner states which require backing up
                   2405: and the input characters on which they do so.
                   2406: By adding rules one can remove backing-up states.
                   2407: If all backing-up states are eliminated and
                   2408: .Fl Cf
                   2409: or
                   2410: .Fl CF
                   2411: is used, the generated scanner will run faster (see the
                   2412: .Fl p
                   2413: flag).
                   2414: Only users who wish to squeeze every last cycle out of their
                   2415: scanners need worry about this option.
                   2416: (See the section on
                   2417: .Sx PERFORMANCE CONSIDERATIONS
                   2418: below.)
                   2419: .It Fl C Ns Op Cm aeFfmr
                   2420: Controls the degree of table compression and, more generally, trade-offs
1.1       deraadt  2421: between small scanners and fast scanners.
1.16      jmc      2422: .Bl -tag -width Ds
                   2423: .It Fl Ca
                   2424: Instructs
                   2425: .Nm
                   2426: to trade off larger tables in the generated scanner for faster performance
                   2427: because the elements of the tables are better aligned for memory access
                   2428: and computation.
                   2429: On some
                   2430: .Tn RISC
                   2431: architectures, fetching and manipulating longwords is more efficient
                   2432: than with smaller-sized units such as shortwords.
                   2433: This option can double the size of the tables used by the scanner.
                   2434: .It Fl Ce
                   2435: Directs
                   2436: .Nm
1.1       deraadt  2437: to construct
1.16      jmc      2438: .Em equivalence classes ,
                   2439: i.e., sets of characters which have identical lexical properties
                   2440: (for example, if the only appearance of digits in the
                   2441: .Nm
1.1       deraadt  2442: input is in the character class
1.16      jmc      2443: .Qq [0-9]
                   2444: then the digits
                   2445: .Sq 0 ,
                   2446: .Sq 1 ,
                   2447: .Sq ... ,
                   2448: .Sq 9
                   2449: will all be put in the same equivalence class).
                   2450: Equivalence classes usually give dramatic reductions in the final
                   2451: table/object file sizes
                   2452: .Pq typically a factor of 2\-5
                   2453: and are pretty cheap performance-wise
                   2454: .Pq one array look-up per character scanned .
                   2455: .It Fl CF
                   2456: Specifies that the alternate fast scanner representation
                   2457: (described below under the
                   2458: .Fl F
                   2459: option)
                   2460: should be used.
                   2461: This option cannot be used with
                   2462: .Fl + .
                   2463: .It Fl Cf
                   2464: Specifies that the
                   2465: .Em full
                   2466: scanner tables should be generated \-
                   2467: .Nm
                   2468: should not compress the tables by taking advantage of
                   2469: similar transition functions for different states.
                   2470: .It Fl \&Cm
                   2471: Directs
                   2472: .Nm
1.1       deraadt  2473: to construct
1.16      jmc      2474: .Em meta-equivalence classes ,
                   2475: which are sets of equivalence classes
                   2476: (or characters, if equivalence classes are not being used)
                   2477: that are commonly used together.
                   2478: Meta-equivalence classes are often a big win when using compressed tables,
                   2479: but they have a moderate performance impact
                   2480: (one or two
                   2481: .Qq if
                   2482: tests and one array look-up per character scanned).
                   2483: .It Fl Cr
                   2484: Causes the generated scanner to
                   2485: .Em bypass
                   2486: use of the standard I/O library
                   2487: .Pq stdio
                   2488: for input.
                   2489: Instead of calling
                   2490: .Xr fread 3
1.1       deraadt  2491: or
1.16      jmc      2492: .Xr getc 3 ,
1.1       deraadt  2493: the scanner will use the
1.16      jmc      2494: .Xr read 2
                   2495: system call,
                   2496: resulting in a performance gain which varies from system to system,
                   2497: but in general is probably negligible unless
                   2498: .Fl Cf
1.1       deraadt  2499: or
1.16      jmc      2500: .Fl CF
                   2501: is being used.
1.1       deraadt  2502: Using
1.16      jmc      2503: .Fl Cr
                   2504: can cause strange behavior if, for example, reading from
                   2505: .Fa yyin
                   2506: using stdio prior to calling the scanner
                   2507: (because the scanner will miss whatever text previous reads left
                   2508: in the stdio input buffer).
                   2509: .Pp
                   2510: .Fl Cr
                   2511: has no effect if
                   2512: .Dv YY_INPUT
                   2513: is defined
                   2514: (see
                   2515: .Sx THE GENERATED SCANNER
                   2516: above).
                   2517: .El
                   2518: .Pp
1.1       deraadt  2519: A lone
1.16      jmc      2520: .Fl C
1.1       deraadt  2521: specifies that the scanner tables should be compressed but neither
                   2522: equivalence classes nor meta-equivalence classes should be used.
1.16      jmc      2523: .Pp
1.1       deraadt  2524: The options
1.16      jmc      2525: .Fl Cf
1.1       deraadt  2526: or
1.16      jmc      2527: .Fl CF
1.1       deraadt  2528: and
1.16      jmc      2529: .Fl \&Cm
                   2530: do not make sense together \- there is no opportunity for meta-equivalence
                   2531: classes if the table is not being compressed.
                   2532: Otherwise the options may be freely mixed, and are cumulative.
                   2533: .Pp
1.1       deraadt  2534: The default setting is
1.16      jmc      2535: .Fl Cem
1.1       deraadt  2536: which specifies that
1.16      jmc      2537: .Nm
                   2538: should generate equivalence classes and meta-equivalence classes.
                   2539: This setting provides the highest degree of table compression.
                   2540: It is possible to obtain faster-executing scanners at the cost of
                   2541: larger tables with the following generally being true:
                   2542: .Bd -unfilled -offset indent
                   2543: slowest & smallest
                   2544:       -Cem
                   2545:       -Cm
                   2546:       -Ce
                   2547:       -C
                   2548:       -C{f,F}e
                   2549:       -C{f,F}
                   2550:       -C{f,F}a
                   2551: fastest & largest
                   2552: .Ed
                   2553: .Pp
1.1       deraadt  2554: Note that scanners with the smallest tables are usually generated and
1.16      jmc      2555: compiled the quickest,
                   2556: so during development the default,
                   2557: maximal compression, is usually best.
                   2558: .Pp
                   2559: .Fl Cfe
                   2560: is often a good compromise between speed and size for production scanners.
                   2561: .It Fl d
                   2562: Makes the generated scanner run in debug mode.
                   2563: Whenever a pattern is recognized and the global
                   2564: .Fa yy_flex_debug
                   2565: is non-zero
                   2566: .Pq which is the default ,
                   2567: the scanner will write to stderr a line of the form:
                   2568: .Pp
                   2569: .D1 --accepting rule at line 53 ("the matched text")
                   2570: .Pp
                   2571: The line number refers to the location of the rule in the file
                   2572: defining the scanner
                   2573: (i.e., the file that was fed to
                   2574: .Nm ) .
                   2575: Messages are also generated when the scanner backs up,
                   2576: accepts the default rule,
                   2577: reaches the end of its input buffer
                   2578: (or encounters a NUL;
                   2579: at this point, the two look the same as far as the scanner's concerned),
                   2580: or reaches an end-of-file.
                   2581: .It Fl F
                   2582: Specifies that the fast scanner table representation should be used
                   2583: .Pq and stdio bypassed .
                   2584: This representation is about as fast as the full table representation
                   2585: .Pq Fl f ,
                   2586: and for some sets of patterns will be considerably smaller
                   2587: .Pq and for others, larger .
                   2588: In general, if the pattern set contains both
                   2589: .Qq keywords
                   2590: and a catch-all,
                   2591: .Qq identifier
                   2592: rule, such as in the set:
                   2593: .Bd -unfilled -offset indent
                   2594: "case"    return TOK_CASE;
                   2595: "switch"  return TOK_SWITCH;
                   2596: \&...
                   2597: "default" return TOK_DEFAULT;
                   2598: [a-z]+    return TOK_ID;
                   2599: .Ed
                   2600: .Pp
                   2601: then it's better to use the full table representation.
                   2602: If only the
                   2603: .Qq identifier
                   2604: rule is present and a hash table or some such is used to detect the keywords,
                   2605: it's better to use
                   2606: .Fl F .
                   2607: .Pp
                   2608: This option is equivalent to
                   2609: .Fl CFr
                   2610: .Pq see above .
                   2611: It cannot be used with
                   2612: .Fl + .
                   2613: .It Fl f
                   2614: Specifies
                   2615: .Em fast scanner .
                   2616: No table compression is done and stdio is bypassed.
                   2617: The result is large but fast.
                   2618: This option is equivalent to
                   2619: .Fl Cfr
                   2620: .Pq see above .
                   2621: .It Fl h
                   2622: Generates a help summary of
                   2623: .Nm flex Ns 's
                   2624: options to stdout and then exits.
                   2625: .Fl ?\&
                   2626: and
                   2627: .Fl Fl help
                   2628: are synonyms for
                   2629: .Fl h .
                   2630: .It Fl I
                   2631: Instructs
                   2632: .Nm
                   2633: to generate an
                   2634: .Em interactive
                   2635: scanner.
                   2636: An interactive scanner is one that only looks ahead to decide
                   2637: what token has been matched if it absolutely must.
                   2638: It turns out that always looking one extra character ahead,
                   2639: even if the scanner has already seen enough text
                   2640: to disambiguate the current token, is a bit faster than
                   2641: only looking ahead when necessary.
                   2642: But scanners that always look ahead give dreadful interactive performance;
                   2643: for example, when a user types a newline,
                   2644: it is not recognized as a newline token until they enter
                   2645: .Em another
                   2646: token, which often means typing in another whole line.
                   2647: .Pp
                   2648: .Nm
                   2649: scanners default to
                   2650: .Em interactive
                   2651: unless
                   2652: .Fl Cf
                   2653: or
                   2654: .Fl CF
                   2655: table-compression options are specified
                   2656: .Pq see above .
                   2657: That's because if high performance is most important,
                   2658: one of these options should be used,
                   2659: so if they weren't,
                   2660: .Nm
1.24      sobrado  2661: assumes it is preferable to trade off a bit of run-time performance for
1.16      jmc      2662: intuitive interactive behavior.
                   2663: Note also that
                   2664: .Fl I
                   2665: cannot be used in conjunction with
                   2666: .Fl Cf
                   2667: or
                   2668: .Fl CF .
                   2669: Thus, this option is not really needed; it is on by default for all those
                   2670: cases in which it is allowed.
                   2671: .Pp
                   2672: A scanner can be forced to not be interactive by using
                   2673: .Fl B
                   2674: .Pq see above .
                   2675: .It Fl i
                   2676: Instructs
                   2677: .Nm
                   2678: to generate a case-insensitive scanner.
                   2679: The case of letters given in the
                   2680: .Nm
                   2681: input patterns will be ignored,
                   2682: and tokens in the input will be matched regardless of case.
                   2683: The matched text given in
                   2684: .Fa yytext
                   2685: will have the preserved case
                   2686: .Pq i.e., it will not be folded .
                   2687: .It Fl L
                   2688: Instructs
                   2689: .Nm
                   2690: not to generate
                   2691: .Dq #line
                   2692: directives.
                   2693: Without this option,
                   2694: .Nm
                   2695: peppers the generated scanner with #line directives so error messages
                   2696: in the actions will be correctly located with respect to either the original
                   2697: .Nm
                   2698: input file
                   2699: (if the errors are due to code in the input file),
                   2700: or
                   2701: .Pa lex.yy.c
                   2702: (if the errors are
                   2703: .Nm flex Ns 's
                   2704: fault \- these sorts of errors should be reported to the email address
                   2705: given below).
                   2706: .It Fl l
1.36      schwarze 2707: Turns on maximum compatibility with the original
                   2708: .At
1.16      jmc      2709: .Nm lex
                   2710: implementation.
                   2711: Note that this does not mean full compatibility.
                   2712: Use of this option costs a considerable amount of performance,
                   2713: and it cannot be used with the
                   2714: .Fl + , f , F , Cf ,
                   2715: or
                   2716: .Fl CF
                   2717: options.
                   2718: For details on the compatibilities it provides, see the section
                   2719: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
                   2720: below.
                   2721: This option also results in the name
                   2722: .Dv YY_FLEX_LEX_COMPAT
                   2723: being #define'd in the generated scanner.
                   2724: .It Fl n
                   2725: Another do-nothing, deprecated option included only for
                   2726: .Tn POSIX
                   2727: compliance.
                   2728: .It Fl o Ns Ar output
                   2729: Directs
                   2730: .Nm
                   2731: to write the scanner to the file
                   2732: .Ar output
1.1       deraadt  2733: instead of
1.16      jmc      2734: .Pa lex.yy.c .
                   2735: If
                   2736: .Fl o
                   2737: is combined with the
                   2738: .Fl t
                   2739: option, then the scanner is written to stdout but its
                   2740: .Dq #line
                   2741: directives
                   2742: (see the
                   2743: .Fl L
                   2744: option above)
                   2745: refer to the file
                   2746: .Ar output .
                   2747: .It Fl P Ns Ar prefix
                   2748: Changes the default
                   2749: .Qq yy
1.1       deraadt  2750: prefix used by
1.16      jmc      2751: .Nm
1.6       aaron    2752: for all globally visible variable and function names to instead be
1.16      jmc      2753: .Ar prefix .
1.1       deraadt  2754: For example,
1.16      jmc      2755: .Fl P Ns Ar foo
1.1       deraadt  2756: changes the name of
1.16      jmc      2757: .Fa yytext
1.1       deraadt  2758: to
1.16      jmc      2759: .Fa footext .
1.1       deraadt  2760: It also changes the name of the default output file from
1.16      jmc      2761: .Pa lex.yy.c
1.1       deraadt  2762: to
1.16      jmc      2763: .Pa lex.foo.c .
1.1       deraadt  2764: Here are all of the names affected:
1.16      jmc      2765: .Bd -unfilled -offset indent
                   2766: yy_create_buffer
                   2767: yy_delete_buffer
                   2768: yy_flex_debug
                   2769: yy_init_buffer
                   2770: yy_flush_buffer
                   2771: yy_load_buffer_state
                   2772: yy_switch_to_buffer
                   2773: yyin
                   2774: yyleng
                   2775: yylex
                   2776: yylineno
                   2777: yyout
                   2778: yyrestart
                   2779: yytext
                   2780: yywrap
                   2781: .Ed
                   2782: .Pp
                   2783: (If using a C++ scanner, then only
                   2784: .Fa yywrap
1.1       deraadt  2785: and
1.16      jmc      2786: .Fa yyFlexLexer
1.1       deraadt  2787: are affected.)
1.16      jmc      2788: Within the scanner itself, it is still possible to refer to the global variables
1.1       deraadt  2789: and functions using either version of their name; but externally, they
                   2790: have the modified name.
1.16      jmc      2791: .Pp
                   2792: This option allows multiple
                   2793: .Nm
                   2794: programs to be easily linked together into the same executable.
                   2795: Note, though, that using this option also renames
                   2796: .Fn yywrap ,
                   2797: so now either an
                   2798: .Pq appropriately named
                   2799: version of the routine for the scanner must be supplied, or
                   2800: .Dq %option noyywrap
                   2801: must be used, as linking with
                   2802: .Fl lfl
                   2803: no longer provides one by default.
                   2804: .It Fl p
                   2805: Generates a performance report to stderr.
                   2806: The report consists of comments regarding features of the
                   2807: .Nm
                   2808: input file which will cause a serious loss of performance in the resulting
                   2809: scanner.
                   2810: If the flag is specified twice,
                   2811: comments regarding features that lead to minor performance losses
                   2812: will also be reported.
                   2813: .Pp
                   2814: Note that the use of
                   2815: .Em REJECT ,
                   2816: .Dq %option yylineno ,
                   2817: and variable trailing context
                   2818: (see the
                   2819: .Sx BUGS
                   2820: section below)
                   2821: entails a substantial performance penalty; use of
                   2822: .Fn yymore ,
                   2823: the
                   2824: .Sq ^
                   2825: operator, and the
                   2826: .Fl I
                   2827: flag entail minor performance penalties.
                   2828: .It Fl S Ns Ar skeleton
                   2829: Overrides the default skeleton file from which
                   2830: .Nm
                   2831: constructs its scanners.
                   2832: This option is needed only for
                   2833: .Nm
1.1       deraadt  2834: maintenance or development.
1.16      jmc      2835: .It Fl s
                   2836: Causes the default rule
                   2837: .Pq that unmatched scanner input is echoed to stdout
                   2838: to be suppressed.
                   2839: If the scanner encounters input that does not
                   2840: match any of its rules, it aborts with an error.
                   2841: This option is useful for finding holes in a scanner's rule set.
                   2842: .It Fl T
                   2843: Makes
                   2844: .Nm
                   2845: run in
                   2846: .Em trace
                   2847: mode.
                   2848: It will generate a lot of messages to stderr concerning
                   2849: the form of the input and the resultant non-deterministic and deterministic
                   2850: finite automata.
                   2851: This option is mostly for use in maintaining
                   2852: .Nm .
                   2853: .It Fl t
                   2854: Instructs
                   2855: .Nm
                   2856: to write the scanner it generates to standard output instead of
                   2857: .Pa lex.yy.c .
                   2858: .It Fl V
                   2859: Prints the version number to stdout and exits.
                   2860: .Fl Fl version
                   2861: is a synonym for
                   2862: .Fl V .
                   2863: .It Fl v
                   2864: Specifies that
                   2865: .Nm
                   2866: should write to stderr
                   2867: a summary of statistics regarding the scanner it generates.
                   2868: Most of the statistics are meaningless to the casual
                   2869: .Nm
                   2870: user, but the first line identifies the version of
                   2871: .Nm
                   2872: (same as reported by
                   2873: .Fl V ) ,
                   2874: and the next line the flags used when generating the scanner,
                   2875: including those that are on by default.
                   2876: .It Fl w
                   2877: Suppresses warning messages.
                   2878: .It Fl +
                   2879: Specifies that
                   2880: .Nm
                   2881: should generate a C++ scanner class.
                   2882: See the section on
                   2883: .Sx GENERATING C++ SCANNERS
                   2884: below for details.
                   2885: .El
                   2886: .Pp
                   2887: .Nm
1.1       deraadt  2888: also provides a mechanism for controlling options within the
1.16      jmc      2889: scanner specification itself, rather than from the
                   2890: .Nm
1.33      jmc      2891: command line.
1.1       deraadt  2892: This is done by including
1.16      jmc      2893: .Dq %option
1.1       deraadt  2894: directives in the first section of the scanner specification.
1.16      jmc      2895: Multiple options can be specified with a single
                   2896: .Dq %option
                   2897: directive, and multiple directives in the first section of the
                   2898: .Nm
                   2899: input file.
                   2900: .Pp
                   2901: Most options are given simply as names, optionally preceded by the word
                   2902: .Qq no
                   2903: .Pq with no intervening whitespace
                   2904: to negate their meaning.
                   2905: A number are equivalent to
                   2906: .Nm
                   2907: flags or their negation:
                   2908: .Bd -unfilled -offset indent
                   2909: 7bit            -7 option
                   2910: 8bit            -8 option
                   2911: align           -Ca option
                   2912: backup          -b option
                   2913: batch           -B option
                   2914: c++             -+ option
                   2915:
                   2916: caseful or
                   2917: case-sensitive  opposite of -i (default)
                   2918:
                   2919: case-insensitive or
                   2920: caseless        -i option
                   2921:
                   2922: debug           -d option
                   2923: default         opposite of -s option
                   2924: ecs             -Ce option
                   2925: fast            -F option
                   2926: full            -f option
                   2927: interactive     -I option
                   2928: lex-compat      -l option
                   2929: meta-ecs        -Cm option
                   2930: perf-report     -p option
                   2931: read            -Cr option
                   2932: stdout          -t option
                   2933: verbose         -v option
                   2934: warn            opposite of -w option
                   2935:                 (use "%option nowarn" for -w)
                   2936:
                   2937: array           equivalent to "%array"
                   2938: pointer         equivalent to "%pointer" (default)
                   2939: .Ed
                   2940: .Pp
                   2941: Some %option's provide features otherwise not available:
                   2942: .Bl -tag -width Ds
                   2943: .It always-interactive
                   2944: Instructs
                   2945: .Nm
                   2946: to generate a scanner which always considers its input
                   2947: .Qq interactive .
                   2948: Normally, on each new input file the scanner calls
                   2949: .Fn isatty
                   2950: in an attempt to determine whether the scanner's input source is interactive
                   2951: and thus should be read a character at a time.
                   2952: When this option is used, however, no such call is made.
                   2953: .It main
                   2954: Directs
                   2955: .Nm
                   2956: to provide a default
                   2957: .Fn main
1.1       deraadt  2958: program for the scanner, which simply calls
1.16      jmc      2959: .Fn yylex .
1.1       deraadt  2960: This option implies
1.16      jmc      2961: .Dq noyywrap
                   2962: .Pq see below .
                   2963: .It never-interactive
                   2964: Instructs
                   2965: .Nm
                   2966: to generate a scanner which never considers its input
                   2967: .Qq interactive
                   2968: (again, no call made to
                   2969: .Fn isatty ) .
1.1       deraadt  2970: This is the opposite of
1.16      jmc      2971: .Dq always-interactive .
                   2972: .It stack
                   2973: Enables the use of start condition stacks
                   2974: (see
                   2975: .Sx START CONDITIONS
                   2976: above).
                   2977: .It stdinit
                   2978: If set (i.e.,
                   2979: .Dq %option stdinit ) ,
1.1       deraadt  2980: initializes
1.16      jmc      2981: .Fa yyin
1.1       deraadt  2982: and
1.16      jmc      2983: .Fa yyout
                   2984: to stdin and stdout, instead of the default of
                   2985: .Dq nil .
1.1       deraadt  2986: Some existing
1.16      jmc      2987: .Nm lex
                   2988: programs depend on this behavior, even though it is not compliant with ANSI C,
                   2989: which does not require stdin and stdout to be compile-time constant.
                   2990: .It yylineno
                   2991: Directs
                   2992: .Nm
1.1       deraadt  2993: to generate a scanner that maintains the number of the current line
                   2994: read from its input in the global variable
1.16      jmc      2995: .Fa yylineno .
1.1       deraadt  2996: This option is implied by
1.16      jmc      2997: .Dq %option lex-compat .
                   2998: .It yywrap
                   2999: If unset (i.e.,
                   3000: .Dq %option noyywrap ) ,
1.1       deraadt  3001: makes the scanner not call
1.16      jmc      3002: .Fn yywrap
                   3003: upon an end-of-file, but simply assume that there are no more files to scan
                   3004: (until the user points
                   3005: .Fa yyin
1.1       deraadt  3006: at a new file and calls
1.16      jmc      3007: .Fn yylex
1.1       deraadt  3008: again).
1.16      jmc      3009: .El
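.Pp
As a minimal sketch of how such directives combine
(the particular options and the single keyword rule are arbitrary choices
made for illustration), a complete input file might read:
.Bd -literal -offset indent
%option main yylineno
%option caseless

%%
begin    printf("keyword at line %d\en", yylineno);
\&.|\en     ;
.Ed
.Pp
Because
.Dq main
implies
.Dq noyywrap ,
no
.Fn yywrap
routine needs to be supplied for this scanner.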
                   3010: .Pp
                   3011: .Nm
                   3012: scans rule actions to determine whether the
                   3013: .Em REJECT
                   3014: or
                   3015: .Fn yymore
                   3016: features are being used.
                   3017: The
                   3018: .Dq reject
1.1       deraadt  3019: and
1.16      jmc      3020: .Dq yymore
                   3021: options are available to override its decision as to whether to use the
1.1       deraadt  3022: options, either by setting them (e.g.,
1.16      jmc      3023: .Dq %option reject )
                   3024: to indicate the feature is indeed used,
                   3025: or unsetting them to indicate it actually is not used
1.1       deraadt  3026: (e.g.,
1.16      jmc      3027: .Dq %option noyymore ) .
                   3028: .Pp
                   3029: Three options take string-delimited values, offset with
                   3030: .Sq = :
                   3031: .Pp
                   3032: .D1 %option outfile="ABC"
                   3033: .Pp
1.1       deraadt  3034: is equivalent to
1.16      jmc      3035: .Fl o Ns Ar ABC ,
1.1       deraadt  3036: and
1.16      jmc      3037: .Pp
                   3038: .D1 %option prefix="XYZ"
                   3039: .Pp
1.1       deraadt  3040: is equivalent to
1.16      jmc      3041: .Fl P Ns Ar XYZ .
1.1       deraadt  3042: Finally,
1.16      jmc      3043: .Pp
                   3044: .D1 %option yyclass="foo"
                   3045: .Pp
                   3046: only applies when generating a C++ scanner
                   3047: .Pf ( Fl +
                   3048: option).
                   3049: It informs
                   3050: .Nm
                   3051: that
                   3052: .Dq foo
                   3053: has been derived as a subclass of yyFlexLexer, so
                   3054: .Nm
                   3055: will place actions in the member function
                   3056: .Dq foo::yylex()
1.1       deraadt  3057: instead of
1.16      jmc      3058: .Dq yyFlexLexer::yylex() .
1.1       deraadt  3059: It also generates a
1.16      jmc      3060: .Dq yyFlexLexer::yylex()
1.1       deraadt  3061: member function that emits a run-time error (by invoking
1.16      jmc      3062: .Dq yyFlexLexer::LexerError() )
1.1       deraadt  3063: if called.
1.16      jmc      3064: See
                   3065: .Sx GENERATING C++ SCANNERS ,
                   3066: below, for additional information.
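.Pp
As a sketch, value-taking options may share a directive with simple ones;
the file name
.Pa scan.c
and the prefix
.Qq foo
used here are placeholders chosen for illustration:
.Pp
.D1 %option outfile="scan.c" prefix="foo" noyywrap
.Pp
With these settings the scanner is written to
.Pa scan.c ,
its entry point is named
.Fn foolex ,
and no
.Fn yywrap
routine has to be supplied under the new prefix.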
                   3067: .Pp
                   3068: A number of options are available for
1.32      jmc      3069: lint
1.16      jmc      3070: purists who want to suppress the appearance of unneeded routines
                   3071: in the generated scanner.
                   3072: Each of the following, if unset
1.1       deraadt  3073: (e.g.,
1.16      jmc      3074: .Dq %option nounput ) ,
                   3075: results in the corresponding routine not appearing in the generated scanner:
                   3076: .Bd -unfilled -offset indent
                   3077: input, unput
                   3078: yy_push_state, yy_pop_state, yy_top_state
                   3079: yy_scan_buffer, yy_scan_bytes, yy_scan_string
                   3080: .Ed
                   3081: .Pp
1.1       deraadt  3082: (though
1.16      jmc      3083: .Fn yy_push_state
                   3084: and friends won't appear anyway unless
                   3085: .Dq %option stack
                   3086: is being used).
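.Pp
For instance, a scanner whose actions call neither
.Fn input
nor
.Fn unput
might begin with the following directive
(a sketch; which routines are actually unused depends on the scanner):
.Pp
.D1 %option noinput nounput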
                   3087: .Sh PERFORMANCE CONSIDERATIONS
1.1       deraadt  3088: The main design goal of
1.16      jmc      3089: .Nm
                   3090: is that it generate high-performance scanners.
                   3091: It has been optimized for dealing well with large sets of rules.
                   3092: Aside from the effects on scanner speed of the table compression
                   3093: .Fl C
1.1       deraadt  3094: options outlined above,
1.16      jmc      3095: there are a number of options/actions which degrade performance.
                   3096: These are, from most expensive to least:
                   3097: .Bd -unfilled -offset indent
                   3098: REJECT
                   3099: %option yylineno
                   3100: arbitrary trailing context
                   3101:
                   3102: pattern sets that require backing up
                   3103: %array
                   3104: %option interactive
                   3105: %option always-interactive
                   3106:
                   3107: \&'^' beginning-of-line operator
                   3108: yymore()
                   3109: .Ed
                   3110: .Pp
                   3111: with the first three all being quite expensive
                   3112: and the last two being quite cheap.
                   3113: Note also that
                   3114: .Fn unput
                   3115: is implemented as a routine call that potentially does quite a bit of work,
                   3116: while
                   3117: .Fn yyless
                   3118: is a quite-cheap macro; so if just putting back some excess text,
                   3119: use
                   3120: .Fn yyless .
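.Pp
A minimal sketch of the cheaper form follows;
the pattern and the token name are illustrative only:
.Bd -literal -offset indent
%%
foobar    {
        /* keep "foo" as the token; rescan "bar" */
        yyless(3);
        return TOK_KEYWORD;
}
.Ed
.Pp
Here
.Fn yyless 3
returns all but the first three characters of the match to the input,
where they are scanned again as part of the next token.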
                   3121: .Pp
                   3122: .Em REJECT
1.1       deraadt  3123: should be avoided at all costs when performance is important.
                   3124: It is a particularly expensive option.
1.16      jmc      3125: .Pp
1.1       deraadt  3126: Getting rid of backing up is messy and often may be an enormous
1.16      jmc      3127: amount of work for a complicated scanner.
                   3128: In principle, one begins by using the
                   3129: .Fl b
1.1       deraadt  3130: flag to generate a
1.16      jmc      3131: .Pa lex.backup
                   3132: file.
                   3133: For example, on the input
                   3134: .Bd -literal -offset indent
                   3135: %%
                   3136: foo        return TOK_KEYWORD;
                   3137: foobar     return TOK_KEYWORD;
                   3138: .Ed
                   3139: .Pp
1.1       deraadt  3140: the file looks like:
1.16      jmc      3141: .Bd -literal -offset indent
                   3142: State #6 is non-accepting -
                   3143:  associated rule line numbers:
                   3144:        2       3
                   3145:  out-transitions: [ o ]
                   3146:  jam-transitions: EOF [ \e001-n  p-\e177 ]
                   3147:
                   3148: State #8 is non-accepting -
                   3149:  associated rule line numbers:
                   3150:        3
                   3151:  out-transitions: [ a ]
                   3152:  jam-transitions: EOF [ \e001-`  b-\e177 ]
                   3153:
                   3154: State #9 is non-accepting -
                   3155:  associated rule line numbers:
                   3156:        3
                   3157:  out-transitions: [ r ]
                   3158:  jam-transitions: EOF [ \e001-q  s-\e177 ]
                   3159:
                   3160: Compressed tables always back up.
                   3161: .Ed
                   3162: .Pp
1.1       deraadt  3163: The first few lines tell us that there's a scanner state in
1.16      jmc      3164: which it can make a transition on an
                   3165: .Sq o
                   3166: but not on any other character,
                   3167: and that in that state the currently scanned text does not match any rule.
                   3168: The state occurs when trying to match the rules found
1.1       deraadt  3169: at lines 2 and 3 in the input file.
1.16      jmc      3170: If the scanner is in that state and then reads something other than an
                   3171: .Sq o ,
                   3172: it will have to back up to find a rule which is matched.
                   3173: With a bit of headscratching one can see that this must be the
                   3174: state it's in when it has seen
                   3175: .Sq fo .
                   3176: When this has happened, if anything other than another
                   3177: .Sq o
                   3178: is seen, the scanner will have to back up to simply match the
                   3179: .Sq f
                   3180: .Pq by the default rule .
                   3181: .Pp
                   3182: The comment regarding State #8 indicates there's a problem when
                   3183: .Qq foob
                   3184: has been scanned.
                   3185: Indeed, on any character other than an
                   3186: .Sq a ,
                   3187: the scanner will have to back up to accept
                   3188: .Qq foo .
                   3189: Similarly, the comment for State #9 concerns when
                   3190: .Qq fooba
                   3191: has been scanned and an
                   3192: .Sq r
                   3193: does not follow.
                   3194: .Pp
1.1       deraadt  3195: The final comment reminds us that there's no point going to
1.16      jmc      3196: all the trouble of removing backing up from the rules unless we're using
                   3197: .Fl Cf
1.1       deraadt  3198: or
1.16      jmc      3199: .Fl CF ,
1.1       deraadt  3200: since there's no performance gain doing so with compressed scanners.
1.16      jmc      3201: .Pp
                   3202: The way to remove the backing up is to add
                   3203: .Qq error
                   3204: rules:
                   3205: .Bd -literal -offset indent
                   3206: %%
                   3207: foo    return TOK_KEYWORD;
                   3208: foobar return TOK_KEYWORD;
                   3209:
                   3210: fooba  |
                   3211: foob   |
                   3212: fo {
                   3213:         /* false alarm, not really a keyword */
                   3214:         return TOK_ID;
                   3215: }
                   3216: .Ed
                   3217: .Pp
                   3218: Eliminating backing up among a list of keywords can also be done using a
                   3219: .Qq catch-all
                   3220: rule:
                   3221: .Bd -literal -offset indent
                   3222: %%
                   3223: foo    return TOK_KEYWORD;
                   3224: foobar return TOK_KEYWORD;
                   3225:
                   3226: [a-z]+ return TOK_ID;
                   3227: .Ed
                   3228: .Pp
1.1       deraadt  3229: This is usually the best solution when appropriate.
1.16      jmc      3230: .Pp
1.1       deraadt  3231: Backing up messages tend to cascade.
1.16      jmc      3232: With a complicated set of rules it's not uncommon to get hundreds of messages.
                   3233: If one can decipher them, though,
                   3234: it often only takes a dozen or so rules to eliminate the backing up
                   3235: (though it's easy to make a mistake and have an error rule accidentally match
                   3236: a valid token; a possible future
                   3237: .Nm
1.1       deraadt  3238: feature will be to automatically add rules to eliminate backing up).
1.16      jmc      3239: .Pp
                   3240: It's important to keep in mind that the benefits of eliminating
                   3241: backing up are gained only if
                   3242: .Em every
                   3243: instance of backing up is eliminated.
                   3244: Leaving just one gains nothing.
                   3245: .Pp
                   3246: .Em Variable
                   3247: trailing context
                   3248: (where neither the leading nor the trailing part has a fixed length)
                   3249: entails almost the same performance loss as
                   3250: .Em REJECT
                   3251: .Pq i.e., substantial .
                   3252: So when possible a rule like:
                   3253: .Bd -literal -offset indent
                   3254: %%
                   3255: mouse|rat/(cat|dog)   run();
                   3256: .Ed
                   3257: .Pp
1.1       deraadt  3258: is better written:
1.16      jmc      3259: .Bd -literal -offset indent
                   3260: %%
                   3261: mouse/cat|dog         run();
                   3262: rat/cat|dog           run();
                   3263: .Ed
                   3264: .Pp
1.1       deraadt  3265: or as
1.16      jmc      3266: .Bd -literal -offset indent
                   3267: %%
                   3268: mouse|rat/cat         run();
                   3269: mouse|rat/dog         run();
                   3270: .Ed
                   3271: .Pp
                   3272: Note that here the special
                   3273: .Sq |\&
                   3274: action does not provide any savings, and can even make things worse (see
                   3275: .Sx BUGS
                   3276: below).
                   3277: .Pp
1.1       deraadt  3278: Another area where the user can increase a scanner's performance
1.16      jmc      3279: .Pq and one that's easier to implement
                   3280: arises from the fact that the longer the tokens matched,
                   3281: the faster the scanner will run.
1.1       deraadt  3282: This is because with long tokens the processing of most input
1.16      jmc      3283: characters takes place in the
                   3284: .Pq short
                   3285: inner scanning loop, and does not often have to go through the additional work
                   3286: of setting up the scanning environment (e.g.,
                   3287: .Fa yytext )
                   3288: for the action.
                   3289: Recall the scanner for C comments:
                   3290: .Bd -literal -offset indent
                   3291: %x comment
                   3292: %%
                   3293: int line_num = 1;
                   3294:
                   3295: "/*"                    BEGIN(comment);
                   3296:
                   3297: <comment>[^*\en]*
                   3298: <comment>"*"+[^*/\en]*
                   3299: <comment>\en             ++line_num;
                   3300: <comment>"*"+"/"        BEGIN(INITIAL);
                   3301: .Ed
                   3302: .Pp
1.1       deraadt  3303: This could be sped up by writing it as:
1.16      jmc      3304: .Bd -literal -offset indent
                   3305: %x comment
                   3306: %%
                   3307: int line_num = 1;
                   3308:
                   3309: "/*"                    BEGIN(comment);
                   3310:
                   3311: <comment>[^*\en]*
                   3312: <comment>[^*\en]*\en      ++line_num;
                   3313: <comment>"*"+[^*/\en]*
                   3314: <comment>"*"+[^*/\en]*\en ++line_num;
                   3315: <comment>"*"+"/"        BEGIN(INITIAL);
                   3316: .Ed
                   3317: .Pp
                   3318: Now instead of each newline requiring the processing of another action,
                   3319: recognizing the newlines is
                   3320: .Qq distributed
                   3321: over the other rules to keep the matched text as long as possible.
                   3322: Note that adding rules does
                   3323: .Em not
                   3324: slow down the scanner!
                   3325: The speed of the scanner is independent of the number of rules or
                   3326: (modulo the considerations given at the beginning of this section)
                   3327: how complicated the rules are with regard to operators such as
                   3328: .Sq *
                   3329: and
                   3330: .Sq |\& .
                   3331: .Pp
                   3332: A final example in speeding up a scanner:
                   3333: scan through a file containing identifiers and keywords, one per line
                   3334: and with no other extraneous characters, and recognize all the keywords.
                   3335: A natural first approach is:
                   3336: .Bd -literal -offset indent
                   3337: %%
                   3338: asm      |
                   3339: auto     |
                   3340: break    |
                   3341: \&... etc ...
                   3342: volatile |
                   3343: while    /* it's a keyword */
                   3344:
                   3345: \&.|\en     /* it's not a keyword */
                   3346: .Ed
                   3347: .Pp
1.1       deraadt  3348: To eliminate the backing up, introduce a catch-all rule:
1.16      jmc      3349: .Bd -literal -offset indent
                   3350: %%
                   3351: asm      |
                   3352: auto     |
                   3353: break    |
                   3354: \&... etc ...
                   3355: volatile |
                   3356: while    /* it's a keyword */
                   3357:
                   3358: [a-z]+   |
                   3359: \&.|\en     /* it's not a keyword */
                   3360: .Ed
                   3361: .Pp
1.1       deraadt  3362: Now, if it's guaranteed that there's exactly one word per line,
                   3363: then we can reduce the total number of matches by a half by
1.16      jmc      3364: merging in the recognition of newlines with that of the other tokens:
                   3365: .Bd -literal -offset indent
                   3366: %%
                   3367: asm\en      |
                   3368: auto\en     |
                   3369: break\en    |
                   3370: \&... etc ...
                   3371: volatile\en |
                   3372: while\en    /* it's a keyword */
                   3373:
                   3374: [a-z]+\en   |
                   3375: \&.|\en       /* it's not a keyword */
                   3376: .Ed
                   3377: .Pp
                   3378: One has to be careful here,
                   3379: as we have now reintroduced backing up into the scanner.
                   3380: In particular, while we know that there will never be any characters
                   3381: in the input stream other than letters or newlines,
                   3382: .Nm
1.1       deraadt  3383: can't figure this out, and it will plan for possibly needing to back up
1.16      jmc      3384: when it has scanned a token like
                   3385: .Qq auto
                   3386: and then the next character is something other than a newline or a letter.
                   3387: Previously it would then just match the
                   3388: .Qq auto
                   3389: rule and be done, but now it has no
                   3390: .Qq auto
                   3391: rule, only an
                   3392: .Qq auto\en
                   3393: rule.
                   3394: To eliminate the possibility of backing up,
1.40      jmc      3395: we could either duplicate all rules but without final newlines or,
1.1       deraadt  3396: since we never expect to encounter such an input and therefore don't
1.16      jmc      3397: care how it's classified, we can introduce one more catch-all rule,
                   3398: one that doesn't include a newline:
                   3399: .Bd -literal -offset indent
                   3400: %%
                   3401: asm\en      |
                   3402: auto\en     |
                   3403: break\en    |
                   3404: \&... etc ...
                   3405: volatile\en |
                   3406: while\en    /* it's a keyword */
                   3407:
                   3408: [a-z]+\en   |
                   3409: [a-z]+     |
                   3410: \&.|\en       /* it's not a keyword */
                   3411: .Ed
                   3412: .Pp
1.1       deraadt  3413: Compiled with
1.16      jmc      3414: .Fl Cf ,
1.1       deraadt  3415: this is about as fast as one can get a
1.16      jmc      3416: .Nm
1.1       deraadt  3417: scanner to go for this particular problem.
1.16      jmc      3418: .Pp
1.1       deraadt  3419: A final note:
1.16      jmc      3420: .Nm
                   3421: is slow when matching NUL's,
                   3422: particularly when a token contains multiple NUL's.
                   3423: It's best to write rules which match short
1.1       deraadt  3424: amounts of text if it's anticipated that the text will often include NUL's.
1.16      jmc      3425: .Pp
1.1       deraadt  3426: Another final note regarding performance: as mentioned above in the section
1.16      jmc      3427: .Sx HOW THE INPUT IS MATCHED ,
                   3428: dynamically resizing
                   3429: .Fa yytext
1.1       deraadt  3430: to accommodate huge tokens is a slow process because it presently requires that
1.16      jmc      3431: the
                   3432: .Pq huge
                   3433: token be rescanned from the beginning.
                   3434: Thus if performance is vital, it is better to attempt to match
                   3435: .Qq large
                   3436: quantities of text but not
                   3437: .Qq huge
                   3438: quantities, where the cutoff between the two is at about 8K characters/token.
                   3439: .Sh GENERATING C++ SCANNERS
                   3440: .Nm
                   3441: provides two different ways to generate scanners for use with C++.
                   3442: The first way is to simply compile a scanner generated by
                   3443: .Nm
                   3444: using a C++ compiler instead of a C compiler.
                   3445: This should not generate any compilation errors
                   3446: (please report any found to the email address given in the
                   3447: .Sx AUTHORS
                   3448: section below).
                   3449: C++ code can then be used in rule actions instead of C code.
                   3450: Note that the default input source for scanners remains
                   3451: .Fa yyin ,
1.1       deraadt  3452: and default echoing is still done to
1.16      jmc      3453: .Fa yyout .
1.1       deraadt  3454: Both of these remain
1.16      jmc      3455: .Fa FILE *
                   3456: variables and not C++ streams.
                   3457: .Pp
                   3458: .Nm
                   3459: can also be used to generate a C++ scanner class, using the
                   3460: .Fl +
1.1       deraadt  3461: option (or, equivalently,
1.16      jmc      3462: .Dq %option c++ ) ,
                   3463: which is automatically specified if the name of the flex executable ends in a
                   3464: .Sq + ,
                   3465: such as
                   3466: .Nm flex++ .
                   3467: When using this option,
                   3468: .Nm
                   3469: defaults to generating the scanner to the file
                   3470: .Pa lex.yy.cc
1.1       deraadt  3471: instead of
1.16      jmc      3472: .Pa lex.yy.c .
1.1       deraadt  3473: The generated scanner includes the header file
1.38      bentley  3474: .In g++/FlexLexer.h ,
1.1       deraadt  3475: which defines the interface to two C++ classes.
1.16      jmc      3476: .Pp
1.1       deraadt  3477: The first class,
1.16      jmc      3478: .Em FlexLexer ,
                   3479: is an abstract base class defining the general scanner class interface.
                   3480: It provides the following member functions:
                   3481: .Bl -tag -width Ds
                   3482: .It const char* YYText()
                   3483: Returns the text of the most recently matched token, the equivalent of
                   3484: .Fa yytext .
                   3485: .It int YYLeng()
                   3486: Returns the length of the most recently matched token, the equivalent of
                   3487: .Fa yyleng .
                   3488: .It int lineno() const
                   3489: Returns the current input line number
1.1       deraadt  3490: (see
1.16      jmc      3491: .Dq %option yylineno ) ,
                   3492: or 1 if
                   3493: .Dq %option yylineno
1.1       deraadt  3494: was not used.
1.16      jmc      3495: .It void set_debug(int flag)
                   3496: Sets the debugging flag for the scanner, equivalent to assigning to
                   3497: .Fa yy_flex_debug
                   3498: (see the
                   3499: .Sx OPTIONS
                   3500: section above).
                   3501: Note that the scanner must be built using
                   3502: .Dq %option debug
1.1       deraadt  3503: to include debugging information in it.
1.16      jmc      3504: .It int debug() const
                   3505: Returns the current setting of the debugging flag.
                   3506: .El
                   3507: .Pp
1.1       deraadt  3508: Also provided are member functions equivalent to
1.16      jmc      3509: .Fn yy_switch_to_buffer ,
                   3510: .Fn yy_create_buffer
1.1       deraadt  3511: (though the first argument is an
1.18      espie    3512: .Fa std::istream*
1.1       deraadt  3513: object pointer and not a
1.16      jmc      3514: .Fa FILE* ) ,
                   3515: .Fn yy_flush_buffer ,
                   3516: .Fn yy_delete_buffer ,
1.1       deraadt  3517: and
1.16      jmc      3518: .Fn yyrestart
1.10      deraadt  3519: (again, the first argument is an
1.18      espie    3520: .Fa std::istream*
1.1       deraadt  3521: object pointer).
1.16      jmc      3522: .Pp
1.1       deraadt  3523: The second class defined in
1.38      bentley  3524: .In g++/FlexLexer.h
1.1       deraadt  3525: is
1.16      jmc      3526: .Fa yyFlexLexer ,
1.1       deraadt  3527: which is derived from
1.16      jmc      3528: .Fa FlexLexer .
1.1       deraadt  3529: It defines the following additional member functions:
1.16      jmc      3530: .Bl -tag -width Ds
1.18      espie    3531: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16      jmc      3532: Constructs a
                   3533: .Fa yyFlexLexer
                   3534: object using the given streams for input and output.
                   3535: If not specified, the streams default to
                   3536: .Fa cin
1.1       deraadt  3537: and
1.16      jmc      3538: .Fa cout ,
1.1       deraadt  3539: respectively.
1.16      jmc      3540: .It virtual int yylex()
                   3541: Performs the same role as
                   3542: .Fn yylex
1.1       deraadt  3543: does for ordinary flex scanners: it scans the input stream, consuming
1.16      jmc      3544: tokens, until a rule's action returns a value.
                   3545: If subclass
                   3546: .Sq S
                   3547: is derived from
                   3548: .Fa yyFlexLexer ,
                   3549: in order to access the member functions and variables of
                   3550: .Sq S
1.1       deraadt  3551: inside
1.16      jmc      3552: .Fn yylex ,
                   3553: use
                   3554: .Dq %option yyclass="S"
1.1       deraadt  3555: to inform
1.16      jmc      3556: .Nm
                   3557: that the
                   3558: .Sq S
                   3559: subclass will be used instead of
                   3560: .Fa yyFlexLexer .
1.1       deraadt  3561: In this case, rather than generating
1.16      jmc      3562: .Dq yyFlexLexer::yylex() ,
                   3563: .Nm
1.1       deraadt  3564: generates
1.16      jmc      3565: .Dq S::yylex()
1.1       deraadt  3566: (and also generates a dummy
1.16      jmc      3567: .Dq yyFlexLexer::yylex()
1.1       deraadt  3568: that calls
1.16      jmc      3569: .Dq yyFlexLexer::LexerError()
1.1       deraadt  3570: if called).
1.18      espie    3571: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16      jmc      3572: Reassigns
                   3573: .Fa yyin
1.1       deraadt  3574: to
1.16      jmc      3575: .Fa new_in
                   3576: .Pq if non-nil
1.1       deraadt  3577: and
1.16      jmc      3578: .Fa yyout
1.1       deraadt  3579: to
1.16      jmc      3580: .Fa new_out
                   3581: .Pq ditto ,
                   3582: deleting the previous input buffer if
                   3583: .Fa yyin
1.1       deraadt  3584: is reassigned.
1.18      espie    3585: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16      jmc      3586: First switches the input streams via
                   3587: .Dq switch_streams(new_in, new_out)
1.1       deraadt  3588: and then returns the value of
1.16      jmc      3589: .Fn yylex .
                   3590: .El
                   3591: .Pp
1.1       deraadt  3592: In addition,
1.16      jmc      3593: .Fa yyFlexLexer
                   3594: defines the following protected virtual functions which can be redefined
1.1       deraadt  3595: in derived classes to tailor the scanner:
1.16      jmc      3596: .Bl -tag -width Ds
                   3597: .It virtual int LexerInput(char* buf, int max_size)
                   3598: Reads up to
                   3599: .Fa max_size
1.1       deraadt  3600: characters into
1.16      jmc      3601: .Fa buf
                   3602: and returns the number of characters read.
                   3603: To indicate end-of-input, return 0 characters.
                   3604: Note that
                   3605: .Qq interactive
                   3606: scanners (see the
                   3607: .Fl B
1.1       deraadt  3608: and
1.16      jmc      3609: .Fl I
1.1       deraadt  3610: flags) define the macro
1.16      jmc      3611: .Dv YY_INTERACTIVE .
                   3612: If
                   3613: .Fn LexerInput
                   3614: has been redefined, and it's necessary to take different actions depending on
                   3615: whether or not the scanner might be scanning an interactive input source,
                   3616: it's possible to test for the presence of this name via
                   3617: .Dq #ifdef .
                   3618: .It virtual void LexerOutput(const char* buf, int size)
                   3619: Writes out
                   3620: .Fa size
1.1       deraadt  3621: characters from the buffer
1.16      jmc      3622: .Fa buf ,
                   3623: which, while NUL-terminated, may also contain
                   3624: .Qq internal
                   3625: NUL's if the scanner's rules can match text with NUL's in them.
                   3626: .It virtual void LexerError(const char* msg)
                   3627: Reports a fatal error message.
                   3628: The default version of this function writes the message to the stream
                   3629: .Fa cerr
1.1       deraadt  3630: and exits.
1.16      jmc      3631: .El
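.Pp
The contract described above for
.Fn LexerInput
can be sketched outside of any scanner.
The following is only an illustration:
.Fn read_chunk
is a hypothetical free function, not part of
.Nm ,
and reading from a
.Fa std::istream
is an assumption about what an override might do.
It copies at most
.Fa max_size
characters and returns 0 once the input is exhausted:
.Bd -literal -offset indent
```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper (not part of flex): follows the LexerInput contract
// of reading up to max_size characters and returning how many were read.
int read_chunk(std::istream& in, char* buf, int max_size)
{
    if (max_size <= 0 || !in.good())
        return 0;               // nothing more to deliver: end-of-input
    in.read(buf, max_size);     // may stop short at the end of the stream
    return static_cast<int>(in.gcount());
}

// Small self-check: eight characters drained through a five-byte buffer.
void read_chunk_demo()
{
    std::istringstream in("abcdefgh");
    char buf[5];
    assert(read_chunk(in, buf, 5) == 5);  // "abcde"
    assert(read_chunk(in, buf, 5) == 3);  // "fgh", stream is now at EOF
    assert(read_chunk(in, buf, 5) == 0);  // end-of-input
}
```
.Ed
.Pp
A redefined
.Fn LexerInput
in a derived class could forward its arguments to a helper of this kind.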
                   3632: .Pp
1.1       deraadt  3633: Note that a
1.16      jmc      3634: .Fa yyFlexLexer
                   3635: object contains its entire scanning state.
                   3636: Thus such objects can be used to create reentrant scanners.
                   3637: Multiple instances of the same
                   3638: .Fa yyFlexLexer
                   3639: class can be instantiated, and multiple C++ scanner classes can be combined
1.1       deraadt  3640: in the same program using the
1.16      jmc      3641: .Fl P
1.1       deraadt  3642: option discussed above.
1.16      jmc      3643: .Pp
1.1       deraadt  3644: Finally, note that the
1.16      jmc      3645: .Dq %array
                   3646: feature is not available to C++ scanner classes;
                   3647: .Dq %pointer
                   3648: must be used
                   3649: .Pq the default .
                   3650: .Pp
1.1       deraadt  3651: Here is an example of a simple C++ scanner:
1.16      jmc      3652: .Bd -literal -offset indent
                   3653: // An example of using the flex C++ scanner class.
1.1       deraadt  3654:
1.16      jmc      3655: %{
                   3656: #include <errno.h>
                   3657: int mylineno = 0;
                   3658: %}
1.1       deraadt  3659:
1.16      jmc      3660: string  \e"[^\en"]+\e"
1.1       deraadt  3661:
1.16      jmc      3662: ws      [ \et]+
1.1       deraadt  3663:
1.16      jmc      3664: alpha   [A-Za-z]
                   3665: dig     [0-9]
                   3666: name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
                   3667: num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
                   3668: num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
                   3669: number  {num1}|{num2}
1.1       deraadt  3670:
1.16      jmc      3671: %%
1.1       deraadt  3672:
1.16      jmc      3673: {ws}    /* skip blanks and tabs */
1.1       deraadt  3674:
1.16      jmc      3675: "/*" {
                   3676:         int c;
1.1       deraadt  3677:
1.16      jmc      3678:         while ((c = yyinput()) != 0) {
                   3679:                 if(c == '\en')
1.1       deraadt  3680:                     ++mylineno;
1.16      jmc      3681:                 else if(c == '*') {
                   3682:                     if ((c = yyinput()) == '/')
1.1       deraadt  3683:                         break;
                   3684:                     else
                   3685:                         unput(c);
                   3686:                 }
1.16      jmc      3687:         }
                   3688: }
1.1       deraadt  3689:
1.16      jmc      3690: {number}  std::cout << "number " << YYText() << '\en';
1.1       deraadt  3691:
1.16      jmc      3692: \en        mylineno++;
1.1       deraadt  3693:
1.16      jmc      3694: {name}    std::cout << "name " << YYText() << '\en';
1.1       deraadt  3695:
1.16      jmc      3696: {string}  std::cout << "string " << YYText() << '\en';
                   3697:
                   3698: %%
                   3699:
                   3700: int main(int /* argc */, char** /* argv */)
                   3701: {
                   3702:        FlexLexer* lexer = new yyFlexLexer;
                   3703:        while(lexer->yylex() != 0)
                   3704:            ;
                   3705:        return 0;
                   3706: }
                   3707: .Ed
                   3708: .Pp
                   3709: To create multiple
                   3710: .Pq different
                   3711: lexer classes, use the
                   3712: .Fl P
                   3713: flag
                   3714: (or the
                   3715: .Dq prefix=
                   3716: option)
                   3717: to rename each
                   3718: .Fa yyFlexLexer
1.1       deraadt  3719: to some other
1.16      jmc      3720: .Fa xxFlexLexer .
1.38      bentley  3721: .In g++/FlexLexer.h
1.16      jmc      3722: can then be included in other sources once per lexer class, first renaming
                   3723: .Fa yyFlexLexer
1.1       deraadt  3724: as follows:
1.16      jmc      3725: .Bd -literal -offset indent
                   3726: #undef yyFlexLexer
                   3727: #define yyFlexLexer xxFlexLexer
                   3728: #include <g++/FlexLexer.h>
                   3729:
                   3730: #undef yyFlexLexer
                   3731: #define yyFlexLexer zzFlexLexer
                   3732: #include <g++/FlexLexer.h>
                   3733: .Ed
                   3734: .Pp
                   3735: This applies if, for example,
                   3736: .Dq %option prefix="xx"
                   3737: is used for one scanner and
                   3738: .Dq %option prefix="zz"
                   3739: is used for the other.
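.Pp
The renaming works because the preprocessor substitutes the current
definition of
.Fa yyFlexLexer
each time the class declaration is seen.
A minimal sketch of that mechanism follows.
It uses a hypothetical
.Dv STAND_IN_HEADER
macro in place of the real header so that it compiles on its own;
the real class is considerably larger:
.Bd -literal -offset indent
```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the class declaration supplied by the header.
#define STAND_IN_HEADER struct yyFlexLexer { int start_state; };

#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
STAND_IN_HEADER                 // declares struct xxFlexLexer

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
STAND_IN_HEADER                 // declares struct zzFlexLexer

#undef yyFlexLexer

// Both names now refer to separate, unrelated types.
bool lexer_types_are_distinct()
{
    return !std::is_same<xxFlexLexer, zzFlexLexer>::value;
}
```
.Ed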
                   3740: .Pp
                   3741: .Sy IMPORTANT :
                   3742: the present form of the scanning class is experimental
1.7       aaron    3743: and may change considerably between major releases.
1.16      jmc      3744: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
                   3745: .Nm
1.25      sobrado  3746: is a rewrite of the
                   3747: .At
1.16      jmc      3748: .Nm lex
                   3749: tool
                   3750: (the two implementations do not share any code, though),
                   3751: with some extensions and incompatibilities, both of which are of concern
                   3752: to those who wish to write scanners acceptable to either implementation.
                   3753: .Nm
                   3754: is fully compliant with the
                   3755: .Tn POSIX
                   3756: .Nm lex
1.1       deraadt  3757: specification, except that when using
1.16      jmc      3758: .Dq %pointer
                   3759: .Pq the default ,
                   3760: a call to
                   3761: .Fn unput
1.1       deraadt  3762: destroys the contents of
1.16      jmc      3763: .Fa yytext ,
                   3764: which is counter to the
                   3765: .Tn POSIX
                   3766: specification.
                   3767: .Pp
                   3768: In this section we discuss all of the known areas of incompatibility between
                   3769: .Nm ,
1.36      schwarze 3770: .At
1.16      jmc      3771: .Nm lex ,
                   3772: and the
                   3773: .Tn POSIX
                   3774: specification.
                   3775: .Pp
                   3776: .Nm flex Ns 's
                   3777: .Fl l
1.36      schwarze 3778: option turns on maximum compatibility with the original
                   3779: .At
1.16      jmc      3780: .Nm lex
1.1       deraadt  3781: implementation, at the cost of a major loss in the generated scanner's
1.16      jmc      3782: performance.
                   3783: We note below which incompatibilities can be overcome using the
                   3784: .Fl l
1.1       deraadt  3785: option.
1.16      jmc      3786: .Pp
                   3787: .Nm
1.1       deraadt  3788: is fully compatible with
1.16      jmc      3789: .Nm lex
1.1       deraadt  3790: with the following exceptions:
1.16      jmc      3791: .Bl -dash
                   3792: .It
1.1       deraadt  3793: The undocumented
1.16      jmc      3794: .Nm lex
1.1       deraadt  3795: scanner internal variable
1.16      jmc      3796: .Fa yylineno
1.1       deraadt  3797: is not supported unless
1.16      jmc      3798: .Fl l
1.1       deraadt  3799: or
1.16      jmc      3800: .Dq %option yylineno
1.1       deraadt  3801: is used.
1.16      jmc      3802: .Pp
                   3803: .Fa yylineno
1.1       deraadt  3804: should be maintained on a per-buffer basis, rather than a per-scanner
1.16      jmc      3805: .Pq single global variable
                   3806: basis.
                   3807: .Pp
                   3808: .Fa yylineno
                   3809: is not part of the
                   3810: .Tn POSIX
                   3811: specification.
                   3812: .It
1.1       deraadt  3813: The
1.16      jmc      3814: .Fn input
1.1       deraadt  3815: routine is not redefinable, though it may be called to read characters
1.16      jmc      3816: following whatever has been matched by a rule.
                   3817: If
                   3818: .Fn input
                   3819: encounters an end-of-file, the normal
                   3820: .Fn yywrap
                   3821: processing is done.
                   3822: A
                   3823: .Dq real
                   3824: end-of-file is returned by
                   3825: .Fn input
1.1       deraadt  3826: as
1.16      jmc      3827: .Dv EOF .
                   3828: .Pp
1.1       deraadt  3829: Input is instead controlled by defining the
1.16      jmc      3830: .Dv YY_INPUT
1.1       deraadt  3831: macro.
1.16      jmc      3832: .Pp
1.1       deraadt  3833: The
1.16      jmc      3834: .Nm
1.1       deraadt  3835: restriction that
1.16      jmc      3836: .Fn input
                   3837: cannot be redefined is in accordance with the
                   3838: .Tn POSIX
                   3839: specification, which simply does not specify any way of controlling the
1.1       deraadt  3840: scanner's input other than by making an initial assignment to
1.16      jmc      3841: .Fa yyin .
                   3842: .It
1.1       deraadt  3843: The
1.16      jmc      3844: .Fn unput
                   3845: routine is not redefinable.
                   3846: This restriction is in accordance with
                   3847: .Tn POSIX .
                   3848: .It
                   3849: .Nm
1.1       deraadt  3850: scanners are not as reentrant as
1.16      jmc      3851: .Nm lex
                   3852: scanners.
                   3853: In particular, if a scanner is interactive and
                   3854: an interrupt handler long-jumps out of the scanner,
                   3855: and the scanner is subsequently called again,
                   3856: the following error message may be displayed:
                   3857: .Pp
                   3858: .D1 fatal flex scanner internal error--end of buffer missed
                   3859: .Pp
1.1       deraadt  3860: To reenter the scanner, first use
1.16      jmc      3861: .Pp
                   3862: .Dl yyrestart(yyin);
                   3863: .Pp
                   3864: Note that this call will throw away any buffered input;
                   3865: usually this isn't a problem with an interactive scanner.
                   3866: .Pp
                   3867: Also note that flex C++ scanner classes are reentrant,
                   3868: so if using C++ is an option, they should be used instead.
                   3869: See
                   3870: .Sx GENERATING C++ SCANNERS
                   3871: above for details.
                   3872: .It
                   3873: .Fn output
1.1       deraadt  3874: is not supported.
                   3875: Output from the
1.16      jmc      3876: .Em ECHO
1.1       deraadt  3877: macro is done to the file-pointer
1.16      jmc      3878: .Fa yyout
                   3879: .Pq default stdout .
                   3880: .Pp
                   3881: .Fn output
                   3882: is not part of the
                   3883: .Tn POSIX
                   3884: specification.
                   3885: .It
                   3886: .Nm lex
                   3887: does not support exclusive start conditions
                   3888: .Pq %x ,
                   3889: though they are in the
                   3890: .Tn POSIX
                   3891: specification.
                   3892: .It
1.1       deraadt  3893: When definitions are expanded,
1.16      jmc      3894: .Nm
1.1       deraadt  3895: encloses them in parentheses.
1.16      jmc      3896: With
                   3897: .Nm lex ,
                   3898: the following:
                   3899: .Bd -literal -offset indent
                   3900: NAME    [A-Z][A-Z0-9]*
                   3901: %%
                   3902: foo{NAME}?      printf("Found it\en");
                   3903: %%
                   3904: .Ed
                   3905: .Pp
                   3906: will not match the string
                   3907: .Qq foo
                   3908: because when the macro is expanded the rule is equivalent to
                   3909: .Qq foo[A-Z][A-Z0-9]*?
                   3910: and the precedence is such that the
                   3911: .Sq ?\&
                   3912: is associated with
                   3913: .Qq [A-Z0-9]* .
                   3914: With
                   3915: .Nm ,
1.1       deraadt  3916: the rule will be expanded to
1.16      jmc      3917: .Qq foo([A-Z][A-Z0-9]*)?
                   3918: and so the string
                   3919: .Qq foo
                   3920: will match.
                   3921: .Pp
1.1       deraadt  3922: Note that if the definition begins with
1.16      jmc      3923: .Sq ^
1.1       deraadt  3924: or ends with
1.16      jmc      3925: .Sq $
                   3926: then it is not expanded with parentheses, to allow these operators to appear in
                   3927: definitions without losing their special meanings.
                   3928: But the
                   3929: .Sq Aq s ,
                   3930: .Sq / ,
1.1       deraadt  3931: and
1.16      jmc      3932: .Aq Aq EOF
1.1       deraadt  3933: operators cannot be used in a
1.16      jmc      3934: .Nm
1.1       deraadt  3935: definition.
1.16      jmc      3936: .Pp
1.1       deraadt  3937: Using
1.16      jmc      3938: .Fl l
1.1       deraadt  3939: results in the
1.16      jmc      3940: .Nm lex
1.1       deraadt  3941: behavior of no parentheses around the definition.
1.16      jmc      3942: .Pp
                   3943: The
                   3944: .Tn POSIX
                   3945: specification is that the definition be enclosed in parentheses.
                   3946: .It
1.1       deraadt  3947: Some implementations of
1.16      jmc      3948: .Nm lex
                   3949: allow a rule's action to begin on a separate line,
                   3950: if the rule's pattern has trailing whitespace:
                   3951: .Bd -literal -offset indent
                   3952: %%
                   3953: foo|bar<space here>
                   3954:   { foobar_action(); }
                   3955: .Ed
                   3956: .Pp
                   3957: .Nm
1.1       deraadt  3958: does not support this feature.
1.16      jmc      3959: .It
1.1       deraadt  3960: The
1.16      jmc      3961: .Nm lex
                   3962: .Sq %r
                   3963: .Pq generate a Ratfor scanner
                   3964: option is not supported.
                   3965: It is not part of the
                   3966: .Tn POSIX
                   3967: specification.
                   3968: .It
1.1       deraadt  3969: After a call to
1.16      jmc      3970: .Fn unput ,
                   3971: .Fa yytext
                   3972: is undefined until the next token is matched,
                   3973: unless the scanner was built using
                   3974: .Dq %array .
1.1       deraadt  3975: This is not the case with
1.16      jmc      3976: .Nm lex
                   3977: or the
                   3978: .Tn POSIX
                   3979: specification.
                   3980: The
                   3981: .Fl l
1.1       deraadt  3982: option does away with this incompatibility.
1.16      jmc      3983: .It
1.1       deraadt  3984: The precedence of the
1.16      jmc      3985: .Sq {}
                   3986: .Pq numeric range
                   3987: operator is different.
                   3988: .Nm lex
                   3989: interprets
                   3990: .Qq abc{1,3}
                   3991: as match one, two, or three occurrences of
                   3992: .Sq abc ,
                   3993: whereas
                   3994: .Nm
                   3995: interprets it as match
                   3996: .Sq ab
                   3997: followed by one, two, or three occurrences of
                   3998: .Sq c .
                   3999: The latter is in agreement with the
                   4000: .Tn POSIX
                   4001: specification.
                   4002: .It
1.1       deraadt  4003: The precedence of the
1.16      jmc      4004: .Sq ^
1.1       deraadt  4005: operator is different.
1.16      jmc      4006: .Nm lex
                   4007: interprets
                   4008: .Qq ^foo|bar
                   4009: as match either
                   4010: .Sq foo
                   4011: at the beginning of a line, or
                   4012: .Sq bar
                   4013: anywhere, whereas
                   4014: .Nm
                   4015: interprets it as match either
                   4016: .Sq foo
                   4017: or
                   4018: .Sq bar
                   4019: if they come at the beginning of a line.
                   4020: The latter is in agreement with the
                   4021: .Tn POSIX
                   4022: specification.
                   4023: .It
1.1       deraadt  4024: The special table-size declarations such as
1.16      jmc      4025: .Sq %a
1.1       deraadt  4026: supported by
1.16      jmc      4027: .Nm lex
1.1       deraadt  4028: are not required by
1.16      jmc      4029: .Nm
1.1       deraadt  4030: scanners;
1.16      jmc      4031: .Nm
1.1       deraadt  4032: ignores them.
1.16      jmc      4033: .It
1.1       deraadt  4034: The name
1.16      jmc      4035: .Dv FLEX_SCANNER
1.1       deraadt  4036: is #define'd so scanners may be written for use with either
1.16      jmc      4037: .Nm
1.1       deraadt  4038: or
1.16      jmc      4039: .Nm lex .
1.1       deraadt  4040: Scanners also include
1.16      jmc      4041: .Dv YY_FLEX_MAJOR_VERSION
1.1       deraadt  4042: and
1.16      jmc      4043: .Dv YY_FLEX_MINOR_VERSION
1.1       deraadt  4044: indicating which version of
1.16      jmc      4045: .Nm
1.1       deraadt  4046: generated the scanner
1.16      jmc      4047: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1       deraadt  4048: respectively).
1.16      jmc      4049: .El
                   4050: .Pp
1.1       deraadt  4051: The following
1.16      jmc      4052: .Nm
1.1       deraadt  4053: features are not included in
1.16      jmc      4054: .Nm lex
                   4055: or the
                   4056: .Tn POSIX
                   4057: specification:
                   4058: .Bd -unfilled -offset indent
                   4059: C++ scanners
                   4060: %option
                   4061: start condition scopes
                   4062: start condition stacks
                   4063: interactive/non-interactive scanners
                   4064: yy_scan_string() and friends
                   4065: yyterminate()
                   4066: yy_set_interactive()
                   4067: yy_set_bol()
                   4068: YY_AT_BOL()
                   4069: <<EOF>>
                   4070: <*>
                   4071: YY_DECL
                   4072: YY_START
                   4073: YY_USER_ACTION
                   4074: YY_USER_INIT
                   4075: #line directives
                   4076: %{}'s around actions
                   4077: multiple actions on a line
                   4078: .Ed
                   4079: .Pp
                   4080: plus almost all of the
                   4081: .Nm
                   4082: flags.
1.1       deraadt  4083: The last feature in the list refers to the fact that with
1.16      jmc      4084: .Nm
1.37      jmc      4085: multiple actions can be placed on the same line,
1.16      jmc      4086: separated with semicolons, while with
                   4087: .Nm lex ,
1.1       deraadt  4088: the following
1.16      jmc      4089: .Pp
                   4090: .Dl foo    handle_foo(); ++num_foos_seen;
                   4091: .Pp
                   4092: is
                   4093: .Pq rather surprisingly
                   4094: truncated to
                   4095: .Pp
                   4096: .Dl foo    handle_foo();
                   4097: .Pp
                   4098: .Nm
                   4099: does not truncate the action.
                   4100: Actions that are not enclosed in braces
                   4101: are simply terminated at the end of the line.
                   4102: .Sh FILES
                   4103: .Bl -tag -width "<g++/FlexLexer.h>"
1.41      sobrado  4104: .It Pa flex.skl
1.16      jmc      4105: Skeleton scanner.
                   4106: This file is only used when building flex, not when
                   4107: .Nm
                   4108: executes.
1.41      sobrado  4109: .It Pa lex.backup
1.16      jmc      4110: Backing-up information for the
                   4111: .Fl b
                   4112: flag (called
                   4113: .Pa lex.bck
                   4114: on some systems).
1.41      sobrado  4115: .It Pa lex.yy.c
1.16      jmc      4116: Generated scanner
                   4117: (called
                   4118: .Pa lexyy.c
                   4119: on some systems).
1.41      sobrado  4120: .It Pa lex.yy.cc
1.16      jmc      4121: Generated C++ scanner class, when using
                   4122: .Fl + .
1.38      bentley  4123: .It In g++/FlexLexer.h
1.16      jmc      4124: Header file defining the C++ scanner base class,
                   4125: .Fa FlexLexer ,
                   4126: and its derived class,
                   4127: .Fa yyFlexLexer .
1.41      sobrado  4128: .It Pa /usr/lib/libl.*
1.16      jmc      4129: .Nm
                   4130: libraries.
                   4131: The
                   4132: .Pa /usr/lib/libfl.*\&
                   4133: libraries are links to these.
                   4134: Scanners must be linked using either
                   4135: .Fl \&ll
                   4136: or
                   4137: .Fl lfl .
                   4138: .El
1.29      jmc      4139: .Sh EXIT STATUS
                   4140: .Ex -std flex
1.16      jmc      4141: .Sh DIAGNOSTICS
                   4142: .Bl -diag
                   4143: .It warning, rule cannot be matched
                   4144: Indicates that the given rule cannot be matched because it follows other rules
                   4145: that will always match the same text as it.
                   4146: For example, in the following
                   4147: .Dq foo
                   4148: cannot be matched because it comes after an identifier
                   4149: .Qq catch-all
                   4150: rule:
                   4151: .Bd -literal -offset indent
                   4152: [a-z]+    got_identifier();
                   4153: foo       got_foo();
                   4154: .Ed
                   4155: .Pp
1.1       deraadt  4156: Using
1.16      jmc      4157: .Em REJECT
1.1       deraadt  4158: in a scanner suppresses this warning.
1.16      jmc      4159: .It "warning, \-s option given but default rule can be matched"
                   4160: Means that it is possible
                   4161: .Pq perhaps only in a particular start condition
                   4162: that the default rule
                   4163: .Pq match any single character
                   4164: is the only one that will match a particular input.
                   4165: Since
                   4166: .Fl s
1.1       deraadt  4167: was given, presumably this is not intended.
1.16      jmc      4168: .It reject_used_but_not_detected undefined
                   4169: .It yymore_used_but_not_detected undefined
                   4170: These errors can occur at compile time.
                   4171: They indicate that the scanner uses
                   4172: .Em REJECT
1.1       deraadt  4173: or
1.16      jmc      4174: .Fn yymore
1.1       deraadt  4175: but that
1.16      jmc      4176: .Nm
1.1       deraadt  4177: failed to notice the fact, meaning that
1.16      jmc      4178: .Nm
1.1       deraadt  4179: scanned the first two sections looking for occurrences of these actions
1.16      jmc      4180: and failed to find any, but somehow they snuck in
                   4181: .Pq via an #include file, for example .
                   4182: Use
                   4183: .Dq %option reject
                   4184: or
                   4185: .Dq %option yymore
                   4186: to indicate to
                   4187: .Nm
                   4188: that these features are really needed.
                   4189: .It flex scanner jammed
                   4190: A scanner compiled with
                   4191: .Fl s
                   4192: has encountered an input string which wasn't matched by any of its rules.
                   4193: This error can also occur due to internal problems.
                   4194: .It token too large, exceeds YYLMAX
                   4195: The scanner uses
                   4196: .Dq %array
1.1       deraadt  4197: and one of its rules matched a string longer than the
1.16      jmc      4198: .Dv YYLMAX
                   4199: constant
                   4200: .Pq 8K bytes by default .
                   4201: The value can be increased by #define'ing
                   4202: .Dv YYLMAX
                   4203: in the definitions section of
                   4204: .Nm
1.1       deraadt  4205: input.
1.16      jmc      4206: .It "scanner requires \-8 flag to use the character 'x'"
                   4207: The scanner specification includes recognizing the 8-bit character
                   4208: .Sq x
                   4209: and the
                   4210: .Fl 8
                   4211: flag was not specified, and defaulted to 7-bit because the
                   4212: .Fl Cf
                   4213: or
                   4214: .Fl CF
                   4215: table compression options were used.
                   4216: See the discussion of the
                   4217: .Fl 7
1.1       deraadt  4218: flag for details.
1.16      jmc      4219: .It flex scanner push-back overflow
                   4220: unput() was used to push back so much text that the scanner's buffer
                   4221: could not hold both the pushed-back text and the current token in
                   4222: .Fa yytext .
                   4223: Ideally the scanner should dynamically resize the buffer in this case,
                   4224: but at present it does not.
                   4225: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
                   4226: The scanner was working on matching an extremely large token and needed
                   4227: to expand the input buffer.
                   4228: This doesn't work with scanners that use
                   4229: .Em REJECT .
                   4230: .It "fatal flex scanner internal error--end of buffer missed"
1.1       deraadt  4231: This can occur in a scanner which is reentered after a long-jump
1.16      jmc      4232: has jumped out
                   4233: .Pq or over
                   4234: the scanner's activation frame.
                   4235: Before reentering the scanner, use:
                   4236: .Pp
                   4237: .Dl yyrestart(yyin);
                   4238: .Pp
1.1       deraadt  4239: or, as noted above, switch to using the C++ scanner class.
1.16      jmc      4240: .It "too many start conditions in <> construct!"
                   4241: More start conditions than exist were listed in a <> construct
                   4242: (so at least one of them must have been listed twice).
                   4243: .El
                   4244: .Sh SEE ALSO
                   4245: .Xr awk 1 ,
                   4246: .Xr sed 1 ,
                   4247: .Xr yacc 1
                   4248: .Rs
                   4249: .%A John Levine
                   4250: .%A Tony Mason
                   4251: .%A Doug Brown
                   4252: .%B Lex & Yacc
                   4253: .%I O'Reilly and Associates
                   4254: .%N 2nd edition
                   4255: .Re
                   4256: .Rs
                   4257: .%A Alfred Aho
                   4258: .%A Ravi Sethi
                   4259: .%A Jeffrey Ullman
                   4260: .%B Compilers: Principles, Techniques and Tools
                   4261: .%I Addison-Wesley
                   4262: .%D 1986
                   4263: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
                   4264: .Re
1.23      jmc      4265: .Sh STANDARDS
                   4266: The
                   4267: .Nm lex
                   4268: utility is compliant with the
                   4269: .St -p1003.1-2008
                   4270: specification,
                   4271: though its presence is optional.
                   4272: .Pp
                   4273: The flags
1.31      jmc      4274: .Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
1.23      jmc      4275: .Op Fl -help ,
                   4276: and
                   4277: .Op Fl -version
                   4278: are extensions to that specification.
1.37      jmc      4279: .Pp
                   4280: See also the
                   4281: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
                   4282: section, above.
1.16      jmc      4283: .Sh AUTHORS
1.1       deraadt  4284: Vern Paxson, with the help of many ideas and much inspiration from
1.16      jmc      4285: Van Jacobson.
                   4286: Original version by Jef Poskanzer.
                   4287: The fast table representation is a partial implementation of a design done by
                   4288: Van Jacobson.
                   4289: The implementation was done by Kevin Gong and Vern Paxson.
                   4290: .Pp
1.1       deraadt  4291: Thanks to the many
1.16      jmc      4292: .Nm
1.1       deraadt  4293: beta-testers, feedbackers, and contributors, especially Francois Pinard,
                   4294: Casey Leedom,
                   4295: Robert Abramovitz,
                   4296: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
1.39      bentley  4297: Neal Becker, Nelson H.F. Beebe,
                   4298: .Mt benson@odi.com ,
1.1       deraadt  4299: Karl Berry, Peter A. Bigot, Simon Blanchard,
                   4300: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
                   4301: Brian Clapper, J.T. Conklin,
                   4302: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11      deraadt  4303: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1       deraadt  4304: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
                   4305: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
                   4306: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
                   4307: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
                   4308: Jan Hajic, Charles Hemphill, NORO Hideo,
                   4309: Jarkko Hietaniemi, Scott Hofmann,
                   4310: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
                   4311: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
                   4312: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
1.39      bentley  4313: Amir Katz,
                   4314: .Mt ken@ken.hilco.com ,
                   4315: Kevin B. Kenny,
1.1       deraadt  4316: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
                   4317: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
                   4318: David Loffredo, Mike Long,
                   4319: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
                   4320: Bengt Martensson, Chris Metcalf,
                   4321: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
                   4322: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
                   4323: Richard Ohnemus, Karsten Pahnke,
1.16      jmc      4324: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
                   4325: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1       deraadt  4326: Frederic Raimbault, Pat Rankin, Rick Richardson,
                   4327: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
                   4328: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
                   4329: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
                   4330: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
                   4331: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
                   4332: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16      jmc      4333: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
                   4334: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
                   4335: and those whose names have slipped my marginal mail-archiving skills
                   4336: but whose contributions are appreciated all the
1.1       deraadt  4337: same.
1.16      jmc      4338: .Pp
1.1       deraadt  4339: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
                   4340: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
                   4341: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
                   4342: distribution headaches.
1.16      jmc      4343: .Pp
                   4344: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
                   4345: to Benson Margulies and Fred Burke for C++ support;
                   4346: to Kent Williams and Tom Epperly for C++ class support;
                   4347: to Ove Ewerlid for support of NUL's;
                   4348: and to Eric Hughes for support of multiple buffers.
                   4349: .Pp
1.1       deraadt  4350: This work was primarily done when I was with the Real Time Systems Group
1.16      jmc      4351: at the Lawrence Berkeley Laboratory in Berkeley, CA.
                   4352: Many thanks to all there for the support I received.
                   4353: .Pp
                   4354: Send comments to
1.34      schwarze 4355: .Aq Mt vern@ee.lbl.gov .
1.16      jmc      4356: .Sh BUGS
                   4357: Some trailing context patterns cannot be properly matched and generate
                   4358: warning messages
                   4359: .Pq "dangerous trailing context" .
                   4360: These are patterns where the ending of the first part of the rule
                   4361: matches the beginning of the second part, such as
                   4362: .Qq zx*/xy* ,
                   4363: where the
                   4364: .Sq x*
                   4365: matches the
                   4366: .Sq x
                   4367: at the beginning of the trailing context.
                   4368: (Note that the POSIX draft states that the text matched by such patterns
                   4369: is undefined.)
                   4370: .Pp
                   4371: For some trailing context rules, parts which are actually fixed-length are
                   4372: not recognized as such, leading to the above-mentioned performance loss.
                   4373: In particular, parts using
                   4374: .Sq |\&
                   4375: or
                   4376: .Sq {n}
                   4377: (such as
                   4378: .Qq foo{3} )
                   4379: are always considered variable-length.
                   4380: .Pp
                   4381: Combining trailing context with the special
                   4382: .Sq |\&
                   4383: action can result in fixed trailing context being turned into
                   4384: the more expensive variable trailing context.
                   4385: For example, in the following:
                   4386: .Bd -literal -offset indent
                   4387: %%
                   4388: abc      |
                   4389: xyz/def
                   4390: .Ed
                   4391: .Pp
                   4392: Use of
                   4393: .Fn unput
                   4394: invalidates yytext and yyleng, unless the
                   4395: .Dq %array
                   4396: directive
                   4397: or the
                   4398: .Fl l
                   4399: option has been used.
                   4400: .Pp
                   4401: Pattern-matching of NUL's is substantially slower than matching other
                   4402: characters.
                   4403: .Pp
                   4404: Dynamic resizing of the input buffer is slow, as it entails rescanning
                   4405: all the text matched so far by the current
                   4406: .Pq generally huge
                   4407: token.
                   4408: .Pp
                   4409: Due to both buffering of input and read-ahead,
                   4410: it is not possible to intermix calls to
1.38      bentley  4411: .In stdio.h
1.16      jmc      4412: routines such as
                   4413: .Fn getchar ,
                   4414: with
                   4415: .Nm
                   4416: rules and expect it to work.
                   4417: Call
                   4418: .Fn input
                   4419: instead.
                   4420: .Pp
                   4421: The total table entries listed by the
                   4422: .Fl v
                   4423: flag exclude the number of table entries needed to determine
                   4424: what rule has been matched.
                   4425: The number of entries is equal to the number of DFA states
                   4426: if the scanner does not use
                   4427: .Em REJECT ,
                   4428: and somewhat greater than the number of states if it does.
                   4429: .Pp
                   4430: .Em REJECT
                   4431: cannot be used with the
                   4432: .Fl f
                   4433: or
                   4434: .Fl F
                   4435: options.
                   4436: .Pp
                   4437: The
                   4438: .Nm
                   4439: internal algorithms need documentation.