Annotation of src/usr.bin/lex/flex.1, Revision 1.16
1.16 ! jmc 1: .\" $OpenBSD: flex.1,v 1.15 2003/10/07 19:41:31 tedu Exp $
! 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 ! jmc 31: .\"
! 32: .Dd April 1, 1995
! 33: .Dt FLEX 1
! 34: .Os
! 35: .Sh NAME
! 36: .Nm flex
! 37: .Nd fast lexical analyzer generator
! 38: .Sh SYNOPSIS
! 39: .Nm
! 40: .Op Fl 78BbcdFfhIiLlnpsTtVvw+?
! 41: .Op Fl C Ns Op Cm aeFfmr
! 42: .Op Fl Fl help
! 43: .Op Fl Fl version
! 44: .Sm off
! 45: .Op Fl o Ar output
! 46: .Op Fl P Ar prefix
! 47: .Op Fl S Ar skeleton
! 48: .Op Ar filename ...
! 49: .Sm on
! 50: .Sh OVERVIEW
1.1 deraadt 51: This manual describes
1.16 ! jmc 52: .Nm ,
! 53: a tool for generating programs that perform pattern-matching on text.
! 54: The manual includes both tutorial and reference sections:
! 55: .Bl -ohang
! 56: .It Sy Description
! 57: A brief overview of the tool.
! 58: .It Sy Some Simple Examples
! 59: .It Sy Format of the Input File
! 60: .It Sy Patterns
! 61: The extended regular expressions used by
! 62: .Nm .
! 63: .It Sy How the Input is Matched
! 64: The rules for determining what has been matched.
! 65: .It Sy Actions
! 66: How to specify what to do when a pattern is matched.
! 67: .It Sy The Generated Scanner
! 68: Details regarding the scanner that
! 69: .Nm
! 70: produces;
! 71: how to control the input source.
! 72: .It Sy Start Conditions
! 73: Introducing context into scanners, and managing
! 74: .Qq mini-scanners .
! 75: .It Sy Multiple Input Buffers
! 76: How to manipulate multiple input sources;
! 77: how to scan from strings instead of files.
! 78: .It Sy End-of-File Rules
! 79: Special rules for matching the end of the input.
! 80: .It Sy Miscellaneous Macros
! 81: A summary of macros available to the actions.
! 82: .It Sy Values Available to the User
! 83: A summary of values available to the actions.
! 84: .It Sy Interfacing with Yacc
! 85: Connecting flex scanners together with
! 86: .Xr yacc 1
! 87: parsers.
! 88: .It Sy Options
! 89: .Nm
! 90: command-line options, and the
! 91: .Dq %option
! 92: directive.
! 93: .It Sy Performance Considerations
! 94: How to make scanners go as fast as possible.
! 95: .It Sy Generating C++ Scanners
! 96: The
! 97: .Pq experimental
! 98: facility for generating C++ scanner classes.
! 99: .It Sy Incompatibilities with Lex and POSIX
! 100: How
! 101: .Nm
! 102: differs from AT&T lex and the
! 103: .Tn POSIX
! 104: lex standard.
! 105: .It Sy Files
! 106: Files used by
! 107: .Nm .
! 108: .It Sy Diagnostics
! 109: Those error messages produced by
! 110: .Nm
! 111: .Pq or scanners it generates
! 112: whose meanings might not be apparent.
! 113: .It Sy See Also
! 114: Other documentation, related tools.
! 115: .It Sy Authors
! 116: Includes contact information.
! 117: .It Sy Bugs
! 118: Known problems with
! 119: .Nm .
! 120: .El
! 121: .Sh DESCRIPTION
! 122: .Nm
1.1 deraadt 123: is a tool for generating
1.16 ! jmc 124: .Em scanners :
1.9 millert 125: programs which recognize lexical patterns in text.
1.16 ! jmc 126: .Nm
! 127: reads the given input files, or its standard input if no file names are given,
! 128: for a description of a scanner to generate.
! 129: The description is in the form of pairs of regular expressions and C code,
! 130: called
! 131: .Em rules .
! 132: .Nm
1.1 deraadt 133: generates as output a C source file,
1.16 ! jmc 134: .Pa lex.yy.c ,
1.1 deraadt 135: which defines a routine
1.16 ! jmc 136: .Fn yylex .
1.1 deraadt 137: This file is compiled and linked with the
1.16 ! jmc 138: .Fl lfl
! 139: library to produce an executable.
! 140: When the executable is run, it analyzes its input for occurrences
! 141: of the regular expressions.
! 142: Whenever it finds one, it executes the corresponding C code.
! 143: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 144: First, some simple examples to give a flavor of how one uses
1.16 ! jmc 145: .Nm .
1.1 deraadt 146: The following
1.16 ! jmc 147: .Nm
1.1 deraadt 148: input specifies a scanner that replaces each occurrence of the string
1.16 ! jmc 149: .Qq username
! 150: with the user's login name:
! 151: .Bd -literal -offset indent
! 152: %%
! 153: username printf("%s", getlogin());
! 154: .Ed
! 155: .Pp
1.1 deraadt 156: By default, any text not matched by a
1.16 ! jmc 157: .Nm
! 158: scanner is copied to the output, so the net effect of this scanner is
! 159: to copy its input file to its output with each occurrence of
! 160: .Qq username
! 161: expanded.
! 162: In this input, there is just one rule.
! 163: .Qq username
! 164: is the
! 165: .Em pattern
! 166: and the
! 167: .Qq printf
! 168: is the
! 169: .Em action .
! 170: The
! 171: .Qq %%
! 172: marks the beginning of the rules.
! 173: .Pp
1.1 deraadt 174: Here's another simple example:
1.16 ! jmc 175: .Bd -literal -offset indent
! 176: int num_lines = 0, num_chars = 0;
1.1 deraadt 177:
1.16 ! jmc 178: %%
! 179: \en ++num_lines; ++num_chars;
! 180: \&. ++num_chars;
! 181:
! 182: %%
! 183: int main(void)
! 184: {
! 185:     yylex();
! 186:     printf("# of lines = %d, # of chars = %d\en",
! 187:         num_lines, num_chars);
! 188: }
! 189: .Ed
! 190: .Pp
1.1 deraadt 191: This scanner counts the number of characters and the number
1.16 ! jmc 192: of lines in its input
! 193: (it produces no output other than the final report on the counts).
! 194: The first line declares two globals,
! 195: .Qq num_lines
! 196: and
! 197: .Qq num_chars ,
! 198: which are accessible both inside
! 199: .Fn yylex
1.1 deraadt 200: and in the
1.16 ! jmc 201: .Fn main
! 202: routine declared after the second
! 203: .Qq %% .
! 204: There are two rules, one which matches a newline
! 205: .Pq \&"\en\&"
! 206: and increments both the line count and the character count,
! 207: and one which matches any character other than a newline
! 208: (indicated by the
! 209: .Qq \&.
! 210: regular expression).
! 211: .Pp
1.1 deraadt 212: A somewhat more complicated example:
1.16 ! jmc 213: .Bd -literal -offset indent
! 214: /* scanner for a toy Pascal-like language */
1.1 deraadt 215:
1.16 ! jmc 216: %{
! 217: /* need this for the call to atof() below */
! 218: #include <stdlib.h>
! 219: %}
1.1 deraadt 220:
1.16 ! jmc 221: DIGIT [0-9]
! 222: ID [a-z][a-z0-9]*
1.1 deraadt 223:
1.16 ! jmc 224: %%
1.1 deraadt 225:
1.16 ! jmc 226: {DIGIT}+ {
! 227: printf("An integer: %s (%d)\en", yytext,
! 228: atoi(yytext));
! 229: }
1.1 deraadt 230:
1.16 ! jmc 231: {DIGIT}+"."{DIGIT}* {
! 232: printf("A float: %s (%g)\en", yytext,
! 233: atof(yytext));
! 234: }
1.1 deraadt 235:
1.16 ! jmc 236: if|then|begin|end|procedure|function {
! 237: printf("A keyword: %s\en", yytext);
! 238: }
1.1 deraadt 239:
1.16 ! jmc 240: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 241:
1.16 ! jmc 242: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 243:
1.16 ! jmc 244: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 245:
1.16 ! jmc 246: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 247:
1.16 ! jmc 248: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 249:
1.16 ! jmc 250: %%
1.1 deraadt 251:
1.16 ! jmc 252: int main(int argc, char *argv[])
! 253: {
! 254:     ++argv; --argc; /* skip over program name */
! 255:     if (argc > 0)
! 256:         yyin = fopen(argv[0], "r");
1.1 deraadt 257:     else
258:         yyin = stdin;
1.7 aaron 259:
1.1 deraadt 260:     yylex();
1.16 ! jmc 261: }
! 262: .Ed
! 263: .Pp
! 264: This is the beginning of a simple scanner for a language like Pascal.
! 265: It identifies different types of
! 266: .Em tokens
1.1 deraadt 267: and reports on what it has seen.
1.16 ! jmc 268: .Pp
! 269: The details of this example will be explained in the following sections.
! 270: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 271: The
1.16 ! jmc 272: .Nm
1.1 deraadt 273: input file consists of three sections, separated by a line with just
1.16 ! jmc 274: .Qq %%
1.1 deraadt 275: in it:
1.16 ! jmc 276: .Bd -unfilled -offset indent
! 277: definitions
! 278: %%
! 279: rules
! 280: %%
! 281: user code
! 282: .Ed
! 283: .Pp
1.1 deraadt 284: The
1.16 ! jmc 285: .Em definitions
1.1 deraadt 286: section contains declarations of simple
1.16 ! jmc 287: .Em name
1.1 deraadt 288: definitions to simplify the scanner specification, and declarations of
1.16 ! jmc 289: .Em start conditions ,
1.1 deraadt 290: which are explained in a later section.
1.16 ! jmc 291: .Pp
1.1 deraadt 292: Name definitions have the form:
1.16 ! jmc 293: .Pp
! 294: .D1 name definition
! 295: .Pp
! 296: The
! 297: .Qq name
! 298: is a word beginning with a letter or an underscore
! 299: .Pq Sq _
! 300: followed by zero or more letters, digits,
! 301: .Sq _ ,
! 302: or
! 303: .Sq -
! 304: .Pq dash .
1.8 aaron 305: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 306: following the name and to continue to the end of the line.
1.16 ! jmc 307: The definition can subsequently be referred to using
! 308: .Qq {name} ,
! 309: which will expand to
! 310: .Qq (definition) .
! 311: For example:
! 312: .Bd -literal -offset indent
! 313: DIGIT [0-9]
! 314: ID [a-z][a-z0-9]*
! 315: .Ed
! 316: .Pp
! 317: This defines
! 318: .Qq DIGIT
! 319: to be a regular expression which matches a single digit, and
! 320: .Qq ID
! 321: to be a regular expression which matches a letter
1.1 deraadt 322: followed by zero-or-more letters-or-digits.
323: A subsequent reference to
1.16 ! jmc 324: .Pp
! 325: .Dl {DIGIT}+"."{DIGIT}*
! 326: .Pp
1.1 deraadt 327: is identical to
1.16 ! jmc 328: .Pp
! 329: .Dl ([0-9])+"."([0-9])*
! 330: .Pp
! 331: and matches one-or-more digits followed by a
! 332: .Sq .\&
! 333: followed by zero-or-more digits.
! 334: .Pp
1.1 deraadt 335: The
1.16 ! jmc 336: .Em rules
1.1 deraadt 337: section of the
1.16 ! jmc 338: .Nm
1.1 deraadt 339: input contains a series of rules of the form:
1.16 ! jmc 340: .Pp
! 341: .D1 pattern action
! 342: .Pp
! 343: The pattern must be unindented and the action must begin
1.1 deraadt 344: on the same line.
1.16 ! jmc 345: .Pp
1.1 deraadt 346: See below for a further description of patterns and actions.
1.16 ! jmc 347: .Pp
1.1 deraadt 348: Finally, the user code section is simply copied to
1.16 ! jmc 349: .Pa lex.yy.c
1.1 deraadt 350: verbatim.
1.16 ! jmc 351: It is used for companion routines which call or are called by the scanner.
! 352: The presence of this section is optional;
1.1 deraadt 353: if it is missing, the second
1.16 ! jmc 354: .Qq %%
! 355: in the input file may be skipped too.
! 356: .Pp
! 357: In the definitions and rules sections, any indented text or text enclosed in
! 358: .Sq %{
1.1 deraadt 359: and
1.16 ! jmc 360: .Sq %}
! 361: is copied verbatim to the output
! 362: .Pq with the %{}'s removed .
1.1 deraadt 363: The %{}'s must appear unindented on lines by themselves.
1.16 ! jmc 364: .Pp
1.1 deraadt 365: In the rules section,
1.16 ! jmc 366: any indented or %{} text appearing before the first rule may be used to
! 367: declare variables which are local to the scanning routine and
! 368: .Pq after the declarations
1.1 deraadt 369: code which is to be executed whenever the scanning routine is entered.
370: Other indented or %{} text in the rules section is still copied to the output,
371: but its meaning is not well-defined and it may well cause compile-time
372: errors (this feature is present for
1.16 ! jmc 373: .Tn POSIX
1.1 deraadt 374: compliance; see below for other such features).
1.16 ! jmc 375: .Pp
! 376: In the definitions section
! 377: .Pq but not in the rules section ,
! 378: an unindented comment
! 379: (i.e., a line beginning with
! 380: .Qq /* )
! 381: is also copied verbatim to the output up to the next
! 382: .Qq */ .
! 383: .Sh PATTERNS
1.1 deraadt 384: The patterns in the input are written using an extended set of regular
1.16 ! jmc 385: expressions.
! 386: These are:
! 387: .Bl -tag -width "XXXXXXXX"
! 388: .It x
! 389: Match the character
! 390: .Sq x .
! 391: .It .\&
! 392: Any character
! 393: .Pq byte
! 394: except newline.
! 395: .It [xyz]
! 396: A
! 397: .Qq character class ;
! 398: in this case, the pattern matches either an
! 399: .Sq x ,
! 400: a
! 401: .Sq y ,
! 402: or a
! 403: .Sq z .
! 404: .It [abj-oZ]
! 405: A
! 406: .Qq character class
! 407: with a range in it; matches an
! 408: .Sq a ,
! 409: a
! 410: .Sq b ,
! 411: any letter from
! 412: .Sq j
! 413: through
! 414: .Sq o ,
! 415: or a
! 416: .Sq Z .
! 417: .It [^A-Z]
! 418: A
! 419: .Qq negated character class ,
! 420: i.e., any character but those in the class.
! 421: In this case, any character EXCEPT an uppercase letter.
! 422: .It [^A-Z\en]
! 423: Any character EXCEPT an uppercase letter or a newline.
! 424: .It r*
! 425: Zero or more r's, where
! 426: .Sq r
! 427: is any regular expression.
! 428: .It r+
! 429: One or more r's.
! 430: .It r?
! 431: Zero or one r's (that is,
! 432: .Qq an optional r ) .
! 433: .It r{2,5}
! 434: Anywhere from two to five r's.
! 435: .It r{2,}
! 436: Two or more r's.
! 437: .It r{4}
! 438: Exactly 4 r's.
! 439: .It {name}
! 440: The expansion of the
! 441: .Qq name
! 442: definition
! 443: .Pq see above .
! 444: .It \&"[xyz]\e\&"foo\&"
! 445: The literal string: [xyz]"foo.
! 446: .It \eX
! 447: If
! 448: .Sq X
! 449: is an
! 450: .Sq a ,
! 451: .Sq b ,
! 452: .Sq f ,
! 453: .Sq n ,
! 454: .Sq r ,
! 455: .Sq t ,
! 456: or
! 457: .Sq v ,
! 458: then the ANSI-C interpretation of
! 459: .Sq \eX .
! 460: Otherwise, a literal
! 461: .Sq X
! 462: (used to escape operators such as
! 463: .Sq * ) .
! 464: .It \e0
! 465: A NUL character
! 466: .Pq ASCII code 0 .
! 467: .It \e123
! 468: The character with octal value 123.
! 469: .It \ex2a
! 470: The character with hexadecimal value 2a.
! 471: .It (r)
! 472: Match an
! 473: .Sq r ;
! 474: parentheses are used to override precedence
! 475: .Pq see below .
! 476: .It rs
! 477: The regular expression
! 478: .Sq r
! 479: followed by the regular expression
! 480: .Sq s ;
! 481: called
! 482: .Qq concatenation .
! 483: .It r|s
! 484: Either an
! 485: .Sq r
! 486: or an
! 487: .Sq s .
! 488: .It r/s
! 489: An
! 490: .Sq r ,
! 491: but only if it is followed by an
! 492: .Sq s .
! 493: The text matched by
! 494: .Sq s
! 495: is included when determining whether this rule is the
! 496: .Qq longest match ,
! 497: but is then returned to the input before the action is executed.
! 498: So the action only sees the text matched by
! 499: .Sq r .
! 500: This type of pattern is called
! 501: .Qq trailing context .
! 502: (There are some combinations of r/s that
! 503: .Nm
! 504: cannot match correctly; see notes in the
! 505: .Sx BUGS
! 506: section below regarding
! 507: .Qq dangerous trailing context . )
! 508: .It ^r
! 509: An
! 510: .Sq r ,
! 511: but only at the beginning of a line
! 512: (i.e., just starting to scan, or right after a newline has been scanned).
! 513: .It r$
! 514: An
! 515: .Sq r ,
! 516: but only at the end of a line
! 517: .Pq i.e., just before a newline .
! 518: Equivalent to
! 519: .Qq r/\en .
! 520: .Pp
! 521: Note that
! 522: .Nm flex Ns 's
! 523: notion of
! 524: .Qq newline
! 525: is exactly whatever the C compiler used to compile
! 526: .Nm
! 527: interprets
! 528: .Sq \en
! 529: as.
! 530: .\" In particular, on some DOS systems you must either filter out \er's in the
! 531: .\" input yourself, or explicitly use r/\er\en for
! 532: .\" .Qq r$ .
! 533: .It <s>r
! 534: An
! 535: .Sq r ,
! 536: but only in start condition
! 537: .Sq s
! 538: .Pq see below for discussion of start conditions .
! 539: .It <s1,s2,s3>r
! 540: The same, but in any of start conditions s1, s2, or s3.
! 541: .It <*>r
! 542: An
! 543: .Sq r
! 544: in any start condition, even an exclusive one.
! 545: .It <<EOF>>
! 546: An end-of-file.
! 547: .It <s1,s2><<EOF>>
! 548: An end-of-file when in start condition s1 or s2.
! 549: .El
! 550: .Pp
1.1 deraadt 551: Note that inside of a character class, all regular expression operators
1.16 ! jmc 552: lose their special meaning except escape
! 553: .Pq Sq \e
! 554: and the character class operators,
! 555: .Sq - ,
! 556: .Sq ]\& ,
! 557: and, at the beginning of the class,
! 558: .Sq ^ .
! 559: .Pp
1.1 deraadt 560: The regular expressions listed above are grouped according to
561: precedence, from highest precedence at the top to lowest at the bottom.
1.16 ! jmc 562: Those grouped together have equal precedence.
! 563: For example,
! 564: .Pp
! 565: .D1 foo|bar*
! 566: .Pp
1.1 deraadt 567: is the same as
1.16 ! jmc 568: .Pp
! 569: .D1 (foo)|(ba(r*))
! 570: .Pp
! 571: since the
! 572: .Sq *
! 573: operator has higher precedence than concatenation,
! 574: and concatenation higher than alternation
! 575: .Pq Sq |\& .
! 576: This pattern therefore matches
! 577: .Em either
! 578: the string
! 579: .Qq foo
! 580: .Em or
! 581: the string
! 582: .Qq ba
! 583: followed by zero-or-more r's.
! 584: To match
! 585: .Qq foo
! 586: or zero-or-more "bar"'s,
! 587: use:
! 588: .Pp
! 589: .D1 foo|(bar)*
! 590: .Pp
1.1 deraadt 591: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 ! jmc 592: .Pp
! 593: .D1 (foo|bar)*
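.Pp
The difference between the first two patterns can be sketched in plain C.
The two functions below are hand-written illustrations, not code produced by
.Nm ;
their names are hypothetical, and each tests whether a whole string matches.
.Bd -literal -offset indent
```c
#include <string.h>

/* Hand-written check for foo|bar*, which parses as (foo)|(ba(r*)). */
int
match_foo_or_ba_rstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return 1;
    if (strncmp(s, "ba", 2) != 0)
        return 0;
    /* After "ba", only 'r' characters may follow. */
    return s[2 + strspn(s + 2, "r")] == 0;
}

/* Hand-written check for foo|(bar)*: "foo" or repeated "bar". */
int
match_foo_or_barstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return 1;
    /* Consume as many whole "bar" units as are present. */
    while (strncmp(s, "bar", 3) == 0)
        s += 3;
    return *s == 0;
}
```
.Ed
.Pp
Under these assumptions, the string
.Qq barrr
satisfies only the first function and
.Qq barbar
only the second.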
! 594: .Pp
1.1 deraadt 595: In addition to characters and ranges of characters, character classes
596: can also contain character class
1.16 ! jmc 597: .Em expressions .
1.1 deraadt 598: These are expressions enclosed inside
1.16 ! jmc 599: .Sq [:
! 600: and
! 601: .Sq :]
! 602: delimiters (which themselves must appear between the
! 603: .Sq [
1.1 deraadt 604: and
1.16 ! jmc 605: .Sq ]\&
! 606: of the
1.1 deraadt 607: character class; other elements may occur inside the character class, too).
608: The valid expressions are:
1.16 ! jmc 609: .Bd -unfilled -offset indent
! 610: [:alnum:] [:alpha:] [:blank:]
! 611: [:cntrl:] [:digit:] [:graph:]
! 612: [:lower:] [:print:] [:punct:]
! 613: [:space:] [:upper:] [:xdigit:]
! 614: .Ed
! 615: .Pp
1.1 deraadt 616: These expressions all designate a set of characters equivalent to
617: the corresponding standard C
1.16 ! jmc 618: .Fn isXXX
! 619: function.
! 620: For example, [:alnum:] designates those characters for which
! 621: .Xr isalnum 3
! 622: returns true \- i.e., any alphabetic or numeric character.
1.1 deraadt 623: Some systems don't provide
1.16 ! jmc 624: .Xr isblank 3 ,
! 625: so
! 626: .Nm
! 627: defines [:blank:] as a blank or a tab.
! 628: .Pp
1.1 deraadt 629: For example, the following character classes are all equivalent:
1.16 ! jmc 630: .Bd -unfilled -offset indent
! 631: [[:alnum:]]
! 632: [[:alpha:][:digit:]]
! 633: [[:alpha:]0-9]
! 634: [a-zA-Z0-9]
! 635: .Ed
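.Pp
The equivalence of these classes can be checked with the C library
classification functions.
The following is a minimal sketch which assumes the default C locale and an
ASCII character set; the function name is hypothetical.
.Bd -literal -offset indent
```c
#include <ctype.h>

/*
 * Return 1 if the three ways of writing the class agree for every
 * 7-bit character, 0 otherwise.  Assumes the "C" locale and ASCII.
 */
int
alnum_classes_agree(void)
{
    int c;

    for (c = 0; c < 128; c++) {
        /* [[:alnum:]] */
        int by_alnum = isalnum(c) != 0;
        /* [[:alpha:][:digit:]] */
        int by_parts = isalpha(c) != 0 || isdigit(c) != 0;
        /* [a-zA-Z0-9] */
        int by_range = (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

        if (by_alnum != by_parts || by_parts != by_range)
            return 0;
    }
    return 1;
}
```
.Ed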
! 636: .Pp
! 637: If the scanner is case-insensitive (the
! 638: .Fl i
! 639: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
! 640: .Pp
1.1 deraadt 641: Some notes on patterns:
1.16 ! jmc 642: .Bl -dash
! 643: .It
! 644: A negated character class such as the example
! 645: .Qq [^A-Z]
! 646: above will match a newline unless "\en"
! 647: .Pq or an equivalent escape sequence
! 648: is one of the characters explicitly present in the negated character class
! 649: (e.g.,
! 650: .Qq [^A-Z\en] ) .
! 651: This is unlike how many other regular expression tools treat negated character
! 652: classes, but unfortunately the inconsistency is historically entrenched.
! 653: Matching newlines means that a pattern like
! 654: .Qq [^"]*
! 655: can match the entire input unless there's another quote in the input.
! 656: .It
! 657: A rule can have at most one instance of trailing context
! 658: (the
! 659: .Sq /
! 660: operator or the
! 661: .Sq $
! 662: operator).
! 663: The start condition,
! 664: .Sq ^ ,
! 665: and
! 666: .Qq <<EOF>>
! 667: patterns can only occur at the beginning of a pattern and, like
! 668: .Sq /
! 669: and
! 670: .Sq $ ,
! 671: cannot be grouped inside parentheses.
! 672: A
! 673: .Sq ^
! 674: which does not occur at the beginning of a rule or a
! 675: .Sq $
! 676: which does not occur at the end of a rule loses its special properties
! 677: and is treated as a normal character.
! 678: .It
1.1 deraadt 679: The following are illegal:
1.16 ! jmc 680: .Bd -unfilled -offset indent
! 681: foo/bar$
! 682: <sc1>foo<sc2>bar
! 683: .Ed
! 684: .Pp
! 685: Note that the first of these can be written
! 686: .Qq foo/bar\en .
! 687: .It
! 688: The following will result in
! 689: .Sq $
! 690: or
! 691: .Sq ^
! 692: being treated as a normal character:
! 693: .Bd -unfilled -offset indent
! 694: foo|(bar$)
! 695: foo|^bar
! 696: .Ed
! 697: .Pp
! 698: If what's wanted is a
! 699: .Qq foo
! 700: or a bar-followed-by-a-newline, the following could be used
! 701: (the special
! 702: .Sq |\&
! 703: action is explained below):
! 704: .Bd -unfilled -offset indent
! 705: foo |
! 706: bar$ /* action goes here */
! 707: .Ed
! 708: .Pp
1.1 deraadt 709: A similar trick will work for matching a foo or a
710: bar-at-the-beginning-of-a-line.
1.16 ! jmc 711: .El
! 712: .Sh HOW THE INPUT IS MATCHED
! 713: When the generated scanner is run,
! 714: it analyzes its input looking for strings which match any of its patterns.
! 715: If it finds more than one match,
! 716: it takes the one matching the most text
! 717: (for trailing context rules, this includes the length of the trailing part,
! 718: even though it will then be returned to the input).
! 719: If it finds two or more matches of the same length,
! 720: the rule listed first in the
! 721: .Nm
1.1 deraadt 722: input file is chosen.
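.Pp
The selection rule described above can be sketched in C.
This is an illustration only: it assumes every rule is a plain literal string
rather than a regular expression, the function name is hypothetical, and it
does not reflect how the generated scanner is implemented.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the rule chosen at the start of input, or -1
 * if no rule matches.  Rules here are plain literal strings.
 */
int
choose_rule(const char *const rules[], size_t nrules, const char *input)
{
    size_t i, len, best_len = 0;
    int best = -1;

    for (i = 0; i < nrules; i++) {
        len = strlen(rules[i]);
        if (len == 0 || strncmp(input, rules[i], len) != 0)
            continue;
        /* Strictly longer wins; on a tie the earlier rule stays. */
        if (len > best_len) {
            best_len = len;
            best = (int)i;
        }
    }
    return best;
}
```
.Ed
.Pp
Because only a strictly longer match replaces the current choice,
a rule listed earlier keeps priority over a later rule of equal length.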
1.16 ! jmc 723: .Pp
1.1 deraadt 724: Once the match is determined, the text corresponding to the match
725: (called the
1.16 ! jmc 726: .Em token )
1.1 deraadt 727: is made available in the global character pointer
1.16 ! jmc 728: .Fa yytext ,
1.1 deraadt 729: and its length in the global integer
1.16 ! jmc 730: .Fa yyleng .
1.1 deraadt 731: The
1.16 ! jmc 732: .Em action
! 733: corresponding to the matched pattern is then executed
! 734: .Pq a more detailed description of actions follows ,
! 735: and then the remaining input is scanned for another match.
! 736: .Pp
! 737: If no match is found, then the default rule is executed:
! 738: the next character in the input is considered matched and
! 739: copied to the standard output.
! 740: Thus, the simplest legal
! 741: .Nm
1.1 deraadt 742: input is:
1.16 ! jmc 743: .Pp
! 744: .D1 %%
! 745: .Pp
! 746: which generates a scanner that simply copies its input
! 747: .Pq one character at a time
! 748: to its output.
! 749: .Pp
1.1 deraadt 750: Note that
1.16 ! jmc 751: .Fa yytext
! 752: can be defined in two different ways:
! 753: either as a character pointer or as a character array.
! 754: Which definition
! 755: .Nm
! 756: uses can be controlled by including one of the special directives
! 757: .Dq %pointer
! 758: or
! 759: .Dq %array
! 760: in the first
! 761: .Pq definitions
! 762: section of flex input.
! 763: The default is
! 764: .Dq %pointer ,
! 765: unless the
! 766: .Fl l
! 767: lex compatibility option is used, in which case
! 768: .Fa yytext
1.1 deraadt 769: will be an array.
770: The advantage of using
1.16 ! jmc 771: .Dq %pointer
1.1 deraadt 772: is substantially faster scanning and no buffer overflow when matching
1.16 ! jmc 773: very large tokens
! 774: .Pq unless dynamic memory runs out .
! 775: The disadvantage is that actions are restricted in how they can modify
! 776: .Fa yytext
! 777: .Pq see the next section ,
! 778: and calls to the
! 779: .Fn unput
1.10 deraadt 780: function destroy the present contents of
1.16 ! jmc 781: .Fa yytext ,
1.1 deraadt 782: which can be a considerable porting headache when moving between different
1.16 ! jmc 783: .Nm lex
1.1 deraadt 784: versions.
1.16 ! jmc 785: .Pp
1.1 deraadt 786: The advantage of
1.16 ! jmc 787: .Dq %array
! 788: is that
! 789: .Fa yytext
! 790: can be modified as much as wanted, and calls to
! 791: .Fn unput
1.1 deraadt 792: do not destroy
1.16 ! jmc 793: .Fa yytext
! 794: .Pq see below .
! 795: Furthermore, existing
! 796: .Nm lex
1.1 deraadt 797: programs sometimes access
1.16 ! jmc 798: .Fa yytext
1.1 deraadt 799: externally using declarations of the form:
1.16 ! jmc 800: .Pp
! 801: .D1 extern char yytext[];
! 802: .Pp
1.1 deraadt 803: This definition is erroneous when used with
1.16 ! jmc 804: .Dq %pointer ,
1.1 deraadt 805: but correct for
1.16 ! jmc 806: .Dq %array .
! 807: .Pp
! 808: .Dq %array
1.1 deraadt 809: defines
1.16 ! jmc 810: .Fa yytext
1.1 deraadt 811: to be an array of
1.16 ! jmc 812: .Dv YYLMAX
! 813: characters, which defaults to a fairly large value.
! 814: The size can be changed by simply #define'ing
! 815: .Dv YYLMAX
! 816: to a different value in the first section of
! 817: .Nm
! 818: input.
! 819: As mentioned above, with
! 820: .Dq %pointer
! 821: yytext grows dynamically to accommodate large tokens.
! 822: While this means a
! 823: .Dq %pointer
! 824: scanner can accommodate very large tokens
! 825: .Pq such as matching entire blocks of comments ,
! 826: bear in mind that each time the scanner must resize
! 827: .Fa yytext
1.1 deraadt 828: it also must rescan the entire token from the beginning, so matching such
829: tokens can prove slow.
1.16 ! jmc 830: .Fa yytext
! 831: presently does not dynamically grow if a call to
! 832: .Fn unput
1.1 deraadt 833: results in too much text being pushed back; instead, a run-time error results.
1.16 ! jmc 834: .Pp
! 835: Also note that
! 836: .Dq %array
! 837: cannot be used with C++ scanner classes
! 838: .Pq the c++ option; see below .
! 839: .Sh ACTIONS
! 840: Each pattern in a rule has a corresponding action,
! 841: which can be any C statement.
! 842: The pattern ends at the first non-escaped whitespace character;
! 843: the remainder of the line is its action.
! 844: If the action is empty,
! 845: then when the pattern is matched the input token is simply discarded.
! 846: For example, here is the specification for a program
! 847: which deletes all occurrences of
! 848: .Qq zap me
! 849: from its input:
! 850: .Bd -literal -offset indent
! 851: %%
! 852: "zap me"
! 853: .Ed
! 854: .Pp
1.1 deraadt 855: (It will copy all other characters in the input to the output since
856: they will be matched by the default rule.)
1.16 ! jmc 857: .Pp
1.1 deraadt 858: Here is a program which compresses multiple blanks and tabs down to
859: a single blank, and throws away whitespace found at the end of a line:
1.16 ! jmc 860: .Bd -literal -offset indent
! 861: %%
! 862: [ \et]+ putchar(' ');
! 863: [ \et]+$ /* ignore this token */
! 864: .Ed
! 865: .Pp
! 866: If the action contains a
! 867: .Sq { ,
! 868: then the action spans till the balancing
! 869: .Sq }
1.1 deraadt 870: is found, and the action may cross multiple lines.
1.16 ! jmc 871: .Nm
1.1 deraadt 872: knows about C strings and comments and won't be fooled by braces found
873: within them, but also allows actions to begin with
1.16 ! jmc 874: .Sq %{
1.1 deraadt 875: and will consider the action to be all the text up to the next
1.16 ! jmc 876: .Sq %}
! 877: .Pq regardless of ordinary braces inside the action .
! 878: .Pp
! 879: An action consisting solely of a vertical bar
! 880: .Pq Sq |\&
! 881: means
! 882: .Qq same as the action for the next rule .
! 883: See below for an illustration.
! 884: .Pp
! 885: Actions can include arbitrary C code,
! 886: including return statements to return a value to whatever routine called
! 887: .Fn yylex .
1.1 deraadt 888: Each time
1.16 ! jmc 889: .Fn yylex
! 890: is called, it continues processing tokens from where it last left off
! 891: until it either reaches the end of the file or executes a return.
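.Pp
The calling pattern this implies can be sketched with a small hand-written
scanner in C.
Everything here is an assumption made for illustration: the token codes, the
function names and the saved cursor are hypothetical and are not part of the
code that
.Nm
generates.
.Bd -literal -offset indent
```c
#include <ctype.h>

enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_WORD = 2 };

static const char *cursor = "";    /* where the next call resumes */

/* Point the sketch scanner at a new input string. */
void
set_input(const char *s)
{
    cursor = s;
}

/* Return the next token code, or TOK_EOF at the end of the input. */
int
next_token(void)
{
    for (;;) {
        unsigned char c = (unsigned char)*cursor;

        if (c == 0)
            return TOK_EOF;
        if (isdigit(c)) {
            while (isdigit((unsigned char)*cursor))
                cursor++;
            return TOK_NUMBER;    /* like a return in an action */
        }
        if (isalpha(c)) {
            while (isalpha((unsigned char)*cursor))
                cursor++;
            return TOK_WORD;
        }
        cursor++;    /* no return: discard and keep scanning */
    }
}
```
.Ed
.Pp
A caller would typically invoke such a routine in a loop until it reports
the end of the input, much as a parser repeatedly calls
.Fn yylex .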
! 892: .Pp
1.1 deraadt 893: Actions are free to modify
1.16 ! jmc 894: .Fa yytext
! 895: except for lengthening it
! 896: (adding characters to its end \- these will overwrite later characters in the
! 897: input stream).
! 898: This, however, does not apply when using
! 899: .Dq %array
! 900: .Pq see above ;
! 901: in that case,
! 902: .Fa yytext
1.1 deraadt 903: may be freely modified in any way.
1.16 ! jmc 904: .Pp
1.1 deraadt 905: Actions are free to modify
1.16 ! jmc 906: .Fa yyleng
1.1 deraadt 907: except they should not do so if the action also includes use of
1.16 ! jmc 908: .Fn yymore
! 909: .Pq see below .
! 910: .Pp
1.1 deraadt 911: There are a number of special directives which can be included within
912: an action:
1.16 ! jmc 913: .Bl -tag -width Ds
! 914: .It ECHO
! 915: Copies
! 916: .Fa yytext
! 917: to the scanner's output.
! 918: .It BEGIN
! 919: Followed by the name of a start condition, places the scanner in the
! 920: corresponding start condition
! 921: .Pq see below .
! 922: .It REJECT
! 923: Directs the scanner to proceed on to the
! 924: .Qq second best
! 925: rule which matched the input
! 926: .Pq or a prefix of the input .
! 927: The rule is chosen as described above in
! 928: .Sx HOW THE INPUT IS MATCHED ,
! 929: and
! 930: .Fa yytext
1.1 deraadt 931: and
1.16 ! jmc 932: .Fa yyleng
1.1 deraadt 933: set up appropriately.
934: It may either be one which matched as much text
935: as the originally chosen rule but came later in the
1.16 ! jmc 936: .Nm
1.1 deraadt 937: input file, or one which matched less text.
938: For example, the following will both count the
1.16 ! jmc 939: words in the input and call the routine
! 940: .Fn special
! 941: whenever
! 942: .Qq frob
! 943: is seen:
! 944: .Bd -literal -offset indent
! 945:     int word_count = 0;
! 946: %%
! 947:
! 948: frob special(); REJECT;
! 949: [^ \et\en]+ ++word_count;
! 950: .Ed
! 951: .Pp
1.1 deraadt 952: Without the
1.16 ! jmc 953: .Em REJECT ,
! 954: any "frob"'s in the input would not be counted as words,
! 955: since the scanner normally executes only one action per token.
1.1 deraadt 956: Multiple
1.16 ! jmc 957: .Em REJECT Ns 's
! 958: are allowed,
! 959: each one finding the next best choice to the currently active rule.
! 960: For example, when the following scanner scans the token
! 961: .Qq abcd ,
! 962: it will write
! 963: .Qq abcdabcaba
! 964: to the output:
! 965: .Bd -literal -offset indent
! 966: %%
! 967: a |
! 968: ab |
! 969: abc |
! 970: abcd ECHO; REJECT;
! 971: \&.|\en /* eat up any unmatched character */
! 972: .Ed
! 973: .Pp
1.1 deraadt 974: (The first three rules share the fourth's action since they use
1.16 ! jmc 975: the special
! 976: .Sq |\&
! 977: action.)
! 978: .Em REJECT
1.1 deraadt 979: is a particularly expensive feature in terms of scanner performance;
1.16 ! jmc 980: if it is used in any of the scanner's actions it will slow down
! 981: all of the scanner's matching.
! 982: Furthermore,
! 983: .Em REJECT
1.1 deraadt 984: cannot be used with the
1.16 ! jmc 985: .Fl Cf
1.1 deraadt 986: or
1.16 ! jmc 987: .Fl CF
! 988: options
! 989: .Pq see below .
! 990: .Pp
1.1 deraadt 991: Note also that unlike the other special actions,
1.16 ! jmc 992: .Em REJECT
1.1 deraadt 993: is a
1.16 ! jmc 994: .Em branch ;
! 995: code immediately following it in the action will not be executed.
! 996: .It yymore()
! 997: Tells the scanner that the next time it matches a rule, the corresponding
! 998: token should be appended onto the current value of
! 999: .Fa yytext
! 1000: rather than replacing it.
! 1001: For example, given the input
! 1002: .Qq mega-kludge
! 1003: the following will write
! 1004: .Qq mega-mega-kludge
! 1005: to the output:
! 1006: .Bd -literal -offset indent
! 1007: %%
! 1008: mega- ECHO; yymore();
! 1009: kludge ECHO;
! 1010: .Ed
! 1011: .Pp
! 1012: First
! 1013: .Qq mega-
! 1014: is matched and echoed to the output.
! 1015: Then
! 1016: .Qq kludge
! 1017: is matched, but the previous
! 1018: .Qq mega-
! 1019: is still hanging around at the beginning of
! 1020: .Fa yytext
1.1 deraadt 1021: so the
1.16 ! jmc 1022: .Em ECHO
! 1023: for the
! 1024: .Qq kludge
! 1025: rule will actually write
! 1026: .Qq mega-kludge .
! 1027: .Pp
1.1 deraadt 1028: Two notes regarding use of
1.16 ! jmc 1029: .Fn yymore :
1.1 deraadt 1030: First,
1.16 ! jmc 1031: .Fn yymore
1.1 deraadt 1032: depends on the value of
1.16 ! jmc 1033: .Fa yyleng
! 1034: correctly reflecting the size of the current token, so
! 1035: .Fa yyleng
! 1036: must not be modified when using
! 1037: .Fn yymore .
1.1 deraadt 1038: Second, the presence of
1.16 ! jmc 1039: .Fn yymore
1.1 deraadt 1040: in the scanner's action entails a minor performance penalty in the
1041: scanner's matching speed.
1.16 ! jmc 1042: .It yyless(n)
! 1043: Returns all but the first
! 1044: .Ar n
1.1 deraadt 1045: characters of the current token back to the input stream, where they
1046: will be rescanned when the scanner looks for the next match.
1.16 ! jmc 1047: .Fa yytext
1.1 deraadt 1048: and
1.16 ! jmc 1049: .Fa yyleng
1.1 deraadt 1050: are adjusted appropriately (e.g.,
1.16 ! jmc 1051: .Fa yyleng
1.1 deraadt 1052: will now be equal to
1.16 ! jmc 1053: .Ar n ) .
! 1054: For example, on the input
! 1055: .Qq foobar
! 1056: the following will write out
! 1057: .Qq foobarbar :
! 1058: .Bd -literal -offset indent
! 1059: %%
! 1060: foobar ECHO; yyless(3);
! 1061: [a-z]+ ECHO;
! 1062: .Ed
! 1063: .Pp
1.1 deraadt 1064: An argument of 0 to
1.16 ! jmc 1065: .Fn yyless
! 1066: will cause the entire current input string to be scanned again.
! 1067: Unless the way the scanner subsequently processes its input has been changed
! 1068: (using
! 1069: .Em BEGIN ,
! 1070: for example),
! 1071: this will result in an endless loop.
! 1072: .Pp
1.1 deraadt 1073: Note that
1.16 ! jmc 1074: .Fn yyless
! 1075: is a macro and can only be used in the
! 1076: .Nm
! 1077: input file, not from other source files.
! 1078: .It unput(c)
! 1079: Puts the character
! 1080: .Ar c
! 1081: back into the input stream.
! 1082: It will be the next character scanned.
1.1 deraadt 1083: The following action will take the current token and cause it
1084: to be rescanned enclosed in parentheses.
1.16 ! jmc 1085: .Bd -literal -offset indent
! 1086: {
! 1087: int i;
! 1088: char *yycopy;
! 1089:
! 1090: /* Copy yytext because unput() trashes yytext */
! 1091: if ((yycopy = strdup(yytext)) == NULL)
! 1092: err(1, NULL);
! 1093: unput(')');
! 1094: for (i = yyleng - 1; i >= 0; --i)
! 1095: unput(yycopy[i]);
! 1096: unput('(');
! 1097: free(yycopy);
! 1098: }
! 1099: .Ed
! 1100: .Pp
1.1 deraadt 1101: Note that since each
1.16 ! jmc 1102: .Fn unput
! 1103: puts the given character back at the beginning of the input stream,
! 1104: pushing back strings must be done back-to-front.
! 1105: .Pp
1.1 deraadt 1106: An important potential problem when using
1.16 ! jmc 1107: .Fn unput
! 1108: is that if using
! 1109: .Dq %pointer
! 1110: .Pq the default ,
! 1111: a call to
! 1112: .Fn unput
! 1113: destroys the contents of
! 1114: .Fa yytext ,
1.1 deraadt 1115: starting with its rightmost character and devouring one character to
1.16 ! jmc 1116: the left with each call.
! 1117: If the value of
! 1118: .Fa yytext
! 1119: should be preserved after a call to
! 1120: .Fn unput
! 1121: .Pq as in the above example ,
! 1122: it must either first be copied elsewhere, or the scanner must be built using
! 1123: .Dq %array
! 1124: instead (see
! 1125: .Sx HOW THE INPUT IS MATCHED ) .
! 1126: .Pp
! 1127: Finally, note that EOF cannot be put back
1.1 deraadt 1128: in an attempt to mark the input stream with an end-of-file.
1.16 ! jmc 1129: .It input()
! 1130: Reads the next character from the input stream.
! 1131: For example, the following is one way to eat up C comments:
! 1132: .Bd -literal -offset indent
! 1133: %%
! 1134: "/*" {
! 1135: int c;
! 1136:
! 1137: for (;;) {
! 1138: while ((c = input()) != '*' && c != EOF)
! 1139: ; /* eat up text of comment */
! 1140:
! 1141: if (c == '*') {
! 1142: while ((c = input()) == '*')
! 1143: ;
! 1144: if (c == '/')
! 1145: break; /* found the end */
! 1146: }
! 1147:
! 1148: if (c == EOF) {
! 1149: errx(1, "EOF in comment");
1.1 deraadt 1150: break;
1151: }
1.16 ! jmc 1152: }
! 1153: }
! 1154: .Ed
! 1155: .Pp
! 1156: (Note that if the scanner is compiled using C++, then
! 1157: .Fn input
1.1 deraadt 1158: is instead referred to as
1.16 ! jmc 1159: .Fn yyinput ,
! 1160: in order to avoid a name clash with the C++ stream by the name of input.)
! 1161: .It YY_FLUSH_BUFFER
! 1162: Flushes the scanner's internal buffer
! 1163: so that the next time the scanner attempts to match a token,
! 1164: it will first refill the buffer using
! 1165: .Dv YY_INPUT
! 1166: (see
! 1167: .Sx THE GENERATED SCANNER ,
! 1168: below).
! 1169: This action is a special case of the more general
! 1170: .Fn yy_flush_buffer
! 1171: function, described below in the section
! 1172: .Sx MULTIPLE INPUT BUFFERS .
! 1173: .It yyterminate()
! 1174: Can be used in lieu of a return statement in an action.
! 1175: It terminates the scanner and returns a 0 to the scanner's caller, indicating
! 1176: .Qq all done .
1.1 deraadt 1177: By default,
1.16 ! jmc 1178: .Fn yyterminate
! 1179: is also called when an end-of-file is encountered.
! 1180: It is a macro and may be redefined.
! 1181: .El
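.Pp
The back-to-front rule for
.Fn unput
described above can be pictured with a small stand-alone model.
The sketch below is not part of
.Nm
and does not use its internals;
the names model_unput, model_input, push_parenthesized and PUSHBACK_MAX
are hypothetical and exist only for this illustration.
It assumes a simple last-in, first-out pushback store:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64		/* hypothetical capacity of this model */

static char pushback[PUSHBACK_MAX];
static size_t pushback_len;

/* Model of unput(): the character becomes the next one to be read. */
static void
model_unput(char c)
{
	assert(pushback_len < PUSHBACK_MAX);
	pushback[pushback_len++] = c;
}

/* Model of input(): returns the most recently pushed character, or -1. */
static int
model_input(void)
{
	if (pushback_len == 0)
		return -1;
	return (unsigned char)pushback[--pushback_len];
}

/* Push "(", the token, and ")" so they are read back in that order. */
static void
push_parenthesized(const char *token)
{
	size_t i;

	model_unput(')');			/* read last, so pushed first */
	for (i = strlen(token); i > 0; i--)
		model_unput(token[i - 1]);	/* token, back to front */
	model_unput('(');			/* read first, so pushed last */
}
```

Because each pushed character goes in front of everything pushed before it,
the closing parenthesis has to be pushed first and the opening one last.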
! 1182: .Sh THE GENERATED SCANNER
1.1 deraadt 1183: The output of
1.16 ! jmc 1184: .Nm
1.1 deraadt 1185: is the file
1.16 ! jmc 1186: .Pa lex.yy.c ,
1.1 deraadt 1187: which contains the scanning routine
1.16 ! jmc 1188: .Fn yylex ,
! 1189: a number of tables used by it for matching tokens,
! 1190: and a number of auxiliary routines and macros.
! 1191: By default,
! 1192: .Fn yylex
1.1 deraadt 1193: is declared as follows:
1.16 ! jmc 1194: .Bd -unfilled -offset indent
! 1195: int yylex()
! 1196: {
! 1197: ... various definitions and the actions in here ...
! 1198: }
! 1199: .Ed
! 1200: .Pp
! 1201: (If the environment supports function prototypes, then it will
! 1202: be "int yylex(void)".)
! 1203: This definition may be changed by defining the
! 1204: .Dv YY_DECL
! 1205: macro.
! 1206: For example:
! 1207: .Bd -literal -offset indent
! 1208: #define YY_DECL float lexscan(a, b) float a, b;
! 1209: .Ed
! 1210: .Pp
! 1211: would give the scanning routine the name
! 1212: .Em lexscan ,
! 1213: returning a float, and taking two floats as arguments.
! 1214: Note that if arguments are given to the scanning routine using a
! 1215: K&R-style/non-prototyped function declaration,
! 1216: the definition must be terminated with a semi-colon
! 1217: .Pq Sq ;\& .
! 1218: .Pp
1.1 deraadt 1219: Whenever
1.16 ! jmc 1220: .Fn yylex
1.1 deraadt 1221: is called, it scans tokens from the global input file
1.16 ! jmc 1222: .Pa yyin
! 1223: .Pq which defaults to stdin .
! 1224: It continues until it either reaches an end-of-file
! 1225: .Pq at which point it returns the value 0
! 1226: or one of its actions executes a
! 1227: .Em return
1.1 deraadt 1228: statement.
1.16 ! jmc 1229: .Pp
1.1 deraadt 1230: If the scanner reaches an end-of-file, the behavior of subsequent calls is undefined
1231: unless either
1.16 ! jmc 1232: .Em yyin
! 1233: is pointed at a new input file
! 1234: .Pq in which case scanning continues from that file ,
! 1235: or
! 1236: .Fn yyrestart
1.1 deraadt 1237: is called.
1.16 ! jmc 1238: .Fn yyrestart
1.1 deraadt 1239: takes one argument, a
1.16 ! jmc 1240: .Fa FILE *
! 1241: pointer (which can be nil, if
! 1242: .Dv YY_INPUT
! 1243: has been set up to scan from a source other than
! 1244: .Em yyin ) ,
1.1 deraadt 1245: and initializes
1.16 ! jmc 1246: .Em yyin
! 1247: for scanning from that file.
! 1248: Essentially there is no difference between just assigning
! 1249: .Em yyin
1.1 deraadt 1250: to a new input file or using
1.16 ! jmc 1251: .Fn yyrestart
! 1252: to do so; the latter is available for compatibility with previous versions of
! 1253: .Nm ,
1.1 deraadt 1254: and because it can be used to switch input files in the middle of scanning.
1.16 ! jmc 1255: It can also be used to throw away the current input buffer,
! 1256: by calling it with an argument of
! 1257: .Em yyin ;
1.1 deraadt 1258: but it is better to use
1.16 ! jmc 1259: .Dv YY_FLUSH_BUFFER
! 1260: .Pq see above .
1.1 deraadt 1261: Note that
1.16 ! jmc 1262: .Fn yyrestart
! 1263: does not reset the start condition to
! 1264: .Em INITIAL
! 1265: (see
! 1266: .Sx START CONDITIONS ,
! 1267: below).
! 1268: .Pp
1.1 deraadt 1269: If
1.16 ! jmc 1270: .Fn yylex
1.1 deraadt 1271: stops scanning due to executing a
1.16 ! jmc 1272: .Em return
1.1 deraadt 1273: statement in one of the actions, the scanner may then be called again and it
1274: will resume scanning where it left off.
1.16 ! jmc 1275: .Pp
! 1276: By default
! 1277: .Pq and for purposes of efficiency ,
! 1278: the scanner uses block-reads rather than simple
! 1279: .Xr getc 3
1.1 deraadt 1280: calls to read characters from
1.16 ! jmc 1281: .Em yyin .
1.1 deraadt 1282: The nature of how it gets its input can be controlled by defining the
1.16 ! jmc 1283: .Dv YY_INPUT
1.1 deraadt 1284: macro.
1.16 ! jmc 1285: .Dv YY_INPUT Ns 's
! 1286: calling sequence is
! 1287: .Qq YY_INPUT(buf,result,max_size) .
! 1288: Its action is to place up to
! 1289: .Dv max_size
1.1 deraadt 1290: characters in the character array
1.16 ! jmc 1291: .Em buf
1.1 deraadt 1292: and return in the integer variable
1.16 ! jmc 1293: .Em result
! 1294: either the number of characters read or the constant
! 1295: .Dv YY_NULL
! 1296: (0 on
! 1297: .Ux
! 1298: systems)
! 1299: to indicate
! 1300: .Dv EOF .
! 1301: The default
! 1302: .Dv YY_INPUT
! 1303: reads from the global file-pointer
! 1304: .Qq yyin .
! 1305: .Pp
! 1306: A sample definition of
! 1307: .Dv YY_INPUT
! 1308: .Pq in the definitions section of the input file :
! 1309: .Bd -unfilled -offset indent
! 1310: %{
! 1311: #define YY_INPUT(buf,result,max_size) \e
! 1312: { \e
! 1313: int c = getchar(); \e
! 1314: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
! 1315: }
! 1316: %}
! 1317: .Ed
! 1318: .Pp
1.1 deraadt 1319: This definition will change the input processing to occur
1320: one character at a time.
1.16 ! jmc 1321: .Pp
! 1322: When the scanner receives an end-of-file indication from
! 1323: .Dv YY_INPUT ,
1.1 deraadt 1324: it then checks the
1.16 ! jmc 1325: .Fn yywrap
! 1326: function.
! 1327: If
! 1328: .Fn yywrap
! 1329: returns false
! 1330: .Pq zero ,
! 1331: then it is assumed that the function has gone ahead and set up
! 1332: .Em yyin
! 1333: to point to another input file, and scanning continues.
! 1334: If it returns true
! 1335: .Pq non-zero ,
! 1336: then the scanner terminates, returning 0 to its caller.
! 1337: Note that in either case, the start condition remains unchanged;
! 1338: it does not revert to
! 1339: .Em INITIAL .
! 1340: .Pp
1.1 deraadt 1341: If you do not supply your own version of
1.16 ! jmc 1342: .Fn yywrap ,
1.1 deraadt 1343: then you must either use
1.16 ! jmc 1344: .Dq %option noyywrap
1.1 deraadt 1345: (in which case the scanner behaves as though
1.16 ! jmc 1346: .Fn yywrap
1.1 deraadt 1347: returned 1), or you must link with
1.16 ! jmc 1348: .Fl lfl
1.1 deraadt 1349: to obtain the default version of the routine, which always returns 1.
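.Pp
A user-supplied
.Fn yywrap
that steps through a list of files might follow the pattern sketched below.
The sketch is written as a stand-alone model so that it can be compiled
without a generated scanner:
yyin_model stands in for the scanner's
.Em yyin ,
and set_pending_files, model_yywrap, pending_files and pending_count
are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the scanner's global yyin; a real scanner gets it from flex. */
static FILE *yyin_model;

static char **pending_files;	/* hypothetical: names not yet opened */
static int pending_count;	/* hypothetical: how many names remain */

/* Register the files to be scanned after the current one. */
static void
set_pending_files(char **names, int count)
{
	assert(count >= 0);
	pending_files = names;
	pending_count = count;
}

/*
 * Model of yywrap(): returns 0 after pointing the input at the next
 * file that can be opened, or 1 when there is nothing left to scan.
 */
static int
model_yywrap(void)
{
	while (pending_count > 0) {
		FILE *fp = fopen(*pending_files, "r");

		pending_files++;
		pending_count--;
		if (fp != NULL) {
			if (yyin_model != NULL && yyin_model != stdin)
				fclose(yyin_model);
			yyin_model = fp;
			return 0;	/* keep scanning */
		}
		/* unreadable file: skip it and try the next name */
	}
	return 1;			/* all done */
}
```

Skipping unreadable files silently is a choice made for brevity here;
a real scanner would probably report the failure.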
1.16 ! jmc 1350: .Pp
1.1 deraadt 1351: Three routines are available for scanning from in-memory buffers rather
1352: than files:
1.16 ! jmc 1353: .Fn yy_scan_string ,
! 1354: .Fn yy_scan_bytes ,
1.1 deraadt 1355: and
1.16 ! jmc 1356: .Fn yy_scan_buffer .
! 1357: See the discussion of them below in the section
! 1358: .Sx MULTIPLE INPUT BUFFERS .
! 1359: .Pp
1.1 deraadt 1360: The scanner writes its
1.16 ! jmc 1361: .Em ECHO
1.1 deraadt 1362: output to the
1.16 ! jmc 1363: .Em yyout
! 1364: global
! 1365: .Pq default, stdout ,
! 1366: which may be redefined by the user simply by assigning it to some other
! 1367: .Va FILE
1.1 deraadt 1368: pointer.
1.16 ! jmc 1369: .Sh START CONDITIONS
! 1370: .Nm
! 1371: provides a mechanism for conditionally activating rules.
! 1372: Any rule whose pattern is prefixed with
! 1373: .Qq Aq sc
! 1374: will only be active when the scanner is in the start condition named
! 1375: .Qq sc .
! 1376: For example,
! 1377: .Bd -literal -offset indent
! 1378: <STRING>[^"]* { /* eat up the string body ... */
! 1379: ...
! 1380: }
! 1381: .Ed
! 1382: .Pp
! 1383: will be active only when the scanner is in the
! 1384: .Qq STRING
! 1385: start condition, and
! 1386: .Bd -literal -offset indent
! 1387: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
! 1388: ...
! 1389: }
! 1390: .Ed
! 1391: .Pp
! 1392: will be active only when the current start condition is either
! 1393: .Qq INITIAL ,
! 1394: .Qq STRING ,
! 1395: or
! 1396: .Qq QUOTE .
! 1397: .Pp
! 1398: Start conditions are declared in the definitions
! 1399: .Pq first
! 1400: section of the input using unindented lines beginning with either
! 1401: .Sq %s
1.1 deraadt 1402: or
1.16 ! jmc 1403: .Sq %x
1.1 deraadt 1404: followed by a list of names.
1405: The former declares
1.16 ! jmc 1406: .Em inclusive
1.1 deraadt 1407: start conditions, the latter
1.16 ! jmc 1408: .Em exclusive
! 1409: start conditions.
! 1410: A start condition is activated using the
! 1411: .Em BEGIN
! 1412: action.
! 1413: Until the next
! 1414: .Em BEGIN
! 1415: action is executed, rules with the given start condition will be active and
1.1 deraadt 1416: rules with other start conditions will be inactive.
1.16 ! jmc 1417: If the start condition is inclusive,
1.1 deraadt 1418: then rules with no start conditions at all will also be active.
1.16 ! jmc 1419: If it is exclusive,
! 1420: then only rules qualified with the start condition will be active.
1.1 deraadt 1421: A set of rules contingent on the same exclusive start condition
1422: describe a scanner which is independent of any of the other rules in the
1.16 ! jmc 1423: .Nm
! 1424: input.
! 1425: Because of this, exclusive start conditions make it easy to specify
! 1426: .Qq mini-scanners
1.1 deraadt 1427: which scan portions of the input that are syntactically different
1.16 ! jmc 1428: from the rest
! 1429: .Pq e.g., comments .
! 1430: .Pp
1.1 deraadt 1431: If the distinction between inclusive and exclusive start conditions
1432: is still a little vague, here's a simple example illustrating the
1.16 ! jmc 1433: connection between the two.
! 1434: The set of rules:
! 1435: .Bd -literal -offset indent
! 1436: %s example
! 1437: %%
! 1438:
! 1439: <example>foo do_something();
! 1440:
! 1441: bar something_else();
! 1442: .Ed
! 1443: .Pp
1.1 deraadt 1444: is equivalent to
1.16 ! jmc 1445: .Bd -literal -offset indent
! 1446: %x example
! 1447: %%
! 1448:
! 1449: <example>foo do_something();
! 1450:
! 1451: <INITIAL,example>bar something_else();
! 1452: .Ed
! 1453: .Pp
1.1 deraadt 1454: Without the
1.16 ! jmc 1455: .Aq INITIAL,example
1.1 deraadt 1456: qualifier, the
1.16 ! jmc 1457: .Dq bar
! 1458: pattern in the second example wouldn't be active
! 1459: .Pq i.e., couldn't match
1.1 deraadt 1460: when in start condition
1.16 ! jmc 1461: .Dq example .
1.1 deraadt 1462: If we just used
1.16 ! jmc 1463: .Aq example
1.1 deraadt 1464: to qualify
1.16 ! jmc 1465: .Dq bar ,
1.1 deraadt 1466: though, then it would only be active in
1.16 ! jmc 1467: .Dq example
1.1 deraadt 1468: and not in
1.16 ! jmc 1469: .Em INITIAL ,
! 1470: while in the first example it's active in both,
! 1471: because in the first example the
! 1472: .Dq example
! 1473: start condition is an inclusive
! 1474: .Pq Sq %s
1.1 deraadt 1475: start condition.
1.16 ! jmc 1476: .Pp
1.1 deraadt 1477: Also note that the special start-condition specifier
1.16 ! jmc 1478: .Sq Aq *
! 1479: matches every start condition.
! 1480: Thus, the above example could also have been written:
! 1481: .Bd -literal -offset indent
! 1482: %x example
! 1483: %%
! 1484:
! 1485: <example>foo do_something();
! 1486:
! 1487: <*>bar something_else();
! 1488: .Ed
! 1489: .Pp
1.1 deraadt 1490: The default rule (to
1.16 ! jmc 1491: .Em ECHO
! 1492: any unmatched character) remains active in start conditions.
! 1493: It is equivalent to:
! 1494: .Bd -literal -offset indent
! 1495: <*>.|\en ECHO;
! 1496: .Ed
! 1497: .Pp
! 1498: .Dq BEGIN(0)
1.1 deraadt 1499: returns to the original state where only the rules with
1.16 ! jmc 1500: no start conditions are active.
! 1501: This state can also be referred to as the start-condition
! 1502: .Em INITIAL ,
! 1503: so
! 1504: .Dq BEGIN(INITIAL)
1.1 deraadt 1505: is equivalent to
1.16 ! jmc 1506: .Dq BEGIN(0) .
1.1 deraadt 1507: (The parentheses around the start condition name are not required but
1508: are considered good style.)
1.16 ! jmc 1509: .Pp
! 1510: .Em BEGIN
1.1 deraadt 1511: actions can also be given as indented code at the beginning
1.16 ! jmc 1512: of the rules section.
! 1513: For example, the following will cause the scanner to enter the
! 1514: .Qq SPECIAL
! 1515: start condition whenever
! 1516: .Fn yylex
1.1 deraadt 1517: is called and the global variable
1.16 ! jmc 1518: .Fa enter_special
1.1 deraadt 1519: is true:
1.16 ! jmc 1520: .Bd -literal -offset indent
! 1521: int enter_special;
1.1 deraadt 1522:
1.16 ! jmc 1523: %x SPECIAL
! 1524: %%
! 1525: if (enter_special)
1.1 deraadt 1526: BEGIN(SPECIAL);
1527:
1.16 ! jmc 1528: <SPECIAL>blahblahblah
! 1529: \&...more rules follow...
! 1530: .Ed
! 1531: .Pp
1.1 deraadt 1532: To illustrate the uses of start conditions,
1533: here is a scanner which provides two different interpretations
1.16 ! jmc 1534: of a string like
! 1535: .Qq 123.456 .
! 1536: By default it will treat it as three tokens: the integer
! 1537: .Qq 123 ,
! 1538: a dot
! 1539: .Pq Sq .\& ,
! 1540: and the integer
! 1541: .Qq 456 .
1.1 deraadt 1542: But if the string is preceded earlier in the line by the string
1.16 ! jmc 1543: .Qq expect-floats
! 1544: it will treat it as a single token, the floating-point number 123.456:
! 1545: .Bd -literal -offset indent
! 1546: %{
! 1547: #include <math.h>
! 1548: %}
! 1549: %s expect
! 1550:
! 1551: %%
! 1552: expect-floats BEGIN(expect);
! 1553:
! 1554: <expect>[0-9]+"."[0-9]+ {
! 1555: printf("found a float, = %f\en",
! 1556: atof(yytext));
! 1557: }
! 1558: <expect>\en {
! 1559: /*
! 1560: * That's the end of the line, so
! 1561: * we need another "expect-floats"
! 1562: * before we'll recognize any more
! 1563: * numbers.
! 1564: */
! 1565: BEGIN(INITIAL);
! 1566: }
! 1567:
! 1568: [0-9]+ {
! 1569: printf("found an integer, = %d\en",
! 1570: atoi(yytext));
! 1571: }
! 1572:
! 1573: "." printf("found a dot\en");
! 1574: .Ed
! 1575: .Pp
! 1576: Here is a scanner which recognizes
! 1577: .Pq and discards
! 1578: C comments while maintaining a count of the current input line:
! 1579: .Bd -literal -offset indent
! 1580: %x comment
! 1581: %%
! 1582: int line_num = 1;
! 1583:
! 1584: "/*" BEGIN(comment);
! 1585:
! 1586: <comment>[^*\en]* /* eat anything that's not a '*' */
! 1587: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
! 1588: <comment>\en ++line_num;
! 1589: <comment>"*"+"/" BEGIN(INITIAL);
! 1590: .Ed
! 1591: .Pp
1.1 deraadt 1592: This scanner goes to a bit of trouble to match as much
1.16 ! jmc 1593: text as possible with each rule.
! 1594: In general, when attempting to write a high-speed scanner
! 1595: try to match as much as possible in each rule, as it's a big win.
! 1596: .Pp
1.10 deraadt 1597: Note that start-condition names are really integer values and
1.16 ! jmc 1598: can be stored as such.
! 1599: Thus, the above could be extended in the following fashion:
! 1600: .Bd -literal -offset indent
! 1601: %x comment foo
! 1602: %%
! 1603: int line_num = 1;
! 1604: int comment_caller;
! 1605:
! 1606: "/*" {
! 1607: comment_caller = INITIAL;
! 1608: BEGIN(comment);
! 1609: }
! 1610:
! 1611: \&...
! 1612:
! 1613: <foo>"/*" {
! 1614: comment_caller = foo;
! 1615: BEGIN(comment);
! 1616: }
! 1617:
! 1618: <comment>[^*\en]* /* eat anything that's not a '*' */
! 1619: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
! 1620: <comment>\en ++line_num;
! 1621: <comment>"*"+"/" BEGIN(comment_caller);
! 1622: .Ed
! 1623: .Pp
! 1624: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1625: the integer-valued
1.16 ! jmc 1626: .Dv YY_START
! 1627: macro.
! 1628: For example, the above assignments to
! 1629: .Em comment_caller
1.1 deraadt 1630: could instead be written
1.16 ! jmc 1631: .Pp
! 1632: .Dl comment_caller = YY_START;
! 1633: .Pp
1.1 deraadt 1634: Flex provides
1.16 ! jmc 1635: .Dv YYSTATE
1.1 deraadt 1636: as an alias for
1.16 ! jmc 1637: .Dv YY_START
1.1 deraadt 1638: (since that is what's used by AT&T
1.16 ! jmc 1639: .Nm lex ) .
! 1640: .Pp
! 1641: Note that start conditions do not have their own name-space;
! 1642: %s's and %x's declare names in the same fashion as #define's.
! 1643: .Pp
1.1 deraadt 1644: Finally, here's an example of how to match C-style quoted strings using
1.16 ! jmc 1645: exclusive start conditions, including expanded escape sequences
! 1646: (but not including checking for a string that's too long):
! 1647: .Bd -literal -offset indent
! 1648: %x str
! 1649:
! 1650: %%
! 1651: #define MAX_STR_CONST 1024
! 1652: char string_buf[MAX_STR_CONST];
! 1653: char *string_buf_ptr;
! 1654:
! 1655: \e" string_buf_ptr = string_buf; BEGIN(str);
! 1656:
! 1657: <str>\e" { /* saw closing quote - all done */
! 1658: BEGIN(INITIAL);
! 1659: *string_buf_ptr = '\e0';
! 1660: /*
! 1661: * return string constant token type and
! 1662: * value to parser
! 1663: */
! 1664: }
! 1665:
! 1666: <str>\en {
! 1667: /* error - unterminated string constant */
! 1668: /* generate error message */
! 1669: }
! 1670:
! 1671: <str>\e\e[0-7]{1,3} {
! 1672: /* octal escape sequence */
! 1673: unsigned int result;
! 1674:
! 1675: (void) sscanf(yytext + 1, "%o", &result);
! 1676:
! 1677: if (result > 0xff) {
! 1678: /* error, constant is out-of-bounds */
! 1679: } else
! 1680: *string_buf_ptr++ = result;
! 1681: }
! 1682:
! 1683: <str>\e\e[0-9]+ {
! 1684: /*
! 1685: * generate error - bad escape sequence; something
! 1686: * like '\e48' or '\e0777777'
! 1687: */
! 1688: }
! 1689:
! 1690: <str>\e\en *string_buf_ptr++ = '\en';
! 1691: <str>\e\et *string_buf_ptr++ = '\et';
! 1692: <str>\e\er *string_buf_ptr++ = '\er';
! 1693: <str>\e\eb *string_buf_ptr++ = '\eb';
! 1694: <str>\e\ef *string_buf_ptr++ = '\ef';
! 1695:
! 1696: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
! 1697:
! 1698: <str>[^\e\e\en\e"]+ {
! 1699: char *yptr = yytext;
! 1700:
! 1701: while (*yptr)
! 1702: *string_buf_ptr++ = *yptr++;
! 1703: }
! 1704: .Ed
! 1705: .Pp
! 1706: Often, such as in some of the examples above,
! 1707: a group of rules are all preceded by the same start condition(s).
! 1708: .Nm
1.1 deraadt 1709: makes this a little easier and cleaner by introducing a notion of
1710: start condition
1.16 ! jmc 1711: .Em scope .
1.1 deraadt 1712: A start condition scope is begun with:
1.16 ! jmc 1713: .Pp
! 1714: .Dl <SCs>{
! 1715: .Pp
1.1 deraadt 1716: where
1.16 ! jmc 1717: .Dq SCs
! 1718: is a list of one or more start conditions.
! 1719: Inside the start condition scope, every rule automatically has the prefix
! 1720: .Aq SCs
1.1 deraadt 1721: applied to it, until a
1.16 ! jmc 1722: .Sq }
1.1 deraadt 1723: which matches the initial
1.16 ! jmc 1724: .Sq { .
1.1 deraadt 1725: So, for example,
1.16 ! jmc 1726: .Bd -literal -offset indent
! 1727: <ESC>{
! 1728: "\e\en" return '\en';
! 1729: "\e\er" return '\er';
! 1730: "\e\ef" return '\ef';
! 1731: "\e\e0" return '\e0';
! 1732: }
! 1733: .Ed
! 1734: .Pp
1.1 deraadt 1735: is equivalent to:
1.16 ! jmc 1736: .Bd -literal -offset indent
! 1737: <ESC>"\e\en" return '\en';
! 1738: <ESC>"\e\er" return '\er';
! 1739: <ESC>"\e\ef" return '\ef';
! 1740: <ESC>"\e\e0" return '\e0';
! 1741: .Ed
! 1742: .Pp
1.1 deraadt 1743: Start condition scopes may be nested.
1.16 ! jmc 1744: .Pp
1.1 deraadt 1745: Three routines are available for manipulating stacks of start conditions:
1.16 ! jmc 1746: .Bl -tag -width Ds
! 1747: .It void yy_push_state(int new_state)
! 1748: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1749: stack and switches to
1.16 ! jmc 1750: .Fa new_state
! 1751: as though
! 1752: .Dq BEGIN new_state
! 1753: had been used
! 1754: .Pq recall that start condition names are also integers .
! 1755: .It void yy_pop_state()
! 1756: Pops the top of the stack and switches to it via
! 1757: .Em BEGIN .
! 1758: .It int yy_top_state()
! 1759: Returns the top of the stack without altering the stack's contents.
! 1760: .El
! 1761: .Pp
1.1 deraadt 1762: The start condition stack grows dynamically and so has no built-in
1.16 ! jmc 1763: size limitation.
! 1764: If memory is exhausted, program execution aborts.
! 1765: .Pp
! 1766: To use start condition stacks, scanners must include a
! 1767: .Dq %option stack
! 1768: directive (see
! 1769: .Sx OPTIONS
! 1770: below).
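.Pp
The behavior of the three stack routines can be summarized
with a small stand-alone model.
It is a sketch of the semantics described above, not the code that
.Nm
generates;
model_push_state, model_pop_state, model_top_state and current_state
are hypothetical names, with current_state playing the role of the value of
.Dv YY_START .

```c
#include <assert.h>
#include <stdlib.h>

#define INITIAL 0	/* the text equates BEGIN(INITIAL) with BEGIN(0) */

static int current_state = INITIAL;	/* model of the active start condition */
static int *stack;			/* saved start conditions */
static size_t depth, capacity;

/* Save the current start condition, then switch to new_state. */
static void
model_push_state(int new_state)
{
	if (depth == capacity) {
		size_t ncap = capacity ? capacity * 2 : 8;
		int *p = realloc(stack, ncap * sizeof(*p));

		if (p == NULL)
			abort();	/* memory exhausted: execution aborts */
		stack = p;
		capacity = ncap;
	}
	stack[depth++] = current_state;
	current_state = new_state;
}

/* Switch back to the most recently saved start condition. */
static void
model_pop_state(void)
{
	assert(depth > 0);
	current_state = stack[--depth];
}

/* Report what a pop would switch to, without changing the stack. */
static int
model_top_state(void)
{
	assert(depth > 0);
	return stack[depth - 1];
}
```

Note that the top of the stack is the condition that was active before
the most recent push, not the condition that is active now.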
! 1771: .Sh MULTIPLE INPUT BUFFERS
! 1772: Some scanners
! 1773: (such as those which support
! 1774: .Qq include
! 1775: files)
! 1776: require reading from several input streams.
! 1777: As
! 1778: .Nm
1.1 deraadt 1779: scanners do a large amount of buffering, one cannot control
1780: where the next input will be read from by simply writing a
1.16 ! jmc 1781: .Dv YY_INPUT
1.1 deraadt 1782: which is sensitive to the scanning context.
1.16 ! jmc 1783: .Dv YY_INPUT
1.1 deraadt 1784: is only called when the scanner reaches the end of its buffer, which
1.16 ! jmc 1785: may be a long time after scanning a statement such as an
! 1786: .Qq include
1.1 deraadt 1787: which requires switching the input source.
1.16 ! jmc 1788: .Pp
1.1 deraadt 1789: To negotiate these sorts of problems,
1.16 ! jmc 1790: .Nm
1.1 deraadt 1791: provides a mechanism for creating and switching between multiple
1.16 ! jmc 1792: input buffers.
! 1793: An input buffer is created by using:
! 1794: .Pp
! 1795: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
! 1796: .Pp
1.1 deraadt 1797: which takes a
1.16 ! jmc 1798: .Fa FILE
! 1799: pointer and a
! 1800: .Fa size
! 1801: and creates a buffer associated with the given file and large enough to hold
! 1802: .Fa size
1.1 deraadt 1803: characters (when in doubt, use
1.16 ! jmc 1804: .Dv YY_BUF_SIZE
! 1805: for the size).
! 1806: It returns a
! 1807: .Dv YY_BUFFER_STATE
! 1808: handle, which may then be passed to other routines
! 1809: .Pq see below .
! 1810: The
! 1811: .Dv YY_BUFFER_STATE
1.1 deraadt 1812: type is a pointer to an opaque
1.16 ! jmc 1813: .Dq struct yy_buffer_state
! 1814: structure, so
! 1815: .Dv YY_BUFFER_STATE
! 1816: variables may be safely initialized to
! 1817: .Dq ((YY_BUFFER_STATE) 0)
! 1818: if desired, and the opaque structure can also be referred to in order to
! 1819: correctly declare input buffers in source files other than that of scanners.
! 1820: Note that the
! 1821: .Fa FILE
1.1 deraadt 1822: pointer in the call to
1.16 ! jmc 1823: .Fn yy_create_buffer
1.1 deraadt 1824: is only used as the value of
1.16 ! jmc 1825: .Fa yyin
1.1 deraadt 1826: seen by
1.16 ! jmc 1827: .Dv YY_INPUT ;
! 1828: if
! 1829: .Dv YY_INPUT
! 1830: is redefined so that it no longer uses
! 1831: .Fa yyin ,
! 1832: then a nil
! 1833: .Fa FILE
! 1834: pointer can safely be passed to
! 1835: .Fn yy_create_buffer .
! 1836: To select a particular buffer to scan:
! 1837: .Pp
! 1838: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
! 1839: .Pp
! 1840: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1841: come from
1.16 ! jmc 1842: .Fa new_buffer .
1.1 deraadt 1843: Note that
1.16 ! jmc 1844: .Fn yy_switch_to_buffer
! 1845: may be used by
! 1846: .Fn yywrap
! 1847: to set things up for continued scanning,
! 1848: instead of opening a new file and pointing
! 1849: .Fa yyin
! 1850: at it.
! 1851: Note also that switching input sources via either
! 1852: .Fn yy_switch_to_buffer
! 1853: or
! 1854: .Fn yywrap
! 1855: does not change the start condition.
! 1856: .Pp
! 1857: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
! 1858: .Pp
! 1859: is used to reclaim the storage associated with a buffer.
! 1860: .Pf ( Fa buffer
1.1 deraadt 1861: can be nil, in which case the routine does nothing.)
1.16 ! jmc 1862: To clear the current contents of a buffer:
! 1863: .Pp
! 1864: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
! 1865: .Pp
1.1 deraadt 1866: This function discards the buffer's contents,
1.16 ! jmc 1867: so the next time the scanner attempts to match a token from the buffer,
! 1868: it will first fill the buffer anew using
! 1869: .Dv YY_INPUT .
! 1870: .Pp
! 1871: .Fn yy_new_buffer
1.1 deraadt 1872: is an alias for
1.16 ! jmc 1873: .Fn yy_create_buffer ,
1.1 deraadt 1874: provided for compatibility with the C++ use of
1.16 ! jmc 1875: .Em new
1.1 deraadt 1876: and
1.16 ! jmc 1877: .Em delete
1.1 deraadt 1878: for creating and destroying dynamic objects.
1.16 ! jmc 1879: .Pp
1.1 deraadt 1880: Finally, the
1.16 ! jmc 1881: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1882: macro returns a
1.16 ! jmc 1883: .Dv YY_BUFFER_STATE
1.1 deraadt 1884: handle to the current buffer.
1.16 ! jmc 1885: .Pp
1.1 deraadt 1886: Here is an example of using these features for writing a scanner
1887: which expands include files (the
1.16 ! jmc 1888: .Aq Aq EOF
1.1 deraadt 1889: feature is discussed below):
1.16 ! jmc 1890: .Bd -literal -offset indent
! 1891: /*
! 1892: * the "incl" state is used for picking up the name
! 1893: * of an include file
! 1894: */
! 1895: %x incl
! 1896:
! 1897: %{
! 1898: #define MAX_INCLUDE_DEPTH 10
! 1899: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
! 1900: int include_stack_ptr = 0;
! 1901: %}
! 1902:
! 1903: %%
! 1904: include BEGIN(incl);
! 1905:
! 1906: [a-z]+ ECHO;
! 1907: [^a-z\en]*\en? ECHO;
! 1908:
! 1909: <incl>[ \et]* /* eat the whitespace */
! 1910: <incl>[^ \et\en]+ { /* got the include file name */
! 1911: if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
! 1912: errx(1, "Includes nested too deeply");
! 1913:
! 1914: include_stack[include_stack_ptr++] =
! 1915: YY_CURRENT_BUFFER;
! 1916:
! 1917: yyin = fopen(yytext, "r");
! 1918:
! 1919: if (yyin == NULL)
! 1920: err(1, NULL);
1.1 deraadt 1921:
1.16 ! jmc 1922: yy_switch_to_buffer(
! 1923: yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1924:
1.16 ! jmc 1925: BEGIN(INITIAL);
! 1926: }
1.1 deraadt 1927:
1.16 ! jmc 1928: <<EOF>> {
! 1929: if (--include_stack_ptr < 0)
1.1 deraadt 1930: yyterminate();
1.16 ! jmc 1931: else {
! 1932: yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1933: yy_switch_to_buffer(
1.16 ! jmc 1934: include_stack[include_stack_ptr]);
! 1935: }
! 1936: }
! 1937: .Ed
! 1938: .Pp
1.1 deraadt 1939: Three routines are available for setting up input buffers for
1.16 ! jmc 1940: scanning in-memory strings instead of files.
! 1941: All of them create a new input buffer for scanning the string,
! 1942: and return a corresponding
! 1943: .Dv YY_BUFFER_STATE
! 1944: handle (which should be deleted afterwards using
! 1945: .Fn yy_delete_buffer ) .
! 1946: They also switch to the new buffer using
! 1947: .Fn yy_switch_to_buffer ,
1.1 deraadt 1948: so the next call to
1.16 ! jmc 1949: .Fn yylex
1.1 deraadt 1950: will start scanning the string.
1.16 ! jmc 1951: .Bl -tag -width Ds
! 1952: .It yy_scan_string(const char *str)
! 1953: Scans a NUL-terminated string.
! 1954: .It yy_scan_bytes(const char *bytes, int len)
! 1955: Scans
! 1956: .Fa len
! 1957: bytes
! 1958: .Pq possibly including NULs
1.1 deraadt 1959: starting at location
1.16 ! jmc 1960: .Fa bytes .
! 1961: .El
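.Pp
As a minimal sketch of the string interface, the following scanner reads
from an in-memory string instead of a file.
The rules and the input text are illustrative assumptions only.

```lex
%option noyywrap
%%
[0-9]+      printf("number: %s\n", yytext);
.|\n        /* ignore everything else */
%%
int
main(void)
{
        YY_BUFFER_STATE buf;

        /* Creates a buffer holding a copy of the string and switches to it. */
        buf = yy_scan_string("abc 123 def 456");
        yylex();
        /* The handle should be released once scanning is finished. */
        yy_delete_buffer(buf);
        return 0;
}
```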
! 1962: .Pp
! 1963: Note that both of these functions create and scan a copy
! 1964: of the string or bytes.
! 1965: (This may be desirable, since
! 1966: .Fn yylex
! 1967: modifies the contents of the buffer it is scanning.)
! 1968: The copy can be avoided by using:
! 1969: .Bl -tag -width Ds
! 1970: .It yy_scan_buffer(char *base, yy_size_t size)
! 1971: Which scans the buffer starting at
! 1972: .Fa base ,
1.1 deraadt 1973: consisting of
1.16 ! jmc 1974: .Fa size
! 1975: bytes, the last two bytes of which must be
! 1976: .Dv YY_END_OF_BUFFER_CHAR
! 1977: .Pq ASCII NUL .
! 1978: These last two bytes are not scanned; thus, scanning consists of
! 1979: base[0] through base[size-3], inclusive.
! 1980: .Pp
! 1981: If
! 1982: .Fa base
! 1983: is not set up in this manner
! 1984: (i.e., it lacks the final two
! 1985: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 1986: bytes), then
1.16 ! jmc 1987: .Fn yy_scan_buffer
1.1 deraadt 1988: returns a nil pointer instead of creating a new input buffer.
1.16 ! jmc 1989: .Pp
1.1 deraadt 1990: The type
1.16 ! jmc 1991: .Fa yy_size_t
! 1992: is an integral type to which an integer expression
1.1 deraadt 1993: reflecting the size of the buffer can be cast.
1.16 ! jmc 1994: .El
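.Pp
The buffer layout expected by yy_scan_buffer() can be prepared with a small
helper such as the one sketched below.
The name make_scan_buffer is hypothetical and is not part of flex;
the sketch assumes the input is an ordinary NUL-terminated string.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not provided by flex): copy a NUL-terminated
 * string into a freshly allocated buffer that ends with the two NUL
 * bytes yy_scan_buffer() requires.  On success, *sizep holds the value
 * to pass as the "size" argument.  Returns NULL if allocation fails.
 */
char *
make_scan_buffer(const char *str, size_t *sizep)
{
        size_t len = strlen(str);
        char *base = malloc(len + 2);

        if (base == NULL)
                return NULL;
        memcpy(base, str, len);
        base[len] = '\0';       /* first end-of-buffer byte */
        base[len + 1] = '\0';   /* second end-of-buffer byte */
        *sizep = len + 2;
        return base;
}
```

Inside a scanner, the result would then be handed to yy_scan_buffer(base, size).
Because the memory was not allocated by flex, the caller should expect to
free it itself after calling yy_delete_buffer().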
! 1995: .Sh END-OF-FILE RULES
! 1996: The special rule
! 1997: .Qq Aq Aq EOF
! 1998: indicates actions which are to be taken when an end-of-file is encountered and
! 1999: .Fn yywrap
! 2000: returns non-zero
! 2001: .Pq i.e., indicates no further files to process .
! 2002: The action must finish by doing one of four things:
! 2003: .Bl -dash
! 2004: .It
! 2005: Assigning
! 2006: .Em yyin
! 2007: to a new input file
! 2008: (in previous versions of
! 2009: .Nm ,
! 2010: after doing the assignment, it was necessary to call the special action
! 2011: .Dv YY_NEW_FILE ;
! 2012: this is no longer necessary).
! 2013: .It
! 2014: Executing a
! 2015: .Em return
! 2016: statement.
! 2017: .It
! 2018: Executing the special
! 2019: .Fn yyterminate
! 2020: action.
! 2021: .It
! 2022: Switching to a new buffer using
! 2023: .Fn yy_switch_to_buffer
1.1 deraadt 2024: as shown in the example above.
1.16 ! jmc 2025: .El
! 2026: .Pp
! 2027: .Aq Aq EOF
! 2028: rules may not be used with other patterns;
! 2029: they may only be qualified with a list of start conditions.
! 2030: If an unqualified
! 2031: .Aq Aq EOF
! 2032: rule is given, it applies to all start conditions which do not already have
! 2033: .Aq Aq EOF
! 2034: actions.
! 2035: To specify an
! 2036: .Aq Aq EOF
! 2037: rule for only the initial start condition, use
! 2038: .Pp
! 2039: .Dl <INITIAL><<EOF>>
! 2040: .Pp
1.1 deraadt 2041: These rules are useful for catching things like unclosed comments.
2042: An example:
1.16 ! jmc 2043: .Bd -literal -offset indent
! 2044: %x quote
! 2045: %%
! 2046:
! 2047: \&...other rules for dealing with quotes...
! 2048:
! 2049: <quote><<EOF>> {
! 2050:         error("unterminated quote");
! 2051:         yyterminate();
! 2052: }
! 2053: <<EOF>> {
! 2054:         if (*++filelist)
! 2055:                 yyin = fopen(*filelist, "r");
! 2056:         else
! 2057:                 yyterminate();
! 2058: }
! 2059: .Ed
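.Pp
A fuller sketch of the file-list idea follows.
The variable filelist, the error handling, and the main() routine are
assumptions added for illustration; they are not taken from flex itself.

```lex
%option noyywrap
%{
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical: points at the name of the file currently being scanned. */
static char **filelist;
%}
%%
.|\n        ECHO;
<<EOF>>     {
                fclose(yyin);
                if (*++filelist != NULL) {
                        /* Continue with the next file named on the command line. */
                        yyin = fopen(*filelist, "r");
                        if (yyin == NULL) {
                                perror(*filelist);
                                exit(1);
                        }
                } else
                        yyterminate();
            }
%%
int
main(int argc, char *argv[])
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s file ...\n", argv[0]);
                return 1;
        }
        filelist = &argv[1];
        yyin = fopen(*filelist, "r");
        if (yyin == NULL) {
                perror(*filelist);
                return 1;
        }
        yylex();
        return 0;
}
```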
! 2060: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2061: The macro
1.16 ! jmc 2062: .Dv YY_USER_ACTION
1.1 deraadt 2063: can be defined to provide an action
1.16 ! jmc 2064: which is always executed prior to the matched rule's action.
! 2065: For example,
1.1 deraadt 2066: it could be #define'd to call a routine to convert yytext to lower-case.
2067: When
1.16 ! jmc 2068: .Dv YY_USER_ACTION
1.1 deraadt 2069: is invoked, the variable
1.16 ! jmc 2070: .Fa yy_act
! 2071: gives the number of the matched rule
! 2072: .Pq rules are numbered starting with 1 .
! 2073: For example, to profile how often each rule is matched,
! 2074: the following would do the trick:
! 2075: .Pp
! 2076: .Dl #define YY_USER_ACTION ++ctr[yy_act];
! 2077: .Pp
1.1 deraadt 2078: where
1.16 ! jmc 2079: .Fa ctr
! 2080: is an array to hold the counts for the different rules.
! 2081: Note that the macro
! 2082: .Dv YY_NUM_RULES
! 2083: gives the total number of rules
! 2084: (including the default rule, even if
! 2085: .Fl s
! 2086: is used),
1.1 deraadt 2087: so a correct declaration for
1.16 ! jmc 2088: .Fa ctr
1.1 deraadt 2089: is:
1.16 ! jmc 2090: .Pp
! 2091: .Dl int ctr[YY_NUM_RULES];
! 2092: .Pp
1.1 deraadt 2093: The macro
1.16 ! jmc 2094: .Dv YY_USER_INIT
1.1 deraadt 2095: may be defined to provide an action which is always executed before
1.16 ! jmc 2096: the first scan
! 2097: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2098: For example, it could be used to call a routine to read
2099: in a data table or open a logging file.
1.16 ! jmc 2100: .Pp
1.1 deraadt 2101: The macro
1.16 ! jmc 2102: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2103: can be used to control whether the current buffer is considered
1.16 ! jmc 2104: .Em interactive .
1.1 deraadt 2105: An interactive buffer is processed more slowly,
2106: but must be used when the scanner's input source is indeed
2107: interactive to avoid problems due to waiting to fill buffers
2108: (see the discussion of the
1.16 ! jmc 2109: .Fl I
! 2110: flag below).
! 2111: A non-zero value in the macro invocation marks the buffer as interactive,
! 2112: a zero value as non-interactive.
! 2113: Note that use of this macro overrides
! 2114: .Dq %option always-interactive
! 2115: or
! 2116: .Dq %option never-interactive
! 2117: (see
! 2118: .Sx OPTIONS
! 2119: below).
! 2120: .Fn yy_set_interactive
1.1 deraadt 2121: must be invoked prior to beginning to scan the buffer that is
1.16 ! jmc 2122: .Pq or is not
! 2123: to be considered interactive.
! 2124: .Pp
1.1 deraadt 2125: The macro
1.16 ! jmc 2126: .Dv yy_set_bol(at_bol)
1.1 deraadt 2127: can be used to control whether the current buffer's scanning
2128: context for the next token match is done as though at the
1.16 ! jmc 2129: beginning of a line.
! 2130: A non-zero macro argument makes rules anchored with
! 2131: .Sq ^
! 2132: active, while a zero argument makes
! 2133: .Sq ^
! 2134: rules inactive.
! 2135: .Pp
1.1 deraadt 2136: The macro
1.16 ! jmc 2137: .Dv YY_AT_BOL
! 2138: returns true if the next token scanned from the current buffer will have
! 2139: .Sq ^
! 2140: rules active, false otherwise.
! 2141: .Pp
1.1 deraadt 2142: In the generated scanner, the actions are all gathered in one large
2143: switch statement and separated using
1.16 ! jmc 2144: .Dv YY_BREAK ,
! 2145: which may be redefined.
! 2146: By default, it is simply a
! 2147: .Qq break ,
! 2148: to separate each rule's action from the following rule's.
1.1 deraadt 2149: Redefining
1.16 ! jmc 2150: .Dv YY_BREAK
1.1 deraadt 2151: allows, for example, C++ users to
1.16 ! jmc 2152: .Dq #define YY_BREAK
! 2153: to do nothing
! 2154: (while being very careful that every rule ends with a
! 2155: .Qq break
! 2156: or a
! 2157: .Qq return ! )
! 2158: to avoid unreachable statement warnings where, because a rule's
! 2159: action ends with
! 2160: .Dq return ,
! 2161: the
! 2162: .Dv YY_BREAK
1.1 deraadt 2163: is inaccessible.
1.16 ! jmc 2164: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2165: This section summarizes the various values available to the user
2166: in the rule actions.
1.16 ! jmc 2167: .Bl -tag -width Ds
! 2168: .It char *yytext
! 2169: Holds the text of the current token.
! 2170: It may be modified but not lengthened
! 2171: .Pq characters cannot be appended to the end .
! 2172: .Pp
1.1 deraadt 2173: If the special directive
1.16 ! jmc 2174: .Dq %array
1.1 deraadt 2175: appears in the first section of the scanner description, then
1.16 ! jmc 2176: .Fa yytext
1.1 deraadt 2177: is instead declared
1.16 ! jmc 2178: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2179: where
1.16 ! jmc 2180: .Dv YYLMAX
! 2181: is a macro definition that can be redefined in the first section
! 2182: to change the default value
! 2183: .Pq generally 8KB .
! 2184: Using
! 2185: .Dq %array
1.1 deraadt 2186: results in somewhat slower scanners, but the value of
1.16 ! jmc 2187: .Fa yytext
1.1 deraadt 2188: becomes immune to calls to
1.16 ! jmc 2189: .Fn input
1.1 deraadt 2190: and
1.16 ! jmc 2191: .Fn unput ,
1.1 deraadt 2192: which potentially destroy its value when
1.16 ! jmc 2193: .Fa yytext
! 2194: is a character pointer.
! 2195: The opposite of
! 2196: .Dq %array
1.1 deraadt 2197: is
1.16 ! jmc 2198: .Dq %pointer ,
1.1 deraadt 2199: which is the default.
1.16 ! jmc 2200: .Pp
! 2201: .Dq %array
! 2202: cannot be used when generating C++ scanner classes
1.1 deraadt 2203: (the
1.16 ! jmc 2204: .Fl +
1.1 deraadt 2205: flag).
1.16 ! jmc 2206: .It int yyleng
! 2207: Holds the length of the current token.
! 2208: .It FILE *yyin
! 2209: Is the file which by default
! 2210: .Nm
! 2211: reads from.
! 2212: It may be redefined, but doing so only makes sense before
! 2213: scanning begins or after an
! 2214: .Dv EOF
! 2215: has been encountered.
! 2216: Changing it in the midst of scanning will have unexpected results since
! 2217: .Nm
1.1 deraadt 2218: buffers its input; use
1.16 ! jmc 2219: .Fn yyrestart
1.1 deraadt 2220: instead.
2221: Once scanning terminates because an end-of-file
1.16 ! jmc 2222: has been seen,
! 2223: .Fa yyin
! 2224: can be assigned as the new input file
! 2225: and the scanner can be called again to continue scanning.
! 2226: .It void yyrestart(FILE *new_file)
! 2227: May be called to point
! 2228: .Fa yyin
! 2229: at the new input file.
! 2230: The switch-over to the new file is immediate
! 2231: .Pq any previously buffered-up input is lost .
! 2232: Note that calling
! 2233: .Fn yyrestart
1.1 deraadt 2234: with
1.16 ! jmc 2235: .Fa yyin
1.1 deraadt 2236: as an argument thus throws away the current input buffer and continues
2237: scanning the same input file.
1.16 ! jmc 2238: .It FILE *yyout
! 2239: Is the file to which
! 2240: .Em ECHO
! 2241: actions are done.
! 2242: It can be reassigned by the user.
! 2243: .It YY_CURRENT_BUFFER
! 2244: Returns a
! 2245: .Dv YY_BUFFER_STATE
1.1 deraadt 2246: handle to the current buffer.
1.16 ! jmc 2247: .It YY_START
! 2248: Returns an integer value corresponding to the current start condition.
! 2249: This value can subsequently be used with
! 2250: .Em BEGIN
1.1 deraadt 2251: to return to that start condition.
1.16 ! jmc 2252: .El
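.Pp
As a small sketch of how YY_START and BEGIN work together,
the scanner below remembers the start condition that was active when a
comment began and returns to it when the comment ends.
The variable comment_caller and the comment syntax are illustrative assumptions.

```lex
%option noyywrap
%x comment
%{
/* Hypothetical variable remembering the start condition to return to. */
static int comment_caller;
%}
%%
"/*"            { comment_caller = YY_START; BEGIN(comment); }
<comment>"*/"   BEGIN(comment_caller);
<comment>.|\n   /* discard the comment text */
%%
int
main(void)
{
        /* Text outside comments falls through to the default rule and is echoed. */
        return yylex();
}
```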
! 2253: .Sh INTERFACING WITH YACC
1.1 deraadt 2254: One of the main uses of
1.16 ! jmc 2255: .Nm
1.1 deraadt 2256: is as a companion to the
1.16 ! jmc 2257: .Xr yacc 1
1.1 deraadt 2258: parser-generator.
1.16 ! jmc 2259: yacc parsers expect to call a routine named
! 2260: .Fn yylex
! 2261: to find the next input token.
! 2262: The routine is supposed to return the type of the next token
! 2263: as well as putting any associated value in the global
! 2264: .Fa yylval .
1.1 deraadt 2265: To use
1.16 ! jmc 2266: .Nm
! 2267: with yacc, one specifies the
! 2268: .Fl d
! 2269: option to yacc to instruct it to generate the file
! 2270: .Pa y.tab.h
1.1 deraadt 2271: containing definitions of all the
1.16 ! jmc 2272: .Dq %tokens
! 2273: appearing in the yacc input.
! 2274: This file is then included in the
! 2275: .Nm
! 2276: scanner.
! 2277: For example, if one of the tokens is
! 2278: .Qq TOK_NUMBER ,
1.1 deraadt 2279: part of the scanner might look like:
1.16 ! jmc 2280: .Bd -literal -offset indent
! 2281: %{
! 2282: #include "y.tab.h"
! 2283: %}
! 2284:
! 2285: %%
! 2286:
! 2287: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
! 2288: .Ed
! 2289: .Sh OPTIONS
! 2290: .Nm
1.1 deraadt 2291: has the following options:
1.16 ! jmc 2292: .Bl -tag -width Ds
! 2293: .It Fl 7
! 2294: Instructs
! 2295: .Nm
! 2296: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
! 2297: characters in its input.
! 2298: The advantage of using
! 2299: .Fl 7
1.1 deraadt 2300: is that the scanner's tables can be up to half the size of those generated
2301: using the
1.16 ! jmc 2302: .Fl 8
! 2303: option
! 2304: .Pq see below .
! 2305: The disadvantage is that such scanners often hang
1.1 deraadt 2306: or crash if their input contains an 8-bit character.
1.16 ! jmc 2307: .Pp
! 2308: Note, however, that unless generating a scanner using the
! 2309: .Fl Cf
1.1 deraadt 2310: or
1.16 ! jmc 2311: .Fl CF
1.1 deraadt 2312: table compression options, use of
1.16 ! jmc 2313: .Fl 7
! 2314: will save only a small amount of table space,
! 2315: and make the scanner considerably less portable.
! 2316: .Nm flex Ns 's
! 2317: default behavior is to generate an 8-bit scanner unless
! 2318: .Fl Cf
! 2319: or
! 2320: .Fl CF
! 2321: is specified, in which case
! 2322: .Nm
! 2323: defaults to generating 7-bit scanners unless it was
! 2324: configured to generate 8-bit scanners
! 2325: (as will often be the case with non-USA sites).
! 2326: It is possible to tell whether
! 2327: .Nm
! 2328: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
! 2329: .Fl v
! 2330: output as described below.
! 2331: .Pp
! 2332: Note that if
! 2333: .Fl Cfe
! 2334: or
! 2335: .Fl CFe
! 2336: is used
! 2337: (the table compression options, but also using equivalence classes as
! 2338: discussed below),
! 2339: .Nm
! 2340: still defaults to generating an 8-bit scanner,
! 2341: since usually with these compression options full 8-bit tables
1.1 deraadt 2342: are not much more expensive than 7-bit tables.
1.16 ! jmc 2343: .It Fl 8
! 2344: Instructs
! 2345: .Nm
1.1 deraadt 2346: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 ! jmc 2347: characters.
! 2348: This flag is only needed for scanners generated using
! 2349: .Fl Cf
1.1 deraadt 2350: or
1.16 ! jmc 2351: .Fl CF ,
! 2352: as otherwise
! 2353: .Nm
! 2354: defaults to generating an 8-bit scanner anyway.
! 2355: .Pp
1.1 deraadt 2356: See the discussion of
1.16 ! jmc 2357: .Fl 7
! 2358: above for
! 2359: .Nm flex Ns 's
! 2360: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
! 2361: .It Fl B
! 2362: Instructs
! 2363: .Nm
! 2364: to generate a
! 2365: .Em batch
! 2366: scanner, the opposite of
! 2367: .Em interactive
! 2368: scanners generated by
! 2369: .Fl I
! 2370: .Pq see below .
! 2371: In general,
! 2372: .Fl B
! 2373: is used when the scanner will never be used interactively,
! 2374: and you want to squeeze a little more performance out of it.
! 2375: If the aim is instead to squeeze out a lot more performance,
! 2376: use the
! 2377: .Fl Cf
! 2378: or
! 2379: .Fl CF
! 2380: options
! 2381: .Pq discussed below ,
! 2382: which turn on
! 2383: .Fl B
! 2384: automatically anyway.
! 2385: .It Fl b
! 2386: Generate backing-up information to
! 2387: .Pa lex.backup .
! 2388: This is a list of scanner states which require backing up
! 2389: and the input characters on which they do so.
! 2390: By adding rules one can remove backing-up states.
! 2391: If all backing-up states are eliminated and
! 2392: .Fl Cf
! 2393: or
! 2394: .Fl CF
! 2395: is used, the generated scanner will run faster (see the
! 2396: .Fl p
! 2397: flag).
! 2398: Only users who wish to squeeze every last cycle out of their
! 2399: scanners need worry about this option.
! 2400: (See the section on
! 2401: .Sx PERFORMANCE CONSIDERATIONS
! 2402: below.)
! 2403: .It Fl C Ns Op Cm aeFfmr
! 2404: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2405: between small scanners and fast scanners.
1.16 ! jmc 2406: .Bl -tag -width Ds
! 2407: .It Fl Ca
! 2408: Instructs
! 2409: .Nm
! 2410: to trade off larger tables in the generated scanner for faster performance
! 2411: because the elements of the tables are better aligned for memory access
! 2412: and computation.
! 2413: On some
! 2414: .Tn RISC
! 2415: architectures, fetching and manipulating longwords is more efficient
! 2416: than with smaller-sized units such as shortwords.
! 2417: This option can double the size of the tables used by the scanner.
! 2418: .It Fl Ce
! 2419: Directs
! 2420: .Nm
1.1 deraadt 2421: to construct
1.16 ! jmc 2422: .Em equivalence classes ,
! 2423: i.e., sets of characters which have identical lexical properties
! 2424: (for example, if the only appearance of digits in the
! 2425: .Nm
1.1 deraadt 2426: input is in the character class
1.16 ! jmc 2427: .Qq [0-9]
! 2428: then the digits
! 2429: .Sq 0 ,
! 2430: .Sq 1 ,
! 2431: .Sq ... ,
! 2432: .Sq 9
! 2433: will all be put in the same equivalence class).
! 2434: Equivalence classes usually give dramatic reductions in the final
! 2435: table/object file sizes
! 2436: .Pq typically a factor of 2\-5
! 2437: and are pretty cheap performance-wise
! 2438: .Pq one array look-up per character scanned .
! 2439: .It Fl CF
! 2440: Specifies that the alternate fast scanner representation
! 2441: (described below under the
! 2442: .Fl F
! 2443: option)
! 2444: should be used.
! 2445: This option cannot be used with
! 2446: .Fl + .
! 2447: .It Fl Cf
! 2448: Specifies that the
! 2449: .Em full
! 2450: scanner tables should be generated \-
! 2451: .Nm
! 2452: should not compress the tables by taking advantage of
! 2453: similar transition functions for different states.
! 2454: .It Fl \&Cm
! 2455: Directs
! 2456: .Nm
1.1 deraadt 2457: to construct
1.16 ! jmc 2458: .Em meta-equivalence classes ,
! 2459: which are sets of equivalence classes
! 2460: (or characters, if equivalence classes are not being used)
! 2461: that are commonly used together.
! 2462: Meta-equivalence classes are often a big win when using compressed tables,
! 2463: but they have a moderate performance impact
! 2464: (one or two
! 2465: .Qq if
! 2466: tests and one array look-up per character scanned).
! 2467: .It Fl Cr
! 2468: Causes the generated scanner to
! 2469: .Em bypass
! 2470: use of the standard I/O library
! 2471: .Pq stdio
! 2472: for input.
! 2473: Instead of calling
! 2474: .Xr fread 3
1.1 deraadt 2475: or
1.16 ! jmc 2476: .Xr getc 3 ,
1.1 deraadt 2477: the scanner will use the
1.16 ! jmc 2478: .Xr read 2
! 2479: system call,
! 2480: resulting in a performance gain which varies from system to system,
! 2481: but in general is probably negligible unless
! 2482: .Fl Cf
1.1 deraadt 2483: or
1.16 ! jmc 2484: .Fl CF
! 2485: is being used.
1.1 deraadt 2486: Using
1.16 ! jmc 2487: .Fl Cr
! 2488: can cause strange behavior if, for example, the program reads from
! 2489: .Fa yyin
! 2490: using stdio prior to calling the scanner
! 2491: (because the scanner will miss whatever text previous reads left
! 2492: in the stdio input buffer).
! 2493: .Pp
! 2494: .Fl Cr
! 2495: has no effect if
! 2496: .Dv YY_INPUT
! 2497: is defined
! 2498: (see
! 2499: .Sx THE GENERATED SCANNER
! 2500: above).
! 2501: .El
! 2502: .Pp
1.1 deraadt 2503: A lone
1.16 ! jmc 2504: .Fl C
1.1 deraadt 2505: specifies that the scanner tables should be compressed but neither
2506: equivalence classes nor meta-equivalence classes should be used.
1.16 ! jmc 2507: .Pp
1.1 deraadt 2508: The options
1.16 ! jmc 2509: .Fl Cf
1.1 deraadt 2510: or
1.16 ! jmc 2511: .Fl CF
1.1 deraadt 2512: and
1.16 ! jmc 2513: .Fl \&Cm
! 2514: do not make sense together \- there is no opportunity for meta-equivalence
! 2515: classes if the table is not being compressed.
! 2516: Otherwise the options may be freely mixed, and are cumulative.
! 2517: .Pp
1.1 deraadt 2518: The default setting is
1.16 ! jmc 2519: .Fl Cem
1.1 deraadt 2520: which specifies that
1.16 ! jmc 2521: .Nm
! 2522: should generate equivalence classes and meta-equivalence classes.
! 2523: This setting provides the highest degree of table compression.
! 2524: It is possible to obtain faster-executing scanners at the cost of
! 2525: larger tables, with the following generally being true:
! 2526: .Bd -unfilled -offset indent
! 2527: slowest & smallest
! 2528:       -Cem
! 2529:       -Cm
! 2530:       -Ce
! 2531:       -C
! 2532:       -C{f,F}e
! 2533:       -C{f,F}
! 2534:       -C{f,F}a
! 2535: fastest & largest
! 2536: .Ed
! 2537: .Pp
1.1 deraadt 2538: Note that scanners with the smallest tables are usually generated and
1.16 ! jmc 2539: compiled the quickest,
! 2540: so during development the default, maximal compression,
! 2541: is usually best.
! 2542: .Pp
! 2543: .Fl Cfe
! 2544: is often a good compromise between speed and size for production scanners.
! 2545: .It Fl c
! 2546: A do-nothing, deprecated option included for
! 2547: .Tn POSIX
! 2548: compliance.
! 2549: .It Fl d
! 2550: Makes the generated scanner run in debug mode.
! 2551: Whenever a pattern is recognized and the global
! 2552: .Fa yy_flex_debug
! 2553: is non-zero
! 2554: .Pq which is the default ,
! 2555: the scanner will write to stderr a line of the form:
! 2556: .Pp
! 2557: .D1 --accepting rule at line 53 ("the matched text")
! 2558: .Pp
! 2559: The line number refers to the location of the rule in the file
! 2560: defining the scanner
! 2561: (i.e., the file that was fed to
! 2562: .Nm ) .
! 2563: Messages are also generated when the scanner backs up,
! 2564: accepts the default rule,
! 2565: reaches the end of its input buffer
! 2566: (or encounters a NUL;
! 2567: at this point, the two look the same as far as the scanner's concerned),
! 2568: or reaches an end-of-file.
! 2569: .It Fl F
! 2570: Specifies that the fast scanner table representation should be used
! 2571: .Pq and stdio bypassed .
! 2572: This representation is about as fast as the full table representation
! 2573: .Pq Fl f ,
! 2574: and for some sets of patterns will be considerably smaller
! 2575: .Pq and for others, larger .
! 2576: In general, if the pattern set contains both
! 2577: .Qq keywords
! 2578: and a catch-all,
! 2579: .Qq identifier
! 2580: rule, such as in the set:
! 2581: .Bd -unfilled -offset indent
! 2582: "case" return TOK_CASE;
! 2583: "switch" return TOK_SWITCH;
! 2584: \&...
! 2585: "default" return TOK_DEFAULT;
! 2586: [a-z]+ return TOK_ID;
! 2587: .Ed
! 2588: .Pp
! 2589: then it's better to use the full table representation.
! 2590: If only the
! 2591: .Qq identifier
! 2592: rule is present and a hash table or some such is used to detect the keywords,
! 2593: it's better to use
! 2594: .Fl F .
! 2595: .Pp
! 2596: This option is equivalent to
! 2597: .Fl CFr
! 2598: .Pq see above .
! 2599: It cannot be used with
! 2600: .Fl + .
! 2601: .It Fl f
! 2602: Specifies
! 2603: .Em fast scanner .
! 2604: No table compression is done and stdio is bypassed.
! 2605: The result is large but fast.
! 2606: This option is equivalent to
! 2607: .Fl Cfr
! 2608: .Pq see above .
! 2609: .It Fl h
! 2610: Generates a help summary of
! 2611: .Nm flex Ns 's
! 2612: options to stdout and then exits.
! 2613: .Fl ?\&
! 2614: and
! 2615: .Fl Fl help
! 2616: are synonyms for
! 2617: .Fl h .
! 2618: .It Fl I
! 2619: Instructs
! 2620: .Nm
! 2621: to generate an
! 2622: .Em interactive
! 2623: scanner.
! 2624: An interactive scanner is one that only looks ahead to decide
! 2625: what token has been matched if it absolutely must.
! 2626: It turns out that always looking one extra character ahead,
! 2627: even if the scanner has already seen enough text
! 2628: to disambiguate the current token, is a bit faster than
! 2629: only looking ahead when necessary.
! 2630: But scanners that always look ahead give dreadful interactive performance;
! 2631: for example, when a user types a newline,
! 2632: it is not recognized as a newline token until they enter
! 2633: .Em another
! 2634: token, which often means typing in another whole line.
! 2635: .Pp
! 2636: .Nm
! 2637: scanners default to
! 2638: .Em interactive
! 2639: unless
! 2640: .Fl Cf
! 2641: or
! 2642: .Fl CF
! 2643: table-compression options are specified
! 2644: .Pq see above .
! 2645: That's because if high performance is most important,
! 2646: one of these options should be used,
! 2647: so if they weren't,
! 2648: .Nm
! 2649: assumes it is preferable to trade off a bit of run-time performance for
! 2650: intuitive interactive behavior.
! 2651: Note also that
! 2652: .Fl I
! 2653: cannot be used in conjunction with
! 2654: .Fl Cf
! 2655: or
! 2656: .Fl CF .
! 2657: Thus, this option is not really needed; it is on by default for all those
! 2658: cases in which it is allowed.
! 2659: .Pp
! 2660: A scanner can be forced to not be interactive by using
! 2661: .Fl B
! 2662: .Pq see above .
! 2663: .It Fl i
! 2664: Instructs
! 2665: .Nm
! 2666: to generate a case-insensitive scanner.
! 2667: The case of letters given in the
! 2668: .Nm
! 2669: input patterns will be ignored,
! 2670: and tokens in the input will be matched regardless of case.
! 2671: The matched text given in
! 2672: .Fa yytext
! 2673: will have the preserved case
! 2674: .Pq i.e., it will not be folded .
! 2675: .It Fl L
! 2676: Instructs
! 2677: .Nm
! 2678: not to generate
! 2679: .Dq #line
! 2680: directives.
! 2681: Without this option,
! 2682: .Nm
! 2683: peppers the generated scanner with #line directives so error messages
! 2684: in the actions will be correctly located with respect to either the original
! 2685: .Nm
! 2686: input file
! 2687: (if the errors are due to code in the input file),
! 2688: or
! 2689: .Pa lex.yy.c
! 2690: (if the errors are
! 2691: .Nm flex Ns 's
! 2692: fault \- these sorts of errors should be reported to the email address
! 2693: given below).
! 2694: .It Fl l
! 2695: Turns on maximum compatibility with the original AT&T
! 2696: .Nm lex
! 2697: implementation.
! 2698: Note that this does not mean full compatibility.
! 2699: Use of this option costs a considerable amount of performance,
! 2700: and it cannot be used with the
! 2701: .Fl + , f , F , Cf ,
! 2702: or
! 2703: .Fl CF
! 2704: options.
! 2705: For details on the compatibilities it provides, see the section
! 2706: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
! 2707: below.
! 2708: This option also results in the name
! 2709: .Dv YY_FLEX_LEX_COMPAT
! 2710: being #define'd in the generated scanner.
! 2711: .It Fl n
! 2712: Another do-nothing, deprecated option included only for
! 2713: .Tn POSIX
! 2714: compliance.
! 2715: .It Fl o Ns Ar output
! 2716: Directs
! 2717: .Nm
! 2718: to write the scanner to the file
! 2719: .Ar output
1.1 deraadt 2720: instead of
1.16 ! jmc 2721: .Pa lex.yy.c .
! 2722: If
! 2723: .Fl o
! 2724: is combined with the
! 2725: .Fl t
! 2726: option, then the scanner is written to stdout but its
! 2727: .Dq #line
! 2728: directives
! 2729: (see the
! 2730: .Fl L
! 2731: option above)
! 2732: refer to the file
! 2733: .Ar output .
! 2734: .It Fl P Ns Ar prefix
! 2735: Changes the default
! 2736: .Qq yy
1.1 deraadt 2737: prefix used by
1.16 ! jmc 2738: .Nm
1.6 aaron 2739: for all globally visible variable and function names to instead be
1.16 ! jmc 2740: .Ar prefix .
1.1 deraadt 2741: For example,
1.16 ! jmc 2742: .Fl P Ns Ar foo
1.1 deraadt 2743: changes the name of
1.16 ! jmc 2744: .Fa yytext
1.1 deraadt 2745: to
1.16 ! jmc 2746: .Fa footext .
1.1 deraadt 2747: It also changes the name of the default output file from
1.16 ! jmc 2748: .Pa lex.yy.c
1.1 deraadt 2749: to
1.16 ! jmc 2750: .Pa lex.foo.c .
1.1 deraadt 2751: Here are all of the names affected:
1.16 ! jmc 2752: .Bd -unfilled -offset indent
! 2753: yy_create_buffer
! 2754: yy_delete_buffer
! 2755: yy_flex_debug
! 2756: yy_init_buffer
! 2757: yy_flush_buffer
! 2758: yy_load_buffer_state
! 2759: yy_switch_to_buffer
! 2760: yyin
! 2761: yyleng
! 2762: yylex
! 2763: yylineno
! 2764: yyout
! 2765: yyrestart
! 2766: yytext
! 2767: yywrap
! 2768: .Ed
! 2769: .Pp
! 2770: (If using a C++ scanner, then only
! 2771: .Fa yywrap
1.1 deraadt 2772: and
1.16 ! jmc 2773: .Fa yyFlexLexer
1.1 deraadt 2774: are affected.)
1.16 ! jmc 2775: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2776: and functions using either version of their name; but externally, they
2777: have the modified name.
1.16 ! jmc 2778: .Pp
! 2779: This option allows multiple
! 2780: .Nm
! 2781: programs to be easily linked together into the same executable.
! 2782: Note, though, that using this option also renames
! 2783: .Fn yywrap ,
! 2784: so now either an
! 2785: .Pq appropriately named
! 2786: version of the routine for the scanner must be supplied, or
! 2787: .Dq %option noyywrap
! 2788: must be used, as linking with
! 2789: .Fl lfl
! 2790: no longer provides one by default.
! 2791: .It Fl p
! 2792: Generates a performance report to stderr.
! 2793: The report consists of comments regarding features of the
! 2794: .Nm
! 2795: input file which will cause a serious loss of performance in the resulting
! 2796: scanner.
! 2797: If the flag is specified twice,
! 2798: comments regarding features that lead to minor performance losses
! 2799: will also be reported.
! 2800: .Pp
! 2801: Note that the use of
! 2802: .Em REJECT ,
! 2803: .Dq %option yylineno ,
! 2804: and variable trailing context
! 2805: (see the
! 2806: .Sx BUGS
! 2807: section below)
! 2808: entails a substantial performance penalty; use of
! 2809: .Fn yymore ,
! 2810: the
! 2811: .Sq ^
! 2812: operator, and the
! 2813: .Fl I
! 2814: flag entail minor performance penalties.
! 2815: .It Fl S Ns Ar skeleton
! 2816: Overrides the default skeleton file from which
! 2817: .Nm
! 2818: constructs its scanners.
! 2819: This option is needed only for
! 2820: .Nm
1.1 deraadt 2821: maintenance or development.
1.16 ! jmc 2822: .It Fl s
! 2823: Causes the default rule
! 2824: .Pq that unmatched scanner input is echoed to stdout
! 2825: to be suppressed.
! 2826: If the scanner encounters input that does not
! 2827: match any of its rules, it aborts with an error.
! 2828: This option is useful for finding holes in a scanner's rule set.
! 2829: .It Fl T
! 2830: Makes
! 2831: .Nm
! 2832: run in
! 2833: .Em trace
! 2834: mode.
! 2835: It will generate a lot of messages to stderr concerning
! 2836: the form of the input and the resultant non-deterministic and deterministic
! 2837: finite automata.
! 2838: This option is mostly for use in maintaining
! 2839: .Nm .
! 2840: .It Fl t
! 2841: Instructs
! 2842: .Nm
! 2843: to write the scanner it generates to standard output instead of
! 2844: .Pa lex.yy.c .
! 2845: .It Fl V
! 2846: Prints the version number to stdout and exits.
! 2847: .Fl Fl version
! 2848: is a synonym for
! 2849: .Fl V .
! 2850: .It Fl v
! 2851: Specifies that
! 2852: .Nm
! 2853: should write to stderr
! 2854: a summary of statistics regarding the scanner it generates.
! 2855: Most of the statistics are meaningless to the casual
! 2856: .Nm
! 2857: user, but the first line identifies the version of
! 2858: .Nm
! 2859: (same as reported by
! 2860: .Fl V ) ,
! 2861: and the next line the flags used when generating the scanner,
! 2862: including those that are on by default.
! 2863: .It Fl w
! 2864: Suppresses warning messages.
! 2865: .It Fl +
! 2866: Specifies that
! 2867: .Nm
! 2868: should generate a C++ scanner class.
! 2869: See the section on
! 2870: .Sx GENERATING C++ SCANNERS
! 2871: below for details.
! 2872: .El
! 2873: .Pp
! 2874: .Nm
1.1 deraadt 2875: also provides a mechanism for controlling options within the
1.16 ! jmc 2876: scanner specification itself, rather than from the
! 2877: .Nm
! 2878: command-line.
1.1 deraadt 2879: This is done by including
1.16 ! jmc 2880: .Dq %option
1.1 deraadt 2881: directives in the first section of the scanner specification.
1.16 ! jmc 2882: Multiple options can be specified with a single
! 2883: .Dq %option
! 2884: directive, and multiple directives may appear in the first section of the
! 2885: .Nm
! 2886: input file.
! 2887: .Pp
! 2888: Most options are given simply as names, optionally preceded by the word
! 2889: .Qq no
! 2890: .Pq with no intervening whitespace
! 2891: to negate their meaning.
! 2892: A number are equivalent to
! 2893: .Nm
! 2894: flags or their negation:
! 2895: .Bd -unfilled -offset indent
! 2896: 7bit            -7 option
! 2897: 8bit            -8 option
! 2898: align           -Ca option
! 2899: backup          -b option
! 2900: batch           -B option
! 2901: c++             -+ option
! 2902:
! 2903: caseful or
! 2904: case-sensitive  opposite of -i (default)
! 2905:
! 2906: case-insensitive or
! 2907: caseless        -i option
! 2908:
! 2909: debug           -d option
! 2910: default         opposite of -s option
! 2911: ecs             -Ce option
! 2912: fast            -F option
! 2913: full            -f option
! 2914: interactive     -I option
! 2915: lex-compat      -l option
! 2916: meta-ecs        -Cm option
! 2917: perf-report     -p option
! 2918: read            -Cr option
! 2919: stdout          -t option
! 2920: verbose         -v option
! 2921: warn            opposite of -w option
! 2922:                 (use "%option nowarn" for -w)
! 2923:
! 2924: array           equivalent to "%array"
! 2925: pointer         equivalent to "%pointer" (default)
! 2926: .Ed
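.Pp
As an illustration only, a specification that wants an 8-bit,
case-insensitive scanner with the default rule suppressed might begin
as follows; the particular combination of options is an assumption
made for the sake of the example:
.Bd -literal -offset indent
%option 8bit caseless
%option nodefault

%%
.Ed
.Pp
Under the equivalences listed above, these directives correspond to invoking
.Nm
with the
.Fl 8 ,
.Fl i ,
and
.Fl s
flags.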
! 2927: .Pp
! 2928: Some %option's provide features otherwise not available:
! 2929: .Bl -tag -width Ds
! 2930: .It always-interactive
! 2931: Instructs
! 2932: .Nm
! 2933: to generate a scanner which always considers its input
! 2934: .Qq interactive .
! 2935: Normally, on each new input file the scanner calls
! 2936: .Fn isatty
! 2937: in an attempt to determine whether the scanner's input source is interactive
! 2938: and thus should be read a character at a time.
! 2939: When this option is used, however, no such call is made.
! 2940: .It main
! 2941: Directs
! 2942: .Nm
! 2943: to provide a default
! 2944: .Fn main
1.1 deraadt 2945: program for the scanner, which simply calls
1.16 ! jmc 2946: .Fn yylex .
1.1 deraadt 2947: This option implies
1.16 ! jmc 2948: .Dq noyywrap
! 2949: .Pq see below .
! 2950: .It never-interactive
! 2951: Instructs
! 2952: .Nm
! 2953: to generate a scanner which never considers its input
! 2954: .Qq interactive
! 2955: (again, no call made to
! 2956: .Fn isatty ) .
1.1 deraadt 2957: This is the opposite of
1.16 ! jmc 2958: .Dq always-interactive .
! 2959: .It stack
! 2960: Enables the use of start condition stacks
! 2961: (see
! 2962: .Sx START CONDITIONS
! 2963: above).
! 2964: .It stdinit
! 2965: If set (i.e.,
! 2966: .Dq %option stdinit ) ,
1.1 deraadt 2967: initializes
1.16 ! jmc 2968: .Fa yyin
1.1 deraadt 2969: and
1.16 ! jmc 2970: .Fa yyout
! 2971: to stdin and stdout, instead of the default of
! 2972: .Dq nil .
1.1 deraadt 2973: Some existing
1.16 ! jmc 2974: .Nm lex
! 2975: programs depend on this behavior, even though it is not compliant with ANSI C,
! 2976: which does not require stdin and stdout to be compile-time constant.
! 2977: .It yylineno
! 2978: Directs
! 2979: .Nm
1.1 deraadt 2980: to generate a scanner that maintains the number of the current line
2981: read from its input in the global variable
1.16 ! jmc 2982: .Fa yylineno .
1.1 deraadt 2983: This option is implied by
1.16 ! jmc 2984: .Dq %option lex-compat .
! 2985: .It yywrap
! 2986: If unset (i.e.,
! 2987: .Dq %option noyywrap ) ,
1.1 deraadt 2988: makes the scanner not call
1.16 ! jmc 2989: .Fn yywrap
! 2990: upon an end-of-file, but simply assume that there are no more files to scan
! 2991: (until the user points
! 2992: .Fa yyin
1.1 deraadt 2993: at a new file and calls
1.16 ! jmc 2994: .Fn yylex
1.1 deraadt 2995: again).
1.16 ! jmc 2996: .El
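.Pp
The following sketch shows how
.Dq %option main
might be used to build a complete filter without writing a
.Fn main
routine by hand.
The rules themselves are only an example:
.Bd -literal -offset indent
%option main
%{
#include <stdio.h>
%}

%%
[a-z]+    printf("word: %s\en", yytext);
\&.|\en     /* discard everything else */
.Ed
.Pp
Because
.Dq main
implies
.Dq noyywrap ,
no
.Fn yywrap
routine has to be supplied.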
! 2997: .Pp
! 2998: .Nm
! 2999: scans rule actions to determine whether the
! 3000: .Em REJECT
! 3001: or
! 3002: .Fn yymore
! 3003: features are being used.
! 3004: The
! 3005: .Dq reject
1.1 deraadt 3006: and
1.16 ! jmc 3007: .Dq yymore
! 3008: options are available to override its decision as to whether to use the
1.1 deraadt 3009: options, either by setting them (e.g.,
1.16 ! jmc 3010: .Dq %option reject )
! 3011: to indicate the feature is indeed used,
! 3012: or unsetting them to indicate it actually is not used
1.1 deraadt 3013: (e.g.,
1.16 ! jmc 3014: .Dq %option noyymore ) .
! 3015: .Pp
! 3016: Three options take string-delimited values, offset with
! 3017: .Sq = :
! 3018: .Pp
! 3019: .D1 %option outfile="ABC"
! 3020: .Pp
1.1 deraadt 3021: is equivalent to
1.16 ! jmc 3022: .Fl o Ns Ar ABC ,
1.1 deraadt 3023: and
1.16 ! jmc 3024: .Pp
! 3025: .D1 %option prefix="XYZ"
! 3026: .Pp
1.1 deraadt 3027: is equivalent to
1.16 ! jmc 3028: .Fl P Ns Ar XYZ .
1.1 deraadt 3029: Finally,
1.16 ! jmc 3030: .Pp
! 3031: .D1 %option yyclass="foo"
! 3032: .Pp
! 3033: only applies when generating a C++ scanner
! 3034: .Pf ( Fl +
! 3035: option).
! 3036: It informs
! 3037: .Nm
! 3038: that
! 3039: .Dq foo
! 3040: has been derived as a subclass of yyFlexLexer, so
! 3041: .Nm
! 3042: will place actions in the member function
! 3043: .Dq foo::yylex()
1.1 deraadt 3044: instead of
1.16 ! jmc 3045: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3046: It also generates a
1.16 ! jmc 3047: .Dq yyFlexLexer::yylex()
1.1 deraadt 3048: member function that emits a run-time error (by invoking
1.16 ! jmc 3049: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3050: if called.
1.16 ! jmc 3051: See
! 3052: .Sx GENERATING C++ SCANNERS ,
! 3053: below, for additional information.
! 3054: .Pp
! 3055: A number of options are available for
! 3056: .Xr lint 1
! 3057: purists who want to suppress the appearance of unneeded routines
! 3058: in the generated scanner.
! 3059: Each of the following, if unset
1.1 deraadt 3060: (e.g.,
1.16 ! jmc 3061: .Dq %option nounput ) ,
! 3062: results in the corresponding routine not appearing in the generated scanner:
! 3063: .Bd -unfilled -offset indent
! 3064: input, unput
! 3065: yy_push_state, yy_pop_state, yy_top_state
! 3066: yy_scan_buffer, yy_scan_bytes, yy_scan_string
! 3067: .Ed
! 3068: .Pp
1.1 deraadt 3069: (though
1.16 ! jmc 3070: .Fn yy_push_state
! 3071: and friends won't appear anyway unless
! 3072: .Dq %option stack
! 3073: is being used).
! 3074: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3075: The main design goal of
1.16 ! jmc 3076: .Nm
! 3077: is that it generate high-performance scanners.
! 3078: It has been optimized for dealing well with large sets of rules.
! 3079: Aside from the effects on scanner speed of the table compression
! 3080: .Fl C
1.1 deraadt 3081: options outlined above,
1.16 ! jmc 3082: there are a number of options/actions which degrade performance.
! 3083: These are, from most expensive to least:
! 3084: .Bd -unfilled -offset indent
! 3085: REJECT
! 3086: %option yylineno
! 3087: arbitrary trailing context
! 3088:
! 3089: pattern sets that require backing up
! 3090: %array
! 3091: %option interactive
! 3092: %option always-interactive
! 3093:
! 3094: \&'^' beginning-of-line operator
! 3095: yymore()
! 3096: .Ed
! 3097: .Pp
! 3098: with the first three all being quite expensive
! 3099: and the last two being quite cheap.
! 3100: Note also that
! 3101: .Fn unput
! 3102: is implemented as a routine call that potentially does quite a bit of work,
! 3103: while
! 3104: .Fn yyless
! 3105: is a quite-cheap macro; so if just putting back some excess text,
! 3106: use
! 3107: .Fn yyless .
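.Pp
A minimal sketch of putting back excess text with
.Fn yyless ;
the token name
.Dv TOK_INT
is hypothetical and the rule is only an illustration:
.Bd -literal -offset indent
%%
[0-9]+"."    {
             /* keep the digits, return the dot to the input */
             yyless(yyleng - 1);
             return TOK_INT;
             }
.Ed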
! 3108: .Pp
! 3109: .Em REJECT
1.1 deraadt 3110: should be avoided at all costs when performance is important.
3111: It is a particularly expensive option.
1.16 ! jmc 3112: .Pp
1.1 deraadt 3113: Getting rid of backing up is messy and often may be an enormous
1.16 ! jmc 3114: amount of work for a complicated scanner.
! 3115: In principle, one begins by using the
! 3116: .Fl b
1.1 deraadt 3117: flag to generate a
1.16 ! jmc 3118: .Pa lex.backup
! 3119: file.
! 3120: For example, on the input
! 3121: .Bd -literal -offset indent
! 3122: %%
! 3123: foo return TOK_KEYWORD;
! 3124: foobar return TOK_KEYWORD;
! 3125: .Ed
! 3126: .Pp
1.1 deraadt 3127: the file looks like:
1.16 ! jmc 3128: .Bd -literal -offset indent
! 3129: State #6 is non-accepting -
! 3130: associated rule line numbers:
! 3131: 2 3
! 3132: out-transitions: [ o ]
! 3133: jam-transitions: EOF [ \e001-n p-\e177 ]
! 3134:
! 3135: State #8 is non-accepting -
! 3136: associated rule line numbers:
! 3137: 3
! 3138: out-transitions: [ a ]
! 3139: jam-transitions: EOF [ \e001-` b-\e177 ]
! 3140:
! 3141: State #9 is non-accepting -
! 3142: associated rule line numbers:
! 3143: 3
! 3144: out-transitions: [ r ]
! 3145: jam-transitions: EOF [ \e001-q s-\e177 ]
! 3146:
! 3147: Compressed tables always back up.
! 3148: .Ed
! 3149: .Pp
1.1 deraadt 3150: The first few lines tell us that there's a scanner state in
1.16 ! jmc 3151: which it can make a transition on an
! 3152: .Sq o
! 3153: but not on any other character,
! 3154: and that in that state the currently scanned text does not match any rule.
! 3155: The state occurs when trying to match the rules found
1.1 deraadt 3156: at lines 2 and 3 in the input file.
1.16 ! jmc 3157: If the scanner is in that state and then reads something other than an
! 3158: .Sq o ,
! 3159: it will have to back up to find a rule which is matched.
! 3160: With a bit of headscratching one can see that this must be the
! 3161: state it's in when it has seen
! 3162: .Sq fo .
! 3163: When this has happened, if anything other than another
! 3164: .Sq o
! 3165: is seen, the scanner will have to back up to simply match the
! 3166: .Sq f
! 3167: .Pq by the default rule .
! 3168: .Pp
! 3169: The comment regarding State #8 indicates there's a problem when
! 3170: .Qq foob
! 3171: has been scanned.
! 3172: Indeed, on any character other than an
! 3173: .Sq a ,
! 3174: the scanner will have to back up to accept
! 3175: .Qq foo .
! 3176: Similarly, the comment for State #9 concerns when
! 3177: .Qq fooba
! 3178: has been scanned and an
! 3179: .Sq r
! 3180: does not follow.
! 3181: .Pp
1.1 deraadt 3182: The final comment reminds us that there's no point going to
1.16 ! jmc 3183: all the trouble of removing backing up from the rules unless we're using
! 3184: .Fl Cf
1.1 deraadt 3185: or
1.16 ! jmc 3186: .Fl CF ,
1.1 deraadt 3187: since there's no performance gain doing so with compressed scanners.
1.16 ! jmc 3188: .Pp
! 3189: The way to remove the backing up is to add
! 3190: .Qq error
! 3191: rules:
! 3192: .Bd -literal -offset indent
! 3193: %%
! 3194: foo return TOK_KEYWORD;
! 3195: foobar return TOK_KEYWORD;
! 3196:
! 3197: fooba |
! 3198: foob |
! 3199: fo {
! 3200: /* false alarm, not really a keyword */
! 3201: return TOK_ID;
! 3202: }
! 3203: .Ed
! 3204: .Pp
! 3205: Eliminating backing up among a list of keywords can also be done using a
! 3206: .Qq catch-all
! 3207: rule:
! 3208: .Bd -literal -offset indent
! 3209: %%
! 3210: foo return TOK_KEYWORD;
! 3211: foobar return TOK_KEYWORD;
! 3212:
! 3213: [a-z]+ return TOK_ID;
! 3214: .Ed
! 3215: .Pp
1.1 deraadt 3216: This is usually the best solution when appropriate.
1.16 ! jmc 3217: .Pp
1.1 deraadt 3218: Backing up messages tend to cascade.
1.16 ! jmc 3219: With a complicated set of rules it's not uncommon to get hundreds of messages.
! 3220: If one can decipher them, though,
! 3221: it often only takes a dozen or so rules to eliminate the backing up
! 3222: (though it's easy to make a mistake and have an error rule accidentally match
! 3223: a valid token; a possible future
! 3224: .Nm
1.1 deraadt 3225: feature will be to automatically add rules to eliminate backing up).
1.16 ! jmc 3226: .Pp
! 3227: It's important to keep in mind that the benefits of eliminating
! 3228: backing up are gained only if
! 3229: .Em every
! 3230: instance of backing up is eliminated.
! 3231: Leaving just one gains nothing.
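.Pp
In practice this suggests an iterative workflow, sketched here with a
hypothetical specification file named
.Pa scanner.l :
.Bd -literal -offset indent
$ flex -b -Cf scanner.l
$ cat lex.backup
.Ed
.Pp
After each new rule is added, the two commands are repeated until
.Pa lex.backup
no longer lists any non-accepting states.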
! 3232: .Pp
! 3233: .Em Variable
! 3234: trailing context
! 3235: (where both the leading and trailing parts do not have a fixed length)
! 3236: entails almost the same performance loss as
! 3237: .Em REJECT
! 3238: .Pq i.e., substantial .
! 3239: So when possible a rule like:
! 3240: .Bd -literal -offset indent
! 3241: %%
! 3242: mouse|rat/(cat|dog) run();
! 3243: .Ed
! 3244: .Pp
1.1 deraadt 3245: is better written:
1.16 ! jmc 3246: .Bd -literal -offset indent
! 3247: %%
! 3248: mouse/cat|dog run();
! 3249: rat/cat|dog run();
! 3250: .Ed
! 3251: .Pp
1.1 deraadt 3252: or as
1.16 ! jmc 3253: .Bd -literal -offset indent
! 3254: %%
! 3255: mouse|rat/cat run();
! 3256: mouse|rat/dog run();
! 3257: .Ed
! 3258: .Pp
! 3259: Note that here the special
! 3260: .Sq |\&
! 3261: action does not provide any savings, and can even make things worse (see
! 3262: .Sx BUGS
! 3263: below).
! 3264: .Pp
1.1 deraadt 3265: Another area where the user can increase a scanner's performance
1.16 ! jmc 3266: .Pq and one that's easier to implement
! 3267: arises from the fact that the longer the tokens matched,
! 3268: the faster the scanner will run.
1.1 deraadt 3269: This is because with long tokens the processing of most input
1.16 ! jmc 3270: characters takes place in the
! 3271: .Pq short
! 3272: inner scanning loop, and does not often have to go through the additional work
! 3273: of setting up the scanning environment (e.g.,
! 3274: .Fa yytext )
! 3275: for the action.
! 3276: Recall the scanner for C comments:
! 3277: .Bd -literal -offset indent
! 3278: %x comment
! 3279: %%
! 3280: int line_num = 1;
! 3281:
! 3282: "/*" BEGIN(comment);
! 3283:
! 3284: <comment>[^*\en]*
! 3285: <comment>"*"+[^*/\en]*
! 3286: <comment>\en ++line_num;
! 3287: <comment>"*"+"/" BEGIN(INITIAL);
! 3288: .Ed
! 3289: .Pp
1.1 deraadt 3290: This could be sped up by writing it as:
1.16 ! jmc 3291: .Bd -literal -offset indent
! 3292: %x comment
! 3293: %%
! 3294: int line_num = 1;
! 3295:
! 3296: "/*" BEGIN(comment);
! 3297:
! 3298: <comment>[^*\en]*
! 3299: <comment>[^*\en]*\en ++line_num;
! 3300: <comment>"*"+[^*/\en]*
! 3301: <comment>"*"+[^*/\en]*\en ++line_num;
! 3302: <comment>"*"+"/" BEGIN(INITIAL);
! 3303: .Ed
! 3304: .Pp
! 3305: Now instead of each newline requiring the processing of another action,
! 3306: recognizing the newlines is
! 3307: .Qq distributed
! 3308: over the other rules to keep the matched text as long as possible.
! 3309: Note that adding rules does
! 3310: .Em not
! 3311: slow down the scanner!
! 3312: The speed of the scanner is independent of the number of rules or
! 3313: (modulo the considerations given at the beginning of this section)
! 3314: how complicated the rules are with regard to operators such as
! 3315: .Sq *
! 3316: and
! 3317: .Sq |\& .
! 3318: .Pp
! 3319: A final example in speeding up a scanner:
! 3320: scan through a file containing identifiers and keywords, one per line
! 3321: and with no other extraneous characters, and recognize all the keywords.
! 3322: A natural first approach is:
! 3323: .Bd -literal -offset indent
! 3324: %%
! 3325: asm |
! 3326: auto |
! 3327: break |
! 3328: \&... etc ...
! 3329: volatile |
! 3330: while /* it's a keyword */
! 3331:
! 3332: \&.|\en /* it's not a keyword */
! 3333: .Ed
! 3334: .Pp
1.1 deraadt 3335: To eliminate the backing up, introduce a catch-all rule:
1.16 ! jmc 3336: .Bd -literal -offset indent
! 3337: %%
! 3338: asm |
! 3339: auto |
! 3340: break |
! 3341: \&... etc ...
! 3342: volatile |
! 3343: while /* it's a keyword */
! 3344:
! 3345: [a-z]+ |
! 3346: \&.|\en /* it's not a keyword */
! 3347: .Ed
! 3348: .Pp
1.1 deraadt 3349: Now, if it's guaranteed that there's exactly one word per line,
3350: then we can reduce the total number of matches by a half by
1.16 ! jmc 3351: merging in the recognition of newlines with that of the other tokens:
! 3352: .Bd -literal -offset indent
! 3353: %%
! 3354: asm\en |
! 3355: auto\en |
! 3356: break\en |
! 3357: \&... etc ...
! 3358: volatile\en |
! 3359: while\en /* it's a keyword */
! 3360:
! 3361: [a-z]+\en |
! 3362: \&.|\en /* it's not a keyword */
! 3363: .Ed
! 3364: .Pp
! 3365: One has to be careful here,
! 3366: as we have now reintroduced backing up into the scanner.
! 3367: In particular, while we know that there will never be any characters
! 3368: in the input stream other than letters or newlines,
! 3369: .Nm
1.1 deraadt 3370: can't figure this out, and it will plan for possibly needing to back up
1.16 ! jmc 3371: when it has scanned a token like
! 3372: .Qq auto
! 3373: and then the next character is something other than a newline or a letter.
! 3374: Previously it would then just match the
! 3375: .Qq auto
! 3376: rule and be done, but now it has no
! 3377: .Qq auto
! 3378: rule, only an
! 3379: .Qq auto\en
! 3380: rule.
! 3381: To eliminate the possibility of backing up,
1.1 deraadt 3382: we could either duplicate all rules but without final newlines, or,
3383: since we never expect to encounter such an input and therefore don't care
1.16 ! jmc 3384: how it's classified, we can introduce one more catch-all rule,
! 3385: this one which doesn't include a newline:
! 3386: .Bd -literal -offset indent
! 3387: %%
! 3388: asm\en |
! 3389: auto\en |
! 3390: break\en |
! 3391: \&... etc ...
! 3392: volatile\en |
! 3393: while\en /* it's a keyword */
! 3394:
! 3395: [a-z]+\en |
! 3396: [a-z]+ |
! 3397: \&.|\en /* it's not a keyword */
! 3398: .Ed
! 3399: .Pp
1.1 deraadt 3400: Compiled with
1.16 ! jmc 3401: .Fl Cf ,
1.1 deraadt 3402: this is about as fast as one can get a
1.16 ! jmc 3403: .Nm
1.1 deraadt 3404: scanner to go for this particular problem.
1.16 ! jmc 3405: .Pp
1.1 deraadt 3406: A final note:
1.16 ! jmc 3407: .Nm
! 3408: is slow when matching NUL's,
! 3409: particularly when a token contains multiple NUL's.
! 3410: It's best to write rules which match short
1.1 deraadt 3411: amounts of text if it's anticipated that the text will often include NUL's.
1.16 ! jmc 3412: .Pp
1.1 deraadt 3413: Another final note regarding performance: as mentioned above in the section
1.16 ! jmc 3414: .Sx HOW THE INPUT IS MATCHED ,
! 3415: dynamically resizing
! 3416: .Fa yytext
1.1 deraadt 3417: to accommodate huge tokens is a slow process because it presently requires that
1.16 ! jmc 3418: the
! 3419: .Pq huge
! 3420: token be rescanned from the beginning.
! 3421: Thus if performance is vital, it is better to attempt to match
! 3422: .Qq large
! 3423: quantities of text but not
! 3424: .Qq huge
! 3425: quantities, where the cutoff between the two is at about 8K characters/token.
! 3426: .Sh GENERATING C++ SCANNERS
! 3427: .Nm
! 3428: provides two different ways to generate scanners for use with C++.
! 3429: The first way is to simply compile a scanner generated by
! 3430: .Nm
! 3431: using a C++ compiler instead of a C compiler.
! 3432: This should not generate any compilation errors
! 3433: (please report any found to the email address given in the
! 3434: .Sx AUTHORS
! 3435: section below).
! 3436: C++ code can then be used in rule actions instead of C code.
! 3437: Note that the default input source for scanners remains
! 3438: .Fa yyin ,
1.1 deraadt 3439: and default echoing is still done to
1.16 ! jmc 3440: .Fa yyout .
1.1 deraadt 3441: Both of these remain
1.16 ! jmc 3442: .Fa FILE *
! 3443: variables and not C++ streams.
! 3444: .Pp
! 3445: .Nm
! 3446: can also be used to generate a C++ scanner class, using the
! 3447: .Fl +
1.1 deraadt 3448: option (or, equivalently,
1.16 ! jmc 3449: .Dq %option c++ ) ,
! 3450: which is automatically specified if the name of the flex executable ends in a
! 3451: .Sq + ,
! 3452: such as
! 3453: .Nm flex++ .
! 3454: When using this option,
! 3455: .Nm
! 3456: defaults to generating the scanner to the file
! 3457: .Pa lex.yy.cc
1.1 deraadt 3458: instead of
1.16 ! jmc 3459: .Pa lex.yy.c .
1.1 deraadt 3460: The generated scanner includes the header file
1.16 ! jmc 3461: .Aq Pa g++/FlexLexer.h ,
1.1 deraadt 3462: which defines the interface to two C++ classes.
1.16 ! jmc 3463: .Pp
1.1 deraadt 3464: The first class,
1.16 ! jmc 3465: .Em FlexLexer ,
! 3466: provides an abstract base class defining the general scanner class interface.
! 3467: It provides the following member functions:
! 3468: .Bl -tag -width Ds
! 3469: .It const char* YYText()
! 3470: Returns the text of the most recently matched token, the equivalent of
! 3471: .Fa yytext .
! 3472: .It int YYLeng()
! 3473: Returns the length of the most recently matched token, the equivalent of
! 3474: .Fa yyleng .
! 3475: .It int lineno() const
! 3476: Returns the current input line number
1.1 deraadt 3477: (see
1.16 ! jmc 3478: .Dq %option yylineno ) ,
! 3479: or 1 if
! 3480: .Dq %option yylineno
1.1 deraadt 3481: was not used.
1.16 ! jmc 3482: .It void set_debug(int flag)
! 3483: Sets the debugging flag for the scanner, equivalent to assigning to
! 3484: .Fa yy_flex_debug
! 3485: (see the
! 3486: .Sx OPTIONS
! 3487: section above).
! 3488: Note that the scanner must be built using
! 3489: .Dq %option debug
1.1 deraadt 3490: to include debugging information in it.
1.16 ! jmc 3491: .It int debug() const
! 3492: Returns the current setting of the debugging flag.
! 3493: .El
! 3494: .Pp
1.1 deraadt 3495: Also provided are member functions equivalent to
1.16 ! jmc 3496: .Fn yy_switch_to_buffer ,
! 3497: .Fn yy_create_buffer
1.1 deraadt 3498: (though the first argument is an
1.16 ! jmc 3499: .Fa istream*
1.1 deraadt 3500: object pointer and not a
1.16 ! jmc 3501: .Fa FILE* ) ,
! 3502: .Fn yy_flush_buffer ,
! 3503: .Fn yy_delete_buffer ,
1.1 deraadt 3504: and
1.16 ! jmc 3505: .Fn yyrestart
1.10 deraadt 3506: (again, the first argument is an
1.16 ! jmc 3507: .Fa istream*
1.1 deraadt 3508: object pointer).
1.16 ! jmc 3509: .Pp
1.1 deraadt 3510: The second class defined in
1.16 ! jmc 3511: .Aq Pa g++/FlexLexer.h
1.1 deraadt 3512: is
1.16 ! jmc 3513: .Fa yyFlexLexer ,
1.1 deraadt 3514: which is derived from
1.16 ! jmc 3515: .Fa FlexLexer .
1.1 deraadt 3516: It defines the following additional member functions:
1.16 ! jmc 3517: .Bl -tag -width Ds
! 3518: .It "yyFlexLexer(istream* arg_yyin = 0, ostream* arg_yyout = 0)"
! 3519: Constructs a
! 3520: .Fa yyFlexLexer
! 3521: object using the given streams for input and output.
! 3522: If not specified, the streams default to
! 3523: .Fa cin
1.1 deraadt 3524: and
1.16 ! jmc 3525: .Fa cout ,
1.1 deraadt 3526: respectively.
1.16 ! jmc 3527: .It virtual int yylex()
! 3528: Performs the same role as
! 3529: .Fn yylex
1.1 deraadt 3530: does for ordinary flex scanners: it scans the input stream, consuming
1.16 ! jmc 3531: tokens, until a rule's action returns a value.
! 3532: If subclass
! 3533: .Sq S
! 3534: is derived from
! 3535: .Fa yyFlexLexer ,
! 3536: in order to access the member functions and variables of
! 3537: .Sq S
1.1 deraadt 3538: inside
1.16 ! jmc 3539: .Fn yylex ,
! 3540: use
! 3541: .Dq %option yyclass="S"
1.1 deraadt 3542: to inform
1.16 ! jmc 3543: .Nm
! 3544: that the
! 3545: .Sq S
! 3546: subclass will be used instead of
! 3547: .Fa yyFlexLexer .
1.1 deraadt 3548: In this case, rather than generating
1.16 ! jmc 3549: .Dq yyFlexLexer::yylex() ,
! 3550: .Nm
1.1 deraadt 3551: generates
1.16 ! jmc 3552: .Dq S::yylex()
1.1 deraadt 3553: (and also generates a dummy
1.16 ! jmc 3554: .Dq yyFlexLexer::yylex()
1.1 deraadt 3555: that calls
1.16 ! jmc 3556: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3557: if called).
1.16 ! jmc 3558: .It "virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)"
! 3559: Reassigns
! 3560: .Fa yyin
1.1 deraadt 3561: to
1.16 ! jmc 3562: .Fa new_in
! 3563: .Pq if non-nil
1.1 deraadt 3564: and
1.16 ! jmc 3565: .Fa yyout
1.1 deraadt 3566: to
1.16 ! jmc 3567: .Fa new_out
! 3568: .Pq ditto ,
! 3569: deleting the previous input buffer if
! 3570: .Fa yyin
1.1 deraadt 3571: is reassigned.
1.16 ! jmc 3572: .It int yylex(istream* new_in, ostream* new_out = 0)
! 3573: First switches the input streams via
! 3574: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3575: and then returns the value of
1.16 ! jmc 3576: .Fn yylex .
! 3577: .El
! 3578: .Pp
1.1 deraadt 3579: In addition,
1.16 ! jmc 3580: .Fa yyFlexLexer
! 3581: defines the following protected virtual functions which can be redefined
1.1 deraadt 3582: in derived classes to tailor the scanner:
1.16 ! jmc 3583: .Bl -tag -width Ds
! 3584: .It virtual int LexerInput(char* buf, int max_size)
! 3585: Reads up to
! 3586: .Fa max_size
1.1 deraadt 3587: characters into
1.16 ! jmc 3588: .Fa buf
! 3589: and returns the number of characters read.
! 3590: To indicate end-of-input, return 0 characters.
! 3591: Note that
! 3592: .Qq interactive
! 3593: scanners (see the
! 3594: .Fl B
1.1 deraadt 3595: and
1.16 ! jmc 3596: .Fl I
1.1 deraadt 3597: flags) define the macro
1.16 ! jmc 3598: .Dv YY_INTERACTIVE .
! 3599: If
! 3600: .Fn LexerInput
! 3601: has been redefined, and it's necessary to take different actions depending on
! 3602: whether or not the scanner might be scanning an interactive input source,
! 3603: it's possible to test for the presence of this name via
! 3604: .Dq #ifdef .
! 3605: .It virtual void LexerOutput(const char* buf, int size)
! 3606: Writes out
! 3607: .Fa size
1.1 deraadt 3608: characters from the buffer
1.16 ! jmc 3609: .Fa buf ,
! 3610: which, while NUL-terminated, may also contain
! 3611: .Qq internal
! 3612: NUL's if the scanner's rules can match text with NUL's in them.
! 3613: .It virtual void LexerError(const char* msg)
! 3614: Reports a fatal error message.
! 3615: The default version of this function writes the message to the stream
! 3616: .Fa cerr
1.1 deraadt 3617: and exits.
1.16 ! jmc 3618: .El
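.Pp
As a sketch of tailoring the scanner through these hooks, the following
hypothetical subclass reads its input from a string held in memory instead
of a stream.
The class name
.Fa StringLexer
and its members are assumptions introduced for the example, and the
.Fn LexerInput
signature is the one listed above:
.Bd -literal -offset indent
%option c++ noyywrap

%{
#include <string.h>

class StringLexer : public yyFlexLexer {
public:
        StringLexer(const char* s) : src(s), left((int)strlen(s)) {}
protected:
        virtual int LexerInput(char* buf, int max_size)
        {
                int n = left < max_size ? left : max_size;

                memcpy(buf, src, n);
                src += n;
                left -= n;
                return n;       /* 0 signals end-of-input */
        }
private:
        const char* src;        /* next unread character */
        int left;               /* characters still unread */
};
%}
.Ed
.Pp
Since
.Fn LexerInput
is virtual, the inherited
.Fn yylex
picks up the override without any further change.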
! 3619: .Pp
1.1 deraadt 3620: Note that a
1.16 ! jmc 3621: .Fa yyFlexLexer
! 3622: object contains its entire scanning state.
! 3623: Thus such objects can be used to create reentrant scanners.
! 3624: Multiple instances of the same
! 3625: .Fa yyFlexLexer
! 3626: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3627: in the same program using the
1.16 ! jmc 3628: .Fl P
1.1 deraadt 3629: option discussed above.
1.16 ! jmc 3630: .Pp
1.1 deraadt 3631: Finally, note that the
1.16 ! jmc 3632: .Dq %array
! 3633: feature is not available to C++ scanner classes;
! 3634: .Dq %pointer
! 3635: must be used
! 3636: .Pq the default .
! 3637: .Pp
1.1 deraadt 3638: Here is an example of a simple C++ scanner:
1.16 ! jmc 3639: .Bd -literal -offset indent
! 3640: // An example of using the flex C++ scanner class.
1.1 deraadt 3641:
1.16 ! jmc 3642: %{
! 3643: #include <errno.h>
! 3644: int mylineno = 0;
! 3645: %}
1.1 deraadt 3646:
1.16 ! jmc 3647: string \e"[^\en"]+\e"
1.1 deraadt 3648:
1.16 ! jmc 3649: ws [ \et]+
1.1 deraadt 3650:
1.16 ! jmc 3651: alpha [A-Za-z]
! 3652: dig [0-9]
! 3653: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
! 3654: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
! 3655: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
! 3656: number {num1}|{num2}
1.1 deraadt 3657:
1.16 ! jmc 3658: %%
1.1 deraadt 3659:
1.16 ! jmc 3660: {ws} /* skip blanks and tabs */
1.1 deraadt 3661:
1.16 ! jmc 3662: "/*" {
! 3663: int c;
1.1 deraadt 3664:
1.16 ! jmc 3665: while ((c = yyinput()) != 0) {
! 3666: if(c == '\en')
1.1 deraadt 3667: ++mylineno;
1.16 ! jmc 3668: else if(c == '*') {
! 3669: if ((c = yyinput()) == '/')
1.1 deraadt 3670: break;
3671: else
3672: unput(c);
3673: }
1.16 ! jmc 3674: }
! 3675: }
1.1 deraadt 3676:
1.16 ! jmc 3677: {number} cout << "number " << YYText() << '\en';
1.1 deraadt 3678:
1.16 ! jmc 3679: \en mylineno++;
1.1 deraadt 3680:
1.16 ! jmc 3681: {name} cout << "name " << YYText() << '\en';
1.1 deraadt 3682:
1.16 ! jmc 3683: {string} cout << "string " << YYText() << '\en';
! 3684:
! 3685: %%
! 3686:
! 3687: int main(int /* argc */, char** /* argv */)
! 3688: {
! 3689: FlexLexer* lexer = new yyFlexLexer;
! 3690: while(lexer->yylex() != 0)
! 3691: ;
! 3692: return 0;
! 3693: }
! 3694: .Ed
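.Pp
A derived class can be wired in through
.Dq %option yyclass
along the same lines.
The following is only a sketch: the class name
.Fa CountingLexer
and its
.Fa words
member are hypothetical, and it assumes, as in the example above, that
.Fa cout
is visible without qualification:
.Bd -literal -offset indent
%option c++ noyywrap yyclass="CountingLexer"

%{
class CountingLexer : public yyFlexLexer {
public:
        CountingLexer() : words(0) {}
        virtual int yylex();    /* body generated by flex */
        int words;
};
%}

%%

[A-Za-z]+    ++words;
\&.|\en        /* ignore everything else */

%%

int main()
{
        CountingLexer lexer;
        lexer.yylex();
        cout << lexer.words << " words\en";
        return 0;
}
.Ed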
! 3695: .Pp
! 3696: To create multiple
! 3697: .Pq different
! 3698: lexer classes, use the
! 3699: .Fl P
! 3700: flag
! 3701: (or the
! 3702: .Dq prefix=
! 3703: option)
! 3704: to rename each
! 3705: .Fa yyFlexLexer
1.1 deraadt 3706: to some other
1.16 ! jmc 3707: .Fa xxFlexLexer .
! 3708: .Aq Pa g++/FlexLexer.h
! 3709: can then be included in other sources once per lexer class, first renaming
! 3710: .Fa yyFlexLexer
1.1 deraadt 3711: as follows:
1.16 ! jmc 3712: .Bd -literal -offset indent
! 3713: #undef yyFlexLexer
! 3714: #define yyFlexLexer xxFlexLexer
! 3715: #include <g++/FlexLexer.h>
! 3716:
! 3717: #undef yyFlexLexer
! 3718: #define yyFlexLexer zzFlexLexer
! 3719: #include <g++/FlexLexer.h>
! 3720: .Ed
! 3721: .Pp
! 3722: This applies if, for example,
! 3723: .Dq %option prefix="xx"
! 3724: is used for one scanner and
! 3725: .Dq %option prefix="zz"
! 3726: is used for the other.
! 3727: .Pp
! 3728: .Sy IMPORTANT :
! 3729: the present form of the scanning class is experimental
1.7 aaron 3730: and may change considerably between major releases.
1.16 ! jmc 3731: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
! 3732: .Nm
1.1 deraadt 3733: is a rewrite of the AT&T Unix
1.16 ! jmc 3734: .Nm lex
! 3735: tool
! 3736: (the two implementations do not share any code, though),
! 3737: with some extensions and incompatibilities, both of which are of concern
! 3738: to those who wish to write scanners acceptable to either implementation.
! 3739: .Nm
! 3740: is fully compliant with the
! 3741: .Tn POSIX
! 3742: .Nm lex
1.1 deraadt 3743: specification, except that when using
1.16 ! jmc 3744: .Dq %pointer
! 3745: .Pq the default ,
! 3746: a call to
! 3747: .Fn unput
1.1 deraadt 3748: destroys the contents of
1.16 ! jmc 3749: .Fa yytext ,
! 3750: which is counter to the
! 3751: .Tn POSIX
! 3752: specification.
! 3753: .Pp
! 3754: In this section we discuss all of the known areas of incompatibility between
! 3755: .Nm ,
! 3756: AT&T
! 3757: .Nm lex ,
! 3758: and the
! 3759: .Tn POSIX
! 3760: specification.
! 3761: .Pp
! 3762: .Nm flex Ns 's
! 3763: .Fl l
1.1 deraadt 3764: option turns on maximum compatibility with the original AT&T
1.16 ! jmc 3765: .Nm lex
1.1 deraadt 3766: implementation, at the cost of a major loss in the generated scanner's
1.16 ! jmc 3767: performance.
! 3768: We note below which incompatibilities can be overcome using the
! 3769: .Fl l
1.1 deraadt 3770: option.
1.16 ! jmc 3771: .Pp
! 3772: .Nm
1.1 deraadt 3773: is fully compatible with
1.16 ! jmc 3774: .Nm lex
1.1 deraadt 3775: with the following exceptions:
1.16 ! jmc 3776: .Bl -dash
! 3777: .It
1.1 deraadt 3778: The undocumented
1.16 ! jmc 3779: .Nm lex
1.1 deraadt 3780: scanner internal variable
1.16 ! jmc 3781: .Fa yylineno
1.1 deraadt 3782: is not supported unless
1.16 ! jmc 3783: .Fl l
1.1 deraadt 3784: or
1.16 ! jmc 3785: .Dq %option yylineno
1.1 deraadt 3786: is used.
1.16 ! jmc 3787: .Pp
! 3788: .Fa yylineno
1.1 deraadt 3789: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 ! jmc 3790: .Pq single global variable
! 3791: basis.
! 3792: .Pp
! 3793: .Fa yylineno
! 3794: is not part of the
! 3795: .Tn POSIX
! 3796: specification.
! 3797: .It
1.1 deraadt 3798: The
1.16 ! jmc 3799: .Fn input
1.1 deraadt 3800: routine is not redefinable, though it may be called to read characters
1.16 ! jmc 3801: following whatever has been matched by a rule.
! 3802: If
! 3803: .Fn input
! 3804: encounters an end-of-file, the normal
! 3805: .Fn yywrap
! 3806: processing is done.
! 3807: A
! 3808: .Dq real
! 3809: end-of-file is returned by
! 3810: .Fn input
1.1 deraadt 3811: as
1.16 ! jmc 3812: .Dv EOF .
! 3813: .Pp
1.1 deraadt 3814: Input is instead controlled by defining the
1.16 ! jmc 3815: .Dv YY_INPUT
1.1 deraadt 3816: macro.
1.16 ! jmc 3817: .Pp
1.1 deraadt 3818: The
1.16 ! jmc 3819: .Nm
1.1 deraadt 3820: restriction that
1.16 ! jmc 3821: .Fn input
! 3822: cannot be redefined is in accordance with the
! 3823: .Tn POSIX
! 3824: specification, which simply does not specify any way of controlling the
1.1 deraadt 3825: scanner's input other than by making an initial assignment to
1.16 ! jmc 3826: .Fa yyin .
! 3827: .It
1.1 deraadt 3828: The
1.16 ! jmc 3829: .Fn unput
! 3830: routine is not redefinable.
! 3831: This restriction is in accordance with
! 3832: .Tn POSIX .
! 3833: .It
! 3834: .Nm
1.1 deraadt 3835: scanners are not as reentrant as
1.16 ! jmc 3836: .Nm lex
! 3837: scanners.
! 3838: In particular, if a scanner is interactive and
! 3839: an interrupt handler long-jumps out of the scanner,
! 3840: and the scanner is subsequently called again,
! 3841: the following error message may be displayed:
! 3842: .Pp
! 3843: .D1 fatal flex scanner internal error--end of buffer missed
! 3844: .Pp
1.1 deraadt 3845: To reenter the scanner, first use
1.16 ! jmc 3846: .Pp
! 3847: .Dl yyrestart(yyin);
! 3848: .Pp
! 3849: Note that this call will throw away any buffered input;
! 3850: usually this isn't a problem with an interactive scanner.
! 3851: .Pp
! 3852: Also note that flex C++ scanner classes are reentrant,
! 3853: so if using C++ is an option, they should be used instead.
! 3854: See
! 3855: .Sx GENERATING C++ SCANNERS
! 3856: above for details.
! 3857: .It
! 3858: .Fn output
1.1 deraadt 3859: is not supported.
3860: Output from the
1.16 ! jmc 3861: .Em ECHO
1.1 deraadt 3862: macro is done to the file-pointer
1.16 ! jmc 3863: .Fa yyout
! 3864: .Pq default stdout .
! 3865: .Pp
! 3866: .Fn output
! 3867: is not part of the
! 3868: .Tn POSIX
! 3869: specification.
! 3870: .It
! 3871: .Nm lex
! 3872: does not support exclusive start conditions
! 3873: .Pq %x ,
! 3874: though they are in the
! 3875: .Tn POSIX
! 3876: specification.
! 3877: .It
1.1 deraadt 3878: When definitions are expanded,
1.16 ! jmc 3879: .Nm
1.1 deraadt 3880: encloses them in parentheses.
1.16 ! jmc 3881: With
! 3882: .Nm lex ,
! 3883: the following:
! 3884: .Bd -literal -offset indent
! 3885: NAME [A-Z][A-Z0-9]*
! 3886: %%
! 3887: foo{NAME}? printf("Found it\en");
! 3888: %%
! 3889: .Ed
! 3890: .Pp
! 3891: will not match the string
! 3892: .Qq foo
! 3893: because when the macro is expanded the rule is equivalent to
! 3894: .Qq foo[A-Z][A-Z0-9]*?
! 3895: and the precedence is such that the
! 3896: .Sq ?\&
! 3897: is associated with
! 3898: .Qq [A-Z0-9]* .
! 3899: With
! 3900: .Nm ,
1.1 deraadt 3901: the rule will be expanded to
1.16 ! jmc 3902: .Qq foo([A-Z][A-Z0-9]*)?
! 3903: and so the string
! 3904: .Qq foo
! 3905: will match.
! 3906: .Pp
1.1 deraadt 3907: Note that if the definition begins with
1.16 ! jmc 3908: .Sq ^
1.1 deraadt 3909: or ends with
1.16 ! jmc 3910: .Sq $
! 3911: then it is not expanded with parentheses, to allow these operators to appear in
! 3912: definitions without losing their special meanings.
! 3913: But the
! 3914: .Sq Aq s ,
! 3915: .Sq / ,
1.1 deraadt 3916: and
1.16 ! jmc 3917: .Aq Aq EOF
1.1 deraadt 3918: operators cannot be used in a
1.16 ! jmc 3919: .Nm
1.1 deraadt 3920: definition.
1.16 ! jmc 3921: .Pp
1.1 deraadt 3922: Using
1.16 ! jmc 3923: .Fl l
1.1 deraadt 3924: results in the
1.16 ! jmc 3925: .Nm lex
1.1 deraadt 3926: behavior of no parentheses around the definition.
1.16 ! jmc 3927: .Pp
! 3928: The
! 3929: .Tn POSIX
! 3930: specification is that the definition be enclosed in parentheses.
! 3931: .It
1.1 deraadt 3932: Some implementations of
1.16 ! jmc 3933: .Nm lex
! 3934: allow a rule's action to begin on a separate line,
! 3935: if the rule's pattern has trailing whitespace:
! 3936: .Bd -literal -offset indent
! 3937: %%
! 3938: foo|bar<space here>
! 3939: { foobar_action(); }
! 3940: .Ed
! 3941: .Pp
! 3942: .Nm
1.1 deraadt 3943: does not support this feature.
1.16 ! jmc 3944: .It
1.1 deraadt 3945: The
1.16 ! jmc 3946: .Nm lex
! 3947: .Sq %r
! 3948: .Pq generate a Ratfor scanner
! 3949: option is not supported.
! 3950: It is not part of the
! 3951: .Tn POSIX
! 3952: specification.
! 3953: .It
1.1 deraadt 3954: After a call to
1.16 ! jmc 3955: .Fn unput ,
! 3956: .Fa yytext
! 3957: is undefined until the next token is matched,
! 3958: unless the scanner was built using
! 3959: .Dq %array .
1.1 deraadt 3960: This is not the case with
1.16 ! jmc 3961: .Nm lex
! 3962: or the
! 3963: .Tn POSIX
! 3964: specification.
! 3965: The
! 3966: .Fl l
1.1 deraadt 3967: option does away with this incompatibility.
1.16 ! jmc 3968: .It
1.1 deraadt 3969: The precedence of the
1.16 ! jmc 3970: .Sq {}
! 3971: .Pq numeric range
! 3972: operator is different.
! 3973: .Nm lex
! 3974: interprets
! 3975: .Qq abc{1,3}
! 3976: as match one, two, or three occurrences of
! 3977: .Sq abc ,
! 3978: whereas
! 3979: .Nm
! 3980: interprets it as match
! 3981: .Sq ab
! 3982: followed by one, two, or three occurrences of
! 3983: .Sq c .
! 3984: The latter is in agreement with the
! 3985: .Tn POSIX
! 3986: specification.
! 3987: .It
1.1 deraadt 3988: The precedence of the
1.16 ! jmc 3989: .Sq ^
1.1 deraadt 3990: operator is different.
1.16 ! jmc 3991: .Nm lex
! 3992: interprets
! 3993: .Qq ^foo|bar
! 3994: as match either
! 3995: .Sq foo
! 3996: at the beginning of a line, or
! 3997: .Sq bar
! 3998: anywhere, whereas
! 3999: .Nm
! 4000: interprets it as match either
! 4001: .Sq foo
! 4002: or
! 4003: .Sq bar
! 4004: if they come at the beginning of a line.
! 4005: The latter is in agreement with the
! 4006: .Tn POSIX
! 4007: specification.
! 4008: .It
1.1 deraadt 4009: The special table-size declarations such as
1.16 ! jmc 4010: .Sq %a
1.1 deraadt 4011: supported by
1.16 ! jmc 4012: .Nm lex
1.1 deraadt 4013: are not required by
1.16 ! jmc 4014: .Nm
1.1 deraadt 4015: scanners;
1.16 ! jmc 4016: .Nm
1.1 deraadt 4017: ignores them.
1.16 ! jmc 4018: .It
1.1 deraadt 4019: The name
1.16 ! jmc 4020: .Dv FLEX_SCANNER
1.1 deraadt 4021: is #define'd so scanners may be written for use with either
1.16 ! jmc 4022: .Nm
1.1 deraadt 4023: or
1.16 ! jmc 4024: .Nm lex .
1.1 deraadt 4025: Scanners also include
1.16 ! jmc 4026: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4027: and
1.16 ! jmc 4028: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4029: indicating which version of
1.16 ! jmc 4030: .Nm
1.1 deraadt 4031: generated the scanner
1.16 ! jmc 4032: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4033: respectively).
1.16 ! jmc 4034: .El
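.Pp
As a minimal sketch of the last item above, the following C fragment tests
.Dv FLEX_SCANNER
so that the same source can be built with either generator.
The function names
.Fn built_by_flex
and
.Fn generator_version
are hypothetical and are not provided by
.Nm .
In a real scanner the fragment would normally sit in the definitions section,
between %{ and %}.
Compiled on its own, outside a generated scanner, it takes the non-flex branch.

```c
#include <stdio.h>

/* Return 1 when this file is part of a flex-generated scanner, else 0. */
int
built_by_flex(void)
{
#ifdef FLEX_SCANNER
	return 1;
#else
	return 0;
#endif
}

/*
 * Format the generator name and, for flex, its version into buf.
 * Returns the snprintf() result: the length the full string needs.
 */
int
generator_version(char *buf, size_t len)
{
#ifdef FLEX_SCANNER
	return snprintf(buf, len, "flex %d.%d",
	    YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION);
#else
	return snprintf(buf, len, "lex");
#endif
}
```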
! 4035: .Pp
1.1 deraadt 4036: The following
1.16 ! jmc 4037: .Nm
1.1 deraadt 4038: features are not included in
1.16 ! jmc 4039: .Nm lex
! 4040: or the
! 4041: .Tn POSIX
! 4042: specification:
! 4043: .Bd -unfilled -offset indent
! 4044: C++ scanners
! 4045: %option
! 4046: start condition scopes
! 4047: start condition stacks
! 4048: interactive/non-interactive scanners
! 4049: yy_scan_string() and friends
! 4050: yyterminate()
! 4051: yy_set_interactive()
! 4052: yy_set_bol()
! 4053: YY_AT_BOL()
! 4054: <<EOF>>
! 4055: <*>
! 4056: YY_DECL
! 4057: YY_START
! 4058: YY_USER_ACTION
! 4059: YY_USER_INIT
! 4060: #line directives
! 4061: %{}'s around actions
! 4062: multiple actions on a line
! 4063: .Ed
! 4064: .Pp
! 4065: plus almost all of the
! 4066: .Nm
! 4067: flags.
1.1 deraadt 4068: The last feature in the list refers to the fact that with
1.16 ! jmc 4069: .Nm
! 4070: multiple actions can be placed on the same line,
! 4071: separated with semicolons, while with
! 4072: .Nm lex ,
1.1 deraadt 4073: the following
1.16 ! jmc 4074: .Pp
! 4075: .Dl foo handle_foo(); ++num_foos_seen;
! 4076: .Pp
! 4077: is
! 4078: .Pq rather surprisingly
! 4079: truncated to
! 4080: .Pp
! 4081: .Dl foo handle_foo();
! 4082: .Pp
! 4083: .Nm
! 4084: does not truncate the action.
! 4085: Actions that are not enclosed in braces
! 4086: are simply terminated at the end of the line.
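.Pp
The definition-expansion difference described earlier in this section
can be checked outside of any scanner.
The sketch below is not how
.Nm
or
.Nm lex
match input; it only uses the
.Tn POSIX
.Fn regcomp
and
.Fn regexec
routines to compare the two expansions of
.Qq foo{NAME}? .
The names
.Dv LEX_EXPANSION ,
.Dv FLEX_EXPANSION ,
and
.Fn whole_match
are assumptions made for this illustration,
and the lex expansion is written with its grouping made explicit.

```c
#include <regex.h>
#include <stdio.h>

/* lex: the '?' binds only to the trailing [A-Z0-9]* */
#define LEX_EXPANSION	"foo[A-Z]([A-Z0-9]*)?"
/* flex: the whole definition is parenthesized before '?' applies */
#define FLEX_EXPANSION	"foo([A-Z][A-Z0-9]*)?"

/*
 * Return 1 if pattern matches all of text, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile.
 */
int
whole_match(const char *pattern, const char *text)
{
	char anchored[256];
	regex_t re;
	int n, rc;

	/* Anchor the pattern so that only a match of the full text counts. */
	n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return -1;
	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Under these assumptions, whole_match(LEX_EXPANSION, "foo") is 0 and
whole_match(FLEX_EXPANSION, "foo") is 1, while both accept "fooBAR1".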
! 4087: .Sh FILES
! 4088: .Bl -tag -width "<g++/FlexLexer.h>"
! 4089: .It flex.skl
! 4090: Skeleton scanner.
! 4091: This file is only used when building flex, not when
! 4092: .Nm
! 4093: executes.
! 4094: .It lex.backup
! 4095: Backing-up information for the
! 4096: .Fl b
! 4097: flag (called
! 4098: .Pa lex.bck
! 4099: on some systems).
! 4100: .It lex.yy.c
! 4101: Generated scanner
! 4102: (called
! 4103: .Pa lexyy.c
! 4104: on some systems).
! 4105: .It lex.yy.cc
! 4106: Generated C++ scanner class, when using
! 4107: .Fl + .
! 4108: .It Aq g++/FlexLexer.h
! 4109: Header file defining the C++ scanner base class,
! 4110: .Fa FlexLexer ,
! 4111: and its derived class,
! 4112: .Fa yyFlexLexer .
! 4113: .It /usr/lib/libl.*
! 4114: .Nm
! 4115: libraries.
! 4116: The
! 4117: .Pa /usr/lib/libfl.*\&
! 4118: libraries are links to these.
! 4119: Scanners must be linked using either
! 4120: .Fl \&ll
! 4121: or
! 4122: .Fl lfl .
! 4123: .El
! 4124: .Sh DIAGNOSTICS
! 4125: .Bl -diag
! 4126: .It warning, rule cannot be matched
! 4127: Indicates that the given rule cannot be matched because it follows other rules
! 4128: that will always match the same text as it.
! 4129: For example, in the following
! 4130: .Dq foo
! 4131: cannot be matched because it comes after an identifier
! 4132: .Qq catch-all
! 4133: rule:
! 4134: .Bd -literal -offset indent
! 4135: [a-z]+ got_identifier();
! 4136: foo got_foo();
! 4137: .Ed
! 4138: .Pp
1.1 deraadt 4139: Using
1.16 ! jmc 4140: .Em REJECT
1.1 deraadt 4141: in a scanner suppresses this warning.
1.16 ! jmc 4142: .It "warning, \-s option given but default rule can be matched"
! 4143: Means that it is possible
! 4144: .Pq perhaps only in a particular start condition
! 4145: that the default rule
! 4146: .Pq match any single character
! 4147: is the only one that will match a particular input.
! 4148: Since
! 4149: .Fl s
1.1 deraadt 4150: was given, presumably this is not intended.
1.16 ! jmc 4151: .It reject_used_but_not_detected undefined
! 4152: .It yymore_used_but_not_detected undefined
! 4153: These errors can occur at compile time.
! 4154: They indicate that the scanner uses
! 4155: .Em REJECT
1.1 deraadt 4156: or
1.16 ! jmc 4157: .Fn yymore
1.1 deraadt 4158: but that
1.16 ! jmc 4159: .Nm
1.1 deraadt 4160: failed to notice the fact, meaning that
1.16 ! jmc 4161: .Nm
1.1 deraadt 4162: scanned the first two sections looking for occurrences of these actions
1.16 ! jmc 4163: and failed to find any, but somehow they snuck in
! 4164: .Pq via an #include file, for example .
! 4165: Use
! 4166: .Dq %option reject
! 4167: or
! 4168: .Dq %option yymore
! 4169: to indicate to
! 4170: .Nm
! 4171: that these features are really needed.
! 4172: .It flex scanner jammed
! 4173: A scanner compiled with
! 4174: .Fl s
! 4175: has encountered an input string which wasn't matched by any of its rules.
! 4176: This error can also occur due to internal problems.
! 4177: .It token too large, exceeds YYLMAX
! 4178: The scanner uses
! 4179: .Dq %array
1.1 deraadt 4180: and one of its rules matched a string longer than the
1.16 ! jmc 4181: .Dv YYLMAX
! 4182: constant
! 4183: .Pq 8K bytes by default .
! 4184: The value can be increased by #define'ing
! 4185: .Dv YYLMAX
! 4186: in the definitions section of
! 4187: .Nm
1.1 deraadt 4188: input.
1.16 ! jmc 4189: .It "scanner requires \-8 flag to use the character 'x'"
! 4190: The scanner specification includes recognizing the 8-bit character
! 4191: .Sq x
! 4192: and the
! 4193: .Fl 8
! 4194: flag was not specified, and defaulted to 7-bit because the
! 4195: .Fl Cf
! 4196: or
! 4197: .Fl CF
! 4198: table compression options were used.
! 4199: See the discussion of the
! 4200: .Fl 7
1.1 deraadt 4201: flag for details.
1.16 ! jmc 4202: .It flex scanner push-back overflow
! 4203: unput() was used to push back so much text that the scanner's buffer
! 4204: could not hold both the pushed-back text and the current token in
! 4205: .Fa yytext .
! 4206: Ideally the scanner should dynamically resize the buffer in this case,
! 4207: but at present it does not.
! 4208: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
! 4209: The scanner was working on matching an extremely large token and needed
! 4210: to expand the input buffer.
! 4211: This doesn't work with scanners that use
! 4212: .Em REJECT .
! 4213: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4214: This can occur in a scanner which is reentered after a long-jump
1.16 ! jmc 4215: has jumped out
! 4216: .Pq or over
! 4217: the scanner's activation frame.
! 4218: Before reentering the scanner, use:
! 4219: .Pp
! 4220: .Dl yyrestart(yyin);
! 4221: .Pp
1.1 deraadt 4222: or, as noted above, switch to using the C++ scanner class.
1.16 ! jmc 4223: .It "too many start conditions in <> construct!"
! 4224: More start conditions than exist were listed in a <> construct
! 4225: (so at least one of them must have been listed twice).
! 4226: .El
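.Pp
The
.Qq rule cannot be matched
warning follows from the way a rule is chosen:
the longest match wins, and the earliest rule wins a tie.
The sketch below simulates that choice with the
.Tn POSIX
.Fn regexec
routine; a generated scanner uses a DFA instead, so this is only an
illustration.
The names
.Fn match_length ,
.Fn pick_rule ,
.Va catch_all_first ,
and
.Va keyword_first
are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Two orderings of the same rules, written as anchored POSIX EREs. */
static const char *const catch_all_first[] = { "^[a-z]+", "^foo" };
static const char *const keyword_first[] = { "^foo", "^[a-z]+" };

/* Length of the match of pattern at the start of text, or -1 if none. */
static long
match_length(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	long len = -1;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	if (regexec(&re, text, 1, &m, 0) == 0)
		len = (long)m.rm_eo;
	regfree(&re);
	return len;
}

/*
 * Return the index of the rule that would be chosen for text:
 * the longest match, or the earliest rule on a tie; -1 if none match.
 */
int
pick_rule(const char *const rules[], size_t nrules, const char *text)
{
	long best_len = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < nrules; i++) {
		long len = match_length(rules[i], text);

		/* Strictly greater: an earlier rule keeps a tie. */
		if (len > best_len) {
			best_len = len;
			best = (int)i;
		}
	}
	return best;
}
```

With catch_all_first, every input that the second rule could match is also
matched, at the same length, by the first rule, so the second rule is never
chosen.
With keyword_first, "foo" selects rule 0 and "food" selects rule 1.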
! 4227: .Sh SEE ALSO
! 4228: .Xr awk 1 ,
! 4229: .Xr lex 1 ,
! 4230: .Xr sed 1 ,
! 4231: .Xr yacc 1
! 4232: .Pp
! 4233: .Rs
! 4234: .%A John Levine
! 4235: .%A Tony Mason
! 4236: .%A Doug Brown
! 4237: .%B Lex & Yacc
! 4238: .%I O'Reilly and Associates
! 4239: .%N 2nd edition
! 4240: .Re
! 4241: .Rs
! 4242: .%A M. E. Lesk
! 4243: .%A E. Schmidt
! 4244: .%B LEX \- Lexical Analyzer Generator
! 4245: .Re
! 4246: .Rs
! 4247: .%A Alfred Aho
! 4248: .%A Ravi Sethi
! 4249: .%A Jeffrey Ullman
! 4250: .%B Compilers: Principles, Techniques and Tools
! 4251: .%I Addison-Wesley
! 4252: .%D 1986
! 4253: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
! 4254: .Re
! 4255: .Sh AUTHORS
1.1 deraadt 4256: Vern Paxson, with the help of many ideas and much inspiration from
1.16 ! jmc 4257: Van Jacobson.
! 4258: Original version by Jef Poskanzer.
! 4259: The fast table representation is a partial implementation of a design done by
! 4260: Van Jacobson.
! 4261: The implementation was done by Kevin Gong and Vern Paxson.
! 4262: .Pp
1.1 deraadt 4263: Thanks to the many
1.16 ! jmc 4264: .Nm
1.1 deraadt 4265: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4266: Casey Leedom,
4267: Robert Abramovitz,
4268: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4269: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4270: Karl Berry, Peter A. Bigot, Simon Blanchard,
4271: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4272: Brian Clapper, J.T. Conklin,
4273: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4274: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4275: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4276: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4277: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4278: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4279: Jan Hajic, Charles Hemphill, NORO Hideo,
4280: Jarkko Hietaniemi, Scott Hofmann,
4281: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4282: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4283: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4284: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4285: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4286: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4287: David Loffredo, Mike Long,
4288: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4289: Bengt Martensson, Chris Metcalf,
4290: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4291: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4292: Richard Ohnemus, Karsten Pahnke,
1.16 ! jmc 4293: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
! 4294: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4295: Frederic Raimbault, Pat Rankin, Rick Richardson,
4296: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4297: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4298: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4299: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4300: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4301: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 ! jmc 4302: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
! 4303: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
! 4304: and those whose names have slipped my marginal mail-archiving skills
! 4305: but whose contributions are appreciated all the
1.1 deraadt 4306: same.
1.16 ! jmc 4307: .Pp
1.1 deraadt 4308: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4309: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4310: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4311: distribution headaches.
1.16 ! jmc 4312: .Pp
! 4313: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
! 4314: to Benson Margulies and Fred Burke for C++ support;
! 4315: to Kent Williams and Tom Epperly for C++ class support;
! 4316: to Ove Ewerlid for support of NUL's;
! 4317: and to Eric Hughes for support of multiple buffers.
! 4318: .Pp
1.1 deraadt 4319: This work was primarily done when I was with the Real Time Systems Group
1.16 ! jmc 4320: at the Lawrence Berkeley Laboratory in Berkeley, CA.
! 4321: Many thanks to all there for the support I received.
! 4322: .Pp
! 4323: Send comments to
! 4324: .Aq vern@ee.lbl.gov .
! 4325: .Sh BUGS
! 4326: Some trailing context patterns cannot be properly matched and generate
! 4327: warning messages
! 4328: .Pq "dangerous trailing context" .
! 4329: These are patterns where the ending of the first part of the rule
! 4330: matches the beginning of the second part, such as
! 4331: .Qq zx*/xy* ,
! 4332: where the
! 4333: .Sq x*
! 4334: matches the
! 4335: .Sq x
! 4336: at the beginning of the trailing context.
! 4337: (Note that the POSIX draft states that the text matched by such patterns
! 4338: is undefined.)
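.Pp
To see why such a pattern is ambiguous, the following sketch counts the
positions at which an input could be divided between
.Qq zx*
and the trailing context
.Qq xy* .
It is hand-written for this one pattern and is not part of
.Nm ;
the name
.Fn count_splits
is hypothetical.
A result greater than one means that more than one division is possible.

```c
#include <stddef.h>

/*
 * Count the positions at which input could be divided between the
 * pattern zx* and its trailing context xy*.
 */
int
count_splits(const char *input)
{
	size_t k;
	int splits = 0;

	if (input[0] != 'z')
		return 0;
	for (k = 0; ; k++) {
		/* Here input[0..k] is 'z' followed by k 'x' characters. */
		if (input[k + 1] == 'x')
			splits++;	/* context xy* can start at k + 1 */
		else
			break;		/* no 'x' left for either part */
	}
	return splits;
}
```

For the input "zxxy" the result is 2: the text may be split as "z" followed
by "xxy", or as "zx" followed by "xy".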
! 4339: .Pp
! 4340: For some trailing context rules, parts which are actually fixed-length are
! 4341: not recognized as such, leading to the above-mentioned performance loss.
! 4342: In particular, parts using
! 4343: .Sq |\&
! 4344: or
! 4345: .Sq {n}
! 4346: (such as
! 4347: .Qq foo{3} )
! 4348: are always considered variable-length.
! 4349: .Pp
! 4350: Combining trailing context with the special
! 4351: .Sq |\&
! 4352: action can result in fixed trailing context being turned into
! 4353: the more expensive variable trailing context.
! 4354: This happens, for example, in the following:
! 4355: .Bd -literal -offset indent
! 4356: %%
! 4357: abc |
! 4358: xyz/def
! 4359: .Ed
! 4360: .Pp
! 4361: Use of
! 4362: .Fn unput
! 4363: invalidates yytext and yyleng, unless the
! 4364: .Dq %array
! 4365: directive
! 4366: or the
! 4367: .Fl l
! 4368: option has been used.
! 4369: .Pp
! 4370: Pattern-matching of NUL's is substantially slower than matching other
! 4371: characters.
! 4372: .Pp
! 4373: Dynamic resizing of the input buffer is slow, as it entails rescanning
! 4374: all the text matched so far by the current
! 4375: .Pq generally huge
! 4376: token.
! 4377: .Pp
! 4378: Due to both buffering of input and read-ahead,
! 4379: it is not possible to intermix calls to
! 4380: .Aq Pa stdio.h
! 4381: routines, such as
! 4382: .Fn getchar ,
! 4383: with
! 4384: .Nm
! 4385: rules and expect it to work.
! 4386: Call
! 4387: .Fn input
! 4388: instead.
! 4389: .Pp
! 4390: The total number of table entries listed by the
! 4391: .Fl v
! 4392: flag excludes the number of table entries needed to determine
! 4393: what rule has been matched.
! 4394: The number of entries is equal to the number of DFA states
! 4395: if the scanner does not use
! 4396: .Em REJECT ,
! 4397: and somewhat greater than the number of states if it does.
! 4398: .Pp
! 4399: .Em REJECT
! 4400: cannot be used with the
! 4401: .Fl f
! 4402: or
! 4403: .Fl F
! 4404: options.
! 4405: .Pp
! 4406: The
! 4407: .Nm
! 4408: internal algorithms need documentation.