Annotation of src/usr.bin/lex/flex.1, Revision 1.43
1.43 ! jmc 1: .\" $OpenBSD: flex.1,v 1.42 2015/09/21 09:24:13 nicm Exp $
1.16 jmc 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 jmc 31: .\"
1.43 ! jmc 32: .Dd $Mdocdate: September 21 2015 $
1.16 jmc 33: .Dt FLEX 1
34: .Os
35: .Sh NAME
1.42 nicm 36: .Nm flex ,
37: .Nm flex++ ,
38: .Nm lex
1.16 jmc 39: .Nd fast lexical analyzer generator
40: .Sh SYNOPSIS
41: .Nm
1.28 jmc 42: .Bk -words
1.31 jmc 43: .Op Fl 78BbdFfhIiLlnpsTtVvw+?
1.16 jmc 44: .Op Fl C Ns Op Cm aeFfmr
45: .Op Fl Fl help
46: .Op Fl Fl version
1.28 jmc 47: .Op Fl o Ns Ar output
48: .Op Fl P Ns Ar prefix
49: .Op Fl S Ns Ar skeleton
50: .Op Ar
51: .Ek
1.21 jmc 52: .Sh DESCRIPTION
53: .Nm
54: is a tool for generating
55: .Em scanners :
56: programs which recognize lexical patterns in text.
57: .Nm
58: reads the given input files, or its standard input if no file names are given,
59: for a description of a scanner to generate.
60: The description is in the form of pairs of regular expressions and C code,
61: called
62: .Em rules .
63: .Nm
64: generates as output a C source file,
65: .Pa lex.yy.c ,
66: which defines a routine
67: .Fn yylex .
68: This file is compiled and linked with the
69: .Fl lfl
70: library to produce an executable.
71: When the executable is run, it analyzes its input for occurrences
72: of the regular expressions.
73: Whenever it finds one, it executes the corresponding C code.
1.42 nicm 74: .Pp
75: .Nm lex
76: is a synonym for
77: .Nm flex .
78: .Nm flex++
79: is a synonym for
80: .Nm
81: .Fl + .
1.21 jmc 82: .Pp
1.16 jmc 83: The manual includes both tutorial and reference sections:
84: .Bl -ohang
85: .It Sy Some Simple Examples
86: .It Sy Format of the Input File
87: .It Sy Patterns
88: The extended regular expressions used by
89: .Nm .
90: .It Sy How the Input is Matched
91: The rules for determining what has been matched.
92: .It Sy Actions
93: How to specify what to do when a pattern is matched.
94: .It Sy The Generated Scanner
95: Details regarding the scanner that
96: .Nm
97: produces;
98: how to control the input source.
99: .It Sy Start Conditions
100: Introducing context into scanners, and managing
101: .Qq mini-scanners .
102: .It Sy Multiple Input Buffers
103: How to manipulate multiple input sources;
104: how to scan from strings instead of files.
105: .It Sy End-of-File Rules
106: Special rules for matching the end of the input.
107: .It Sy Miscellaneous Macros
108: A summary of macros available to the actions.
109: .It Sy Values Available to the User
110: A summary of values available to the actions.
111: .It Sy Interfacing with Yacc
112: Connecting flex scanners together with
113: .Xr yacc 1
114: parsers.
115: .It Sy Options
116: .Nm
117: command-line options, and the
118: .Dq %option
119: directive.
120: .It Sy Performance Considerations
121: How to make scanners go as fast as possible.
122: .It Sy Generating C++ Scanners
123: The
124: .Pq experimental
125: facility for generating C++ scanner classes.
126: .It Sy Incompatibilities with Lex and POSIX
127: How
128: .Nm
1.36 schwarze 129: differs from
130: .At
131: .Nm lex
132: and the
1.16 jmc 133: .Tn POSIX
1.36 schwarze 134: .Nm lex
135: standard.
1.16 jmc 136: .It Sy Files
137: Files used by
138: .Nm .
139: .It Sy Diagnostics
140: Those error messages produced by
141: .Nm
142: .Pq or scanners it generates
143: whose meanings might not be apparent.
144: .It Sy See Also
145: Other documentation, related tools.
146: .It Sy Authors
147: Includes contact information.
148: .It Sy Bugs
149: Known problems with
150: .Nm .
151: .El
152: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 153: First some simple examples to get the flavor of how one uses
1.16 jmc 154: .Nm .
1.1 deraadt 155: The following
1.16 jmc 156: .Nm
1.1 deraadt 157: input specifies a scanner which, whenever it encounters the string
1.16 jmc 158: .Qq username ,
159: will replace it with the user's login name:
160: .Bd -literal -offset indent
161: %%
162: username printf("%s", getlogin());
163: .Ed
164: .Pp
1.1 deraadt 165: By default, any text not matched by a
1.16 jmc 166: .Nm
167: scanner is copied to the output, so the net effect of this scanner is
168: to copy its input file to its output with each occurrence of
169: .Qq username
170: expanded.
171: In this input, there is just one rule.
172: .Qq username
173: is the
174: .Em pattern
175: and the
176: .Qq printf
177: is the
178: .Em action .
179: The
180: .Qq %%
181: marks the beginning of the rules.
182: .Pp
1.1 deraadt 183: Here's another simple example:
1.16 jmc 184: .Bd -literal -offset indent
1.20 pvalchev 185: %{
1.16 jmc 186: int num_lines = 0, num_chars = 0;
1.20 pvalchev 187: %}
1.1 deraadt 188:
1.16 jmc 189: %%
190: \en ++num_lines; ++num_chars;
191: \&. ++num_chars;
192:
193: %%
194: int main(void)
195: {
196: yylex();
197: printf("# of lines = %d, # of chars = %d\en",
198: num_lines, num_chars);
199: }
200: .Ed
201: .Pp
1.1 deraadt 202: This scanner counts the number of characters and the number
1.16 jmc 203: of lines in its input
204: (it produces no output other than the final report on the counts).
205: The first line declares two globals,
206: .Qq num_lines
207: and
208: .Qq num_chars ,
209: which are accessible both inside
210: .Fn yylex
1.1 deraadt 211: and in the
1.16 jmc 212: .Fn main
213: routine declared after the second
214: .Qq %% .
215: There are two rules, one which matches a newline
216: .Pq \&"\en\&"
217: and increments both the line count and the character count,
218: and one which matches any character other than a newline
219: (indicated by the
220: .Qq \&.
221: regular expression).
222: .Pp
1.1 deraadt 223: A somewhat more complicated example:
1.16 jmc 224: .Bd -literal -offset indent
225: /* scanner for a toy Pascal-like language */
1.1 deraadt 226:
1.16 jmc 227: %{
228: /* need this for the calls to atoi() and atof() below */
229: #include <stdlib.h>
230: %}
1.1 deraadt 231:
1.16 jmc 232: DIGIT [0-9]
233: ID [a-z][a-z0-9]*
1.1 deraadt 234:
1.16 jmc 235: %%
1.1 deraadt 236:
1.16 jmc 237: {DIGIT}+ {
238: printf("An integer: %s (%d)\en", yytext,
239: atoi(yytext));
240: }
1.1 deraadt 241:
1.16 jmc 242: {DIGIT}+"."{DIGIT}* {
243: printf("A float: %s (%g)\en", yytext,
244: atof(yytext));
245: }
1.1 deraadt 246:
1.16 jmc 247: if|then|begin|end|procedure|function {
248: printf("A keyword: %s\en", yytext);
249: }
1.1 deraadt 250:
1.16 jmc 251: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 252:
1.16 jmc 253: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 254:
1.16 jmc 255: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 256:
1.16 jmc 257: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 258:
1.16 jmc 259: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 260:
1.16 jmc 261: %%
1.1 deraadt 262:
1.16 jmc 263: int main(int argc, char *argv[])
264: {
265: ++argv; --argc; /* skip over program name */
266: if (argc > 0)
267: yyin = fopen(argv[0], "r");
1.1 deraadt 268: else
269: yyin = stdin;
1.7 aaron 270:
1.1 deraadt 271: yylex();
1.16 jmc 272: }
273: .Ed
274: .Pp
275: This is the beginning of a simple scanner for a language like Pascal.
276: It identifies different types of
277: .Em tokens
1.1 deraadt 278: and reports on what it has seen.
1.16 jmc 279: .Pp
280: The details of this example will be explained in the following sections.
281: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 282: The
1.16 jmc 283: .Nm
1.1 deraadt 284: input file consists of three sections, separated by a line with just
1.16 jmc 285: .Qq %%
1.1 deraadt 286: in it:
1.16 jmc 287: .Bd -unfilled -offset indent
288: definitions
289: %%
290: rules
291: %%
292: user code
293: .Ed
294: .Pp
1.1 deraadt 295: The
1.16 jmc 296: .Em definitions
1.1 deraadt 297: section contains declarations of simple
1.16 jmc 298: .Em name
1.1 deraadt 299: definitions to simplify the scanner specification, and declarations of
1.16 jmc 300: .Em start conditions ,
1.1 deraadt 301: which are explained in a later section.
1.16 jmc 302: .Pp
1.1 deraadt 303: Name definitions have the form:
1.16 jmc 304: .Pp
305: .D1 name definition
306: .Pp
307: The
308: .Qq name
309: is a word beginning with a letter or an underscore
310: .Pq Sq _
311: followed by zero or more letters, digits,
312: .Sq _ ,
313: or
314: .Sq -
315: .Pq dash .
1.8 aaron 316: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 317: following the name and continuing to the end of the line.
1.16 jmc 318: The definition can subsequently be referred to using
319: .Qq {name} ,
320: which will expand to
321: .Qq (definition) .
322: For example:
323: .Bd -literal -offset indent
324: DIGIT [0-9]
325: ID [a-z][a-z0-9]*
326: .Ed
327: .Pp
328: This defines
329: .Qq DIGIT
330: to be a regular expression which matches a single digit, and
331: .Qq ID
332: to be a regular expression which matches a letter
1.1 deraadt 333: followed by zero-or-more letters-or-digits.
334: A subsequent reference to
1.16 jmc 335: .Pp
336: .Dl {DIGIT}+"."{DIGIT}*
337: .Pp
1.1 deraadt 338: is identical to
1.16 jmc 339: .Pp
340: .Dl ([0-9])+"."([0-9])*
341: .Pp
342: and matches one-or-more digits followed by a
343: .Sq .\&
344: followed by zero-or-more digits.
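The substitution of a parenthesized definition for each
.Qq {name}
reference can be pictured as plain text replacement.
The following C fragment is only a sketch of that idea, not the code
.Nm
itself uses: the function
.Fn expand_names
and its fixed two-entry table are hypothetical, and for simplicity it does not treat quoted strings or character classes specially.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name table, for illustration only. */
struct name_def {
	const char *name;
	const char *definition;
};

static const struct name_def defs[] = {
	{ "DIGIT", "[0-9]" },
	{ "ID", "[a-z][a-z0-9]*" },
};

/*
 * Copy pattern into out, replacing each {name} found in defs with
 * (definition).  Any other {...} sequence, such as a repeat count,
 * is copied unchanged.
 * Returns 0 on success, -1 if out is too small.
 */
int
expand_names(const char *pattern, char *out, size_t outsize)
{
	size_t used = 0;

	if (outsize == 0)
		return -1;
	while (*pattern != '\0') {
		const char *repl = NULL;
		size_t skip = 1, len = 1, i;

		if (*pattern == '{') {
			const char *close = strchr(pattern, '}');

			for (i = 0; close != NULL &&
			    i < sizeof(defs) / sizeof(defs[0]); i++) {
				size_t n = strlen(defs[i].name);

				if ((size_t)(close - pattern - 1) == n &&
				    strncmp(pattern + 1, defs[i].name, n) == 0) {
					repl = defs[i].definition;
					skip = n + 2;		/* '{' name '}' */
					len = strlen(repl) + 2;	/* '(' def ')' */
				}
			}
		}
		/* Always leave room for the terminating NUL. */
		if (used + len >= outsize)
			return -1;
		if (repl != NULL) {
			out[used] = '(';
			memcpy(out + used + 1, repl, len - 2);
			out[used + len - 1] = ')';
		} else
			out[used] = *pattern;
		used += len;
		pattern += skip;
	}
	out[used] = '\0';
	return 0;
}
```

The added parentheses are what make a following operator such as
.Sq +
apply to the whole definition rather than to its last element.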
345: .Pp
1.1 deraadt 346: The
1.16 jmc 347: .Em rules
1.1 deraadt 348: section of the
1.16 jmc 349: .Nm
1.1 deraadt 350: input contains a series of rules of the form:
1.16 jmc 351: .Pp
1.35 schwarze 352: .Dl pattern action
1.16 jmc 353: .Pp
354: The pattern must be unindented and the action must begin
1.1 deraadt 355: on the same line.
1.16 jmc 356: .Pp
1.1 deraadt 357: See below for a further description of patterns and actions.
1.16 jmc 358: .Pp
1.1 deraadt 359: Finally, the user code section is simply copied to
1.16 jmc 360: .Pa lex.yy.c
1.1 deraadt 361: verbatim.
1.16 jmc 362: It is used for companion routines which call or are called by the scanner.
363: The presence of this section is optional;
1.1 deraadt 364: if it is missing, the second
1.16 jmc 365: .Qq %%
366: in the input file may be skipped too.
367: .Pp
368: In the definitions and rules sections, any indented text or text enclosed in
369: .Sq %{
1.1 deraadt 370: and
1.16 jmc 371: .Sq %}
372: is copied verbatim to the output
373: .Pq with the %{}'s removed .
1.1 deraadt 374: The %{}'s must appear unindented on lines by themselves.
1.16 jmc 375: .Pp
1.1 deraadt 376: In the rules section,
1.16 jmc 377: any indented or %{} text appearing before the first rule may be used to
378: declare variables which are local to the scanning routine and
379: .Pq after the declarations
1.1 deraadt 380: code which is to be executed whenever the scanning routine is entered.
381: Other indented or %{} text in the rule section is still copied to the output,
382: but its meaning is not well-defined and it may well cause compile-time
383: errors (this feature is present for
1.16 jmc 384: .Tn POSIX
1.1 deraadt 385: compliance; see below for other such features).
1.16 jmc 386: .Pp
387: In the definitions section
388: .Pq but not in the rules section ,
389: an unindented comment
390: (i.e., a line beginning with
391: .Qq /* )
392: is also copied verbatim to the output up to the next
393: .Qq */ .
394: .Sh PATTERNS
1.1 deraadt 395: The patterns in the input are written using an extended set of regular
1.16 jmc 396: expressions.
397: These are:
398: .Bl -tag -width "XXXXXXXX"
399: .It x
400: Match the character
401: .Sq x .
402: .It .\&
403: Any character
404: .Pq byte
405: except newline.
406: .It [xyz]
407: A
408: .Qq character class ;
409: in this case, the pattern matches either an
410: .Sq x ,
411: a
412: .Sq y ,
413: or a
414: .Sq z .
415: .It [abj-oZ]
416: A
417: .Qq character class
418: with a range in it; matches an
419: .Sq a ,
420: a
421: .Sq b ,
422: any letter from
423: .Sq j
424: through
425: .Sq o ,
426: or a
427: .Sq Z .
428: .It [^A-Z]
429: A
430: .Qq negated character class ,
431: i.e., any character but those in the class.
432: In this case, any character EXCEPT an uppercase letter.
433: .It [^A-Z\en]
434: Any character EXCEPT an uppercase letter or a newline.
435: .It r*
436: Zero or more r's, where
437: .Sq r
438: is any regular expression.
439: .It r+
440: One or more r's.
441: .It r?
442: Zero or one r's (that is,
443: .Qq an optional r ) .
444: .It r{2,5}
445: Anywhere from two to five r's.
446: .It r{2,}
447: Two or more r's.
448: .It r{4}
449: Exactly 4 r's.
450: .It {name}
451: The expansion of the
452: .Qq name
453: definition
454: .Pq see above .
455: .It \&"[xyz]\e\&"foo\&"
456: The literal string: [xyz]"foo.
457: .It \eX
458: If
459: .Sq X
460: is an
461: .Sq a ,
462: .Sq b ,
463: .Sq f ,
464: .Sq n ,
465: .Sq r ,
466: .Sq t ,
467: or
468: .Sq v ,
469: then the ANSI-C interpretation of
470: .Sq \eX .
471: Otherwise, a literal
472: .Sq X
473: (used to escape operators such as
474: .Sq * ) .
475: .It \e0
476: A NUL character
477: .Pq ASCII code 0 .
478: .It \e123
479: The character with octal value 123.
480: .It \ex2a
481: The character with hexadecimal value 2a.
482: .It (r)
483: Match an
484: .Sq r ;
485: parentheses are used to override precedence
486: .Pq see below .
487: .It rs
488: The regular expression
489: .Sq r
490: followed by the regular expression
491: .Sq s ;
492: called
493: .Qq concatenation .
494: .It r|s
495: Either an
496: .Sq r
497: or an
498: .Sq s .
499: .It r/s
500: An
501: .Sq r ,
502: but only if it is followed by an
503: .Sq s .
504: The text matched by
505: .Sq s
506: is included when determining whether this rule is the
507: .Qq longest match ,
508: but is then returned to the input before the action is executed.
509: So the action only sees the text matched by
510: .Sq r .
511: This type of pattern is called
512: .Qq trailing context .
513: (There are some combinations of r/s that
514: .Nm
515: cannot match correctly; see notes in the
516: .Sx BUGS
517: section below regarding
518: .Qq dangerous trailing context . )
519: .It ^r
520: An
521: .Sq r ,
522: but only at the beginning of a line
523: (i.e., just starting to scan, or right after a newline has been scanned).
524: .It r$
525: An
526: .Sq r ,
527: but only at the end of a line
528: .Pq i.e., just before a newline .
529: Equivalent to
530: .Qq r/\en .
531: .Pp
532: Note that
533: .Nm flex Ns 's
534: notion of
535: .Qq newline
536: is exactly whatever the C compiler used to compile
537: .Nm
538: interprets
539: .Sq \en
540: as.
541: .\" In particular, on some DOS systems you must either filter out \er's in the
542: .\" input yourself, or explicitly use r/\er\en for
543: .\" .Qq r$ .
544: .It <s>r
545: An
546: .Sq r ,
547: but only in start condition
548: .Sq s
549: .Pq see below for discussion of start conditions .
550: .It <s1,s2,s3>r
551: The same, but in any of start conditions s1, s2, or s3.
552: .It <*>r
553: An
554: .Sq r
555: in any start condition, even an exclusive one.
556: .It <<EOF>>
557: An end-of-file.
558: .It <s1,s2><<EOF>>
559: An end-of-file when in start condition s1 or s2.
560: .El
561: .Pp
1.1 deraadt 562: Note that inside of a character class, all regular expression operators
1.16 jmc 563: lose their special meaning except escape
564: .Pq Sq \e
565: and the character class operators,
566: .Sq - ,
567: .Sq ]\& ,
568: and, at the beginning of the class,
569: .Sq ^ .
570: .Pp
1.1 deraadt 571: The regular expressions listed above are grouped according to
572: precedence, from highest precedence at the top to lowest at the bottom.
1.16 jmc 573: Those grouped together have equal precedence.
574: For example,
575: .Pp
576: .D1 foo|bar*
577: .Pp
1.1 deraadt 578: is the same as
1.16 jmc 579: .Pp
580: .D1 (foo)|(ba(r*))
581: .Pp
582: since the
583: .Sq *
584: operator has higher precedence than concatenation,
585: and concatenation higher than alternation
586: .Pq Sq |\& .
587: This pattern therefore matches
588: .Em either
589: the string
590: .Qq foo
591: .Em or
592: the string
593: .Qq ba
594: followed by zero-or-more r's.
595: To match
596: .Qq foo
597: or zero-or-more "bar"'s,
598: use:
599: .Pp
600: .D1 foo|(bar)*
601: .Pp
1.1 deraadt 602: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 jmc 603: .Pp
604: .D1 (foo|bar)*
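These groupings can be checked outside of
.Nm
with POSIX extended regular expressions, which give repetition, concatenation and alternation the same relative precedence.
The sketch below assumes a system providing
.Xr regcomp 3 ;
the helper
.Fn full_match
is hypothetical and is not part of
.Nm .

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if pattern is invalid
 * or does not fit in the local buffer.
 */
int
full_match(const char *pattern, const char *text)
{
	char anchored[256];
	regex_t re;
	int n, rc;

	/* Wrap in ^( )$ so that only a match of the entire text counts. */
	n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return -1;
	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With this helper, the pattern foo|bar* accepts
.Qq barrr
but rejects
.Qq barbar ,
foo|(bar)* accepts
.Qq barbar
but rejects
.Qq foobar ,
and (foo|bar)* accepts
.Qq foobar .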
605: .Pp
1.1 deraadt 606: In addition to characters and ranges of characters, character classes
607: can also contain character class
1.16 jmc 608: .Em expressions .
1.1 deraadt 609: These are expressions enclosed inside
1.16 jmc 610: .Sq [:
611: and
612: .Sq :]
613: delimiters (which themselves must appear between the
1.26 schwarze 614: .Sq \&[
1.1 deraadt 615: and
1.16 jmc 616: .Sq ]\&
617: of the
1.1 deraadt 618: character class; other elements may occur inside the character class, too).
619: The valid expressions are:
1.16 jmc 620: .Bd -unfilled -offset indent
621: [:alnum:] [:alpha:] [:blank:]
622: [:cntrl:] [:digit:] [:graph:]
623: [:lower:] [:print:] [:punct:]
624: [:space:] [:upper:] [:xdigit:]
625: .Ed
626: .Pp
1.1 deraadt 627: These expressions all designate a set of characters equivalent to
628: the corresponding standard C
1.16 jmc 629: .Fn isXXX
630: function.
631: For example, [:alnum:] designates those characters for which
632: .Xr isalnum 3
633: returns true \- i.e., any alphabetic or numeric.
1.1 deraadt 634: Some systems don't provide
1.16 jmc 635: .Xr isblank 3 ,
636: so
637: .Nm
638: defines [:blank:] as a blank or a tab.
639: .Pp
1.1 deraadt 640: For example, the following character classes are all equivalent:
1.16 jmc 641: .Bd -unfilled -offset indent
642: [[:alnum:]]
643: [[:alpha:][:digit:]]
644: [[:alpha:]0-9]
645: [a-zA-Z0-9]
646: .Ed
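The equivalence of these four classes can be checked directly against the
.Fn isXXX
functions.
The following sketch assumes the default C locale and an ASCII character set; the function names are hypothetical.

```c
#include <ctype.h>

/* Membership test for the explicit class [a-zA-Z0-9] (ASCII assumed). */
static int
in_explicit_class(int c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	    (c >= '0' && c <= '9');
}

/*
 * Return the number of byte values 0..255 on which the four class
 * definitions disagree; 0 means they describe the same set.
 */
int
count_class_mismatches(void)
{
	int c, mismatches = 0;

	for (c = 0; c < 256; c++) {
		/* [[:alnum:]] */
		int alnum = isalnum(c) != 0;
		/* [[:alpha:][:digit:]] */
		int alpha_digit = isalpha(c) != 0 || isdigit(c) != 0;
		/* [[:alpha:]0-9] */
		int alpha_range = isalpha(c) != 0 || (c >= '0' && c <= '9');

		if (alnum != alpha_digit || alnum != alpha_range ||
		    alnum != in_explicit_class(c))
			mismatches++;
	}
	return mismatches;
}
```

In other locales
.Xr isalpha 3
may accept additional characters, in which case the last form would no longer agree with the first three.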
647: .Pp
648: If the scanner is case-insensitive (the
649: .Fl i
650: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
651: .Pp
1.1 deraadt 652: Some notes on patterns:
1.16 jmc 653: .Bl -dash
654: .It
655: A negated character class such as the example
656: .Qq [^A-Z]
657: above will match a newline unless "\en"
658: .Pq or an equivalent escape sequence
659: is one of the characters explicitly present in the negated character class
660: (e.g.,
661: .Qq [^A-Z\en] ) .
662: This is unlike how many other regular expression tools treat negated character
663: classes, but unfortunately the inconsistency is historically entrenched.
664: Matching newlines means that a pattern like
665: .Qq [^"]*
666: can match the entire input unless there's another quote in the input.
667: .It
668: A rule can have at most one instance of trailing context
669: (the
670: .Sq /
671: operator or the
672: .Sq $
673: operator).
674: The start condition,
675: .Sq ^ ,
676: and
677: .Qq <<EOF>>
1.40 jmc 678: patterns can only occur at the beginning of a pattern and, like
1.16 jmc 679: .Sq /
680: and
681: .Sq $ ,
682: cannot be grouped inside parentheses.
683: A
684: .Sq ^
685: which does not occur at the beginning of a rule or a
686: .Sq $
687: which does not occur at the end of a rule loses its special properties
688: and is treated as a normal character.
689: .It
1.1 deraadt 690: The following are illegal:
1.16 jmc 691: .Bd -unfilled -offset indent
692: foo/bar$
693: <sc1>foo<sc2>bar
694: .Ed
695: .Pp
696: Note that the first of these can be written
697: .Qq foo/bar\en .
698: .It
699: The following will result in
700: .Sq $
701: or
702: .Sq ^
703: being treated as a normal character:
704: .Bd -unfilled -offset indent
705: foo|(bar$)
706: foo|^bar
707: .Ed
708: .Pp
709: If what's wanted is a
710: .Qq foo
711: or a bar-followed-by-a-newline, the following could be used
712: (the special
713: .Sq |\&
714: action is explained below):
715: .Bd -unfilled -offset indent
716: foo |
717: bar$ /* action goes here */
718: .Ed
719: .Pp
1.1 deraadt 720: A similar trick will work for matching a foo or a
721: bar-at-the-beginning-of-a-line.
1.16 jmc 722: .El
723: .Sh HOW THE INPUT IS MATCHED
724: When the generated scanner is run,
725: it analyzes its input looking for strings which match any of its patterns.
726: If it finds more than one match,
727: it takes the one matching the most text
728: (for trailing context rules, this includes the length of the trailing part,
729: even though it will then be returned to the input).
730: If it finds two or more matches of the same length,
731: the rule listed first in the
732: .Nm
1.1 deraadt 733: input file is chosen.
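The selection rule just described, longest match first and earliest rule on a tie, can be sketched in C as follows.
This is an illustration only, not how the generated scanner is implemented: the two hand-written matchers stand in for the rules
.Qq if
and
.Qq [a-z][a-z0-9]* ,
and all of the names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A rule reports how many leading bytes of s it matches (0 for none). */
typedef size_t (*matcher)(const char *s);

/* Rule 0: the literal keyword "if". */
static size_t
match_if(const char *s)
{
	return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Rule 1: an identifier, [a-z][a-z0-9]* (ASCII assumed). */
static size_t
match_id(const char *s)
{
	size_t n = 0;

	if (s[0] < 'a' || s[0] > 'z')
		return 0;
	while ((s[n] >= 'a' && s[n] <= 'z') || (s[n] >= '0' && s[n] <= '9'))
		n++;
	return n;
}

/* Rules in the order they would appear in the input file. */
static const matcher rules[] = { match_if, match_id };

/*
 * Return the index of the rule chosen at the start of input and store
 * the length of its match in *len.  Returns -1 with *len set to 0 if
 * no rule matches.
 */
int
select_rule(const char *input, size_t *len)
{
	int best = -1;
	size_t i;

	*len = 0;
	for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		size_t n = rules[i](input);

		/* Strictly longer only, so an earlier rule wins a tie. */
		if (n > *len) {
			best = (int)i;
			*len = n;
		}
	}
	return best;
}
```

On the input
.Qq if
both rules match two characters and the keyword rule is chosen because it is listed first; on
.Qq iffy
the identifier rule matches four characters and wins on length.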
1.16 jmc 734: .Pp
1.1 deraadt 735: Once the match is determined, the text corresponding to the match
736: (called the
1.16 jmc 737: .Em token )
1.1 deraadt 738: is made available in the global character pointer
1.16 jmc 739: .Fa yytext ,
1.1 deraadt 740: and its length in the global integer
1.16 jmc 741: .Fa yyleng .
1.1 deraadt 742: The
1.16 jmc 743: .Em action
744: corresponding to the matched pattern is then executed
745: .Pq a more detailed description of actions follows ,
746: and then the remaining input is scanned for another match.
747: .Pp
748: If no match is found, then the default rule is executed:
749: the next character in the input is considered matched and
750: copied to the standard output.
751: Thus, the simplest legal
752: .Nm
1.1 deraadt 753: input is:
1.16 jmc 754: .Pp
755: .D1 %%
756: .Pp
757: which generates a scanner that simply copies its input
758: .Pq one character at a time
759: to its output.
760: .Pp
1.1 deraadt 761: Note that
1.16 jmc 762: .Fa yytext
763: can be defined in two different ways:
764: either as a character pointer or as a character array.
765: Which definition
766: .Nm
767: uses can be controlled by including one of the special directives
768: .Dq %pointer
769: or
770: .Dq %array
771: in the first
772: .Pq definitions
773: section of flex input.
774: The default is
775: .Dq %pointer ,
776: unless the
777: .Fl l
1.36 schwarze 778: .Nm lex
779: compatibility option is used, in which case
1.16 jmc 780: .Fa yytext
1.1 deraadt 781: will be an array.
782: The advantage of using
1.16 jmc 783: .Dq %pointer
1.1 deraadt 784: is substantially faster scanning and no buffer overflow when matching
1.16 jmc 785: very large tokens
786: .Pq unless not enough dynamic memory is available .
787: The disadvantage is that actions are restricted in how they can modify
788: .Fa yytext
789: .Pq see the next section ,
790: and calls to the
791: .Fn unput
1.10 deraadt 792: function destroy the present contents of
1.16 jmc 793: .Fa yytext ,
1.1 deraadt 794: which can be a considerable porting headache when moving between different
1.16 jmc 795: .Nm lex
1.1 deraadt 796: versions.
1.16 jmc 797: .Pp
1.1 deraadt 798: The advantage of
1.16 jmc 799: .Dq %array
800: is that
801: .Fa yytext
802: can be modified as much as wanted, and calls to
803: .Fn unput
1.1 deraadt 804: do not destroy
1.16 jmc 805: .Fa yytext
806: .Pq see below .
807: Furthermore, existing
808: .Nm lex
1.1 deraadt 809: programs sometimes access
1.16 jmc 810: .Fa yytext
1.1 deraadt 811: externally using declarations of the form:
1.16 jmc 812: .Pp
813: .D1 extern char yytext[];
814: .Pp
1.1 deraadt 815: This definition is erroneous when used with
1.16 jmc 816: .Dq %pointer ,
1.1 deraadt 817: but correct for
1.16 jmc 818: .Dq %array .
819: .Pp
820: .Dq %array
1.1 deraadt 821: defines
1.16 jmc 822: .Fa yytext
1.1 deraadt 823: to be an array of
1.16 jmc 824: .Dv YYLMAX
825: characters, which defaults to a fairly large value.
826: The size can be changed by simply #define'ing
827: .Dv YYLMAX
828: to a different value in the first section of
829: .Nm
830: input.
831: As mentioned above, with
832: .Dq %pointer
833: yytext grows dynamically to accommodate large tokens.
834: While this means a
835: .Dq %pointer
836: scanner can accommodate very large tokens
837: .Pq such as matching entire blocks of comments ,
838: bear in mind that each time the scanner must resize
839: .Fa yytext
1.1 deraadt 840: it also must rescan the entire token from the beginning, so matching such
841: tokens can prove slow.
1.16 jmc 842: .Fa yytext
843: presently does not dynamically grow if a call to
844: .Fn unput
1.1 deraadt 845: results in too much text being pushed back; instead, a run-time error results.
1.16 jmc 846: .Pp
847: Also note that
848: .Dq %array
849: cannot be used with C++ scanner classes
850: .Pq the c++ option; see below .
851: .Sh ACTIONS
852: Each pattern in a rule has a corresponding action,
853: which can be any arbitrary C statement.
854: The pattern ends at the first non-escaped whitespace character;
855: the remainder of the line is its action.
856: If the action is empty,
857: then when the pattern is matched the input token is simply discarded.
858: For example, here is the specification for a program
859: which deletes all occurrences of
860: .Qq zap me
861: from its input:
862: .Bd -literal -offset indent
863: %%
864: "zap me"
865: .Ed
866: .Pp
1.1 deraadt 867: (It will copy all other characters in the input to the output since
868: they will be matched by the default rule.)
1.16 jmc 869: .Pp
1.1 deraadt 870: Here is a program which compresses multiple blanks and tabs down to
871: a single blank, and throws away whitespace found at the end of a line:
1.16 jmc 872: .Bd -literal -offset indent
873: %%
874: [ \et]+ putchar(' ');
875: [ \et]+$ /* ignore this token */
876: .Ed
877: .Pp
878: If the action contains a
879: .Sq { ,
880: then the action spans till the balancing
881: .Sq }
1.1 deraadt 882: is found, and the action may cross multiple lines.
1.16 jmc 883: .Nm
1.1 deraadt 884: knows about C strings and comments and won't be fooled by braces found
885: within them, but also allows actions to begin with
1.16 jmc 886: .Sq %{
1.1 deraadt 887: and will consider the action to be all the text up to the next
1.16 jmc 888: .Sq %}
889: .Pq regardless of ordinary braces inside the action .
890: .Pp
891: An action consisting solely of a vertical bar
892: .Pq Sq |\&
893: means
894: .Qq same as the action for the next rule .
895: See below for an illustration.
896: .Pp
897: Actions can include arbitrary C code,
898: including return statements to return a value to whatever routine called
899: .Fn yylex .
1.1 deraadt 900: Each time
1.16 jmc 901: .Fn yylex
902: is called, it continues processing tokens from where it last left off
903: until it either reaches the end of the file or executes a return.
904: .Pp
1.1 deraadt 905: Actions are free to modify
1.16 jmc 906: .Fa yytext
907: except for lengthening it
908: (adding characters to its end \- these will overwrite later characters in the
909: input stream).
910: This, however, does not apply when using
911: .Dq %array
912: .Pq see above ;
913: in that case,
914: .Fa yytext
1.1 deraadt 915: may be freely modified in any way.
1.16 jmc 916: .Pp
1.1 deraadt 917: Actions are free to modify
1.16 jmc 918: .Fa yyleng
1.1 deraadt 919: except they should not do so if the action also includes use of
1.16 jmc 920: .Fn yymore
921: .Pq see below .
922: .Pp
1.1 deraadt 923: There are a number of special directives which can be included within
924: an action:
1.16 jmc 925: .Bl -tag -width Ds
926: .It ECHO
927: Copies
928: .Fa yytext
929: to the scanner's output.
930: .It BEGIN
931: Followed by the name of a start condition, places the scanner in the
932: corresponding start condition
933: .Pq see below .
934: .It REJECT
935: Directs the scanner to proceed on to the
936: .Qq second best
937: rule which matched the input
938: .Pq or a prefix of the input .
939: The rule is chosen as described above in
940: .Sx HOW THE INPUT IS MATCHED ,
941: and
942: .Fa yytext
1.1 deraadt 943: and
1.16 jmc 944: .Fa yyleng
1.1 deraadt 945: set up appropriately.
946: It may either be one which matched as much text
947: as the originally chosen rule but came later in the
1.16 jmc 948: .Nm
1.1 deraadt 949: input file, or one which matched less text.
950: For example, the following will both count the
1.16 jmc 951: words in the input and call the routine
952: .Fn special
953: whenever
954: .Qq frob
955: is seen:
956: .Bd -literal -offset indent
957: int word_count = 0;
958: %%
959:
960: frob special(); REJECT;
961: [^ \et\en]+ ++word_count;
962: .Ed
963: .Pp
1.1 deraadt 964: Without the
1.16 jmc 965: .Em REJECT ,
966: any "frob"'s in the input would not be counted as words,
967: since the scanner normally executes only one action per token.
1.1 deraadt 968: Multiple
1.16 jmc 969: .Em REJECT Ns 's
970: are allowed,
971: each one finding the next best choice to the currently active rule.
972: For example, when the following scanner scans the token
973: .Qq abcd ,
974: it will write
975: .Qq abcdabcaba
976: to the output:
977: .Bd -literal -offset indent
978: %%
979: a |
980: ab |
981: abc |
982: abcd ECHO; REJECT;
983: \&.|\en /* eat up any unmatched character */
984: .Ed
985: .Pp
1.1 deraadt 986: (The first three rules share the fourth's action since they use
1.16 jmc 987: the special
988: .Sq |\&
989: action.)
990: .Em REJECT
1.1 deraadt 991: is a particularly expensive feature in terms of scanner performance;
1.16 jmc 992: if it is used in any of the scanner's actions it will slow down
993: all of the scanner's matching.
994: Furthermore,
995: .Em REJECT
1.1 deraadt 996: cannot be used with the
1.16 jmc 997: .Fl Cf
1.1 deraadt 998: or
1.16 jmc 999: .Fl CF
1000: options
1001: .Pq see below .
1002: .Pp
1.1 deraadt 1003: Note also that unlike the other special actions,
1.16 jmc 1004: .Em REJECT
1.1 deraadt 1005: is a
1.16 jmc 1006: .Em branch ;
1007: code immediately following it in the action will not be executed.
1008: .It yymore()
1009: Tells the scanner that the next time it matches a rule, the corresponding
1010: token should be appended onto the current value of
1011: .Fa yytext
1012: rather than replacing it.
1013: For example, given the input
1014: .Qq mega-kludge
1015: the following will write
1016: .Qq mega-mega-kludge
1017: to the output:
1018: .Bd -literal -offset indent
1019: %%
1020: mega- ECHO; yymore();
1021: kludge ECHO;
1022: .Ed
1023: .Pp
1024: First
1025: .Qq mega-
1026: is matched and echoed to the output.
1027: Then
1028: .Qq kludge
1029: is matched, but the previous
1030: .Qq mega-
1031: is still hanging around at the beginning of
1032: .Fa yytext
1.1 deraadt 1033: so the
1.16 jmc 1034: .Em ECHO
1035: for the
1036: .Qq kludge
1037: rule will actually write
1038: .Qq mega-kludge .
1039: .Pp
1.1 deraadt 1040: Two notes regarding use of
1.16 jmc 1041: .Fn yymore :
1.1 deraadt 1042: First,
1.16 jmc 1043: .Fn yymore
1.1 deraadt 1044: depends on the value of
1.16 jmc 1045: .Fa yyleng
1046: correctly reflecting the size of the current token, so
1047: .Fa yyleng
1048: must not be modified when using
1049: .Fn yymore .
1.1 deraadt 1050: Second, the presence of
1.16 jmc 1051: .Fn yymore
1.1 deraadt 1052: in the scanner's action entails a minor performance penalty in the
1053: scanner's matching speed.
1.16 jmc 1054: .It yyless(n)
1055: Returns all but the first
1056: .Ar n
1.1 deraadt 1057: characters of the current token back to the input stream, where they
1058: will be rescanned when the scanner looks for the next match.
1.16 jmc 1059: .Fa yytext
1.1 deraadt 1060: and
1.16 jmc 1061: .Fa yyleng
1.1 deraadt 1062: are adjusted appropriately (e.g.,
1.16 jmc 1063: .Fa yyleng
1.1 deraadt 1064: will now be equal to
1.16 jmc 1065: .Ar n ) .
1066: For example, on the input
1067: .Qq foobar
1068: the following will write out
1069: .Qq foobarbar :
1070: .Bd -literal -offset indent
1071: %%
1072: foobar ECHO; yyless(3);
1073: [a-z]+ ECHO;
1074: .Ed
1075: .Pp
1.1 deraadt 1076: An argument of 0 to
1.16 jmc 1077: .Fn yyless
1078: will cause the entire current input string to be scanned again.
1079: Unless the way the scanner subsequently processes its input has been changed
1080: (using
1081: .Em BEGIN ,
1082: for example),
1083: this will result in an endless loop.
1084: .Pp
1.1 deraadt 1085: Note that
1.16 jmc 1086: .Fn yyless
1087: is a macro and can only be used in the
1088: .Nm
1089: input file, not from other source files.
1090: .It unput(c)
1091: Puts the character
1092: .Ar c
1093: back into the input stream.
1094: It will be the next character scanned.
1.1 deraadt 1095: The following action will take the current token and cause it
1096: to be rescanned enclosed in parentheses.
1.16 jmc 1097: .Bd -literal -offset indent
1098: {
1099: int i;
1100: char *yycopy;
1101:
1102: /* Copy yytext because unput() trashes yytext */
1103: if ((yycopy = strdup(yytext)) == NULL)
1104: err(1, NULL);
1105: unput(')');
1106: for (i = yyleng - 1; i >= 0; --i)
1107: unput(yycopy[i]);
1108: unput('(');
1109: free(yycopy);
1110: }
1111: .Ed
1112: .Pp
1.1 deraadt 1113: Note that since each
1.16 jmc 1114: .Fn unput
1115: puts the given character back at the beginning of the input stream,
1116: pushing back strings must be done back-to-front.
1117: .Pp
1.1 deraadt 1118: An important potential problem when using
1.16 jmc 1119: .Fn unput
1120: is that if using
1121: .Dq %pointer
1122: .Pq the default ,
1123: a call to
1124: .Fn unput
1125: destroys the contents of
1126: .Fa yytext ,
1.1 deraadt 1127: starting with its rightmost character and devouring one character to
1.16 jmc 1128: the left with each call.
1129: If the value of
1130: .Fa yytext
1131: should be preserved after a call to
1132: .Fn unput
1133: .Pq as in the above example ,
1134: it must either first be copied elsewhere, or the scanner must be built using
1135: .Dq %array
1136: instead (see
1137: .Sx HOW THE INPUT IS MATCHED ) .
1138: .Pp
1139: Finally, note that EOF cannot be put back
1.1 deraadt 1140: to attempt to mark the input stream with an end-of-file.
1.16 jmc 1141: .It input()
1142: Reads the next character from the input stream.
1143: For example, the following is one way to eat up C comments:
1144: .Bd -literal -offset indent
1145: %%
1146: "/*" {
1147: int c;
1148:
1149: for (;;) {
1150: while ((c = input()) != '*' && c != EOF)
1151: ; /* eat up text of comment */
1152:
1153: if (c == '*') {
1154: while ((c = input()) == '*')
1155: ;
1156: if (c == '/')
1157: break; /* found the end */
1158: }
1159:
1160: if (c == EOF) {
1161: errx(1, "EOF in comment");
1.1 deraadt 1162: break;
1163: }
1.16 jmc 1164: }
1165: }
1166: .Ed
1167: .Pp
1168: (Note that if the scanner is compiled using C++, then
1169: .Fn input
1.1 deraadt 1170: is instead referred to as
1.16 jmc 1171: .Fn yyinput ,
1172: in order to avoid a name clash with the C++ stream by the name of input.)
1173: .It YY_FLUSH_BUFFER
1174: Flushes the scanner's internal buffer
1175: so that the next time the scanner attempts to match a token,
1176: it will first refill the buffer using
1177: .Dv YY_INPUT
1178: (see
1179: .Sx THE GENERATED SCANNER ,
1180: below).
1181: This action is a special case of the more general
1182: .Fn yy_flush_buffer
1183: function, described below in the section
1184: .Sx MULTIPLE INPUT BUFFERS .
1185: .It yyterminate()
1186: Can be used in lieu of a return statement in an action.
1187: It terminates the scanner and returns a 0 to the scanner's caller, indicating
1188: .Qq all done .
1.1 deraadt 1189: By default,
1.16 jmc 1190: .Fn yyterminate
1191: is also called when an end-of-file is encountered.
1192: It is a macro and may be redefined.
1193: .El
1194: .Sh THE GENERATED SCANNER
1.1 deraadt 1195: The output of
1.16 jmc 1196: .Nm
1.1 deraadt 1197: is the file
1.16 jmc 1198: .Pa lex.yy.c ,
1.1 deraadt 1199: which contains the scanning routine
1.16 jmc 1200: .Fn yylex ,
1201: a number of tables used by it for matching tokens,
1202: and a number of auxiliary routines and macros.
1203: By default,
1204: .Fn yylex
1.1 deraadt 1205: is declared as follows:
1.16 jmc 1206: .Bd -unfilled -offset indent
1207: int yylex()
1208: {
1209: ... various definitions and the actions in here ...
1210: }
1211: .Ed
1212: .Pp
1213: (If the environment supports function prototypes, then it will
1214: be "int yylex(void)".)
1215: This definition may be changed by defining the
1216: .Dv YY_DECL
1217: macro.
1218: For example:
1219: .Bd -literal -offset indent
1220: #define YY_DECL float lexscan(a, b) float a, b;
1221: .Ed
1222: .Pp
1223: would give the scanning routine the name
1224: .Em lexscan ,
1225: returning a float, and taking two floats as arguments.
1226: Note that if arguments are given to the scanning routine using a
1227: K&R-style/non-prototyped function declaration,
1228: the definition must be terminated with a semi-colon
1229: .Pq Sq ;\& .
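.Pp
As a hedged sketch of the prototyped alternative
(the name
.Fn lexscan
and its
.Fa verbose
parameter are illustrative assumptions, not part of
.Nm ) ,
a scanner built with an ANSI C compiler could define
.Dv YY_DECL
without the trailing semi-colon:
.Bd -literal -offset indent
%{
#include <stdio.h>

/* Hypothetical name and parameter, chosen for illustration. */
#define YY_DECL int lexscan(int verbose)
%}
%option noyywrap
%%
[a-z]+    if (verbose) printf("word: %s\en", yytext);
\&.|\en      /* ignore everything else */
%%
int
main(void)
{
    /* lexscan() returns 0 once end-of-file is reached. */
    return lexscan(1);
}
.Ed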
1230: .Pp
1.1 deraadt 1231: Whenever
1.16 jmc 1232: .Fn yylex
1.1 deraadt 1233: is called, it scans tokens from the global input file
1.16 jmc 1234: .Pa yyin
1235: .Pq which defaults to stdin .
1236: It continues until it either reaches an end-of-file
1237: .Pq at which point it returns the value 0
1238: or one of its actions executes a
1239: .Em return
1.1 deraadt 1240: statement.
1.16 jmc 1241: .Pp
1.1 deraadt 1242: If the scanner reaches an end-of-file, subsequent calls are undefined
1243: unless either
1.16 jmc 1244: .Em yyin
1245: is pointed at a new input file
1246: .Pq in which case scanning continues from that file ,
1247: or
1248: .Fn yyrestart
1.1 deraadt 1249: is called.
1.16 jmc 1250: .Fn yyrestart
1.1 deraadt 1251: takes one argument, a
1.16 jmc 1252: .Fa FILE *
1253: pointer (which can be nil, if
1254: .Dv YY_INPUT
1255: has been set up to scan from a source other than
1256: .Em yyin ) ,
1.1 deraadt 1257: and initializes
1.16 jmc 1258: .Em yyin
1259: for scanning from that file.
1260: Essentially there is no difference between just assigning
1261: .Em yyin
1.1 deraadt 1262: to a new input file or using
1.16 jmc 1263: .Fn yyrestart
1264: to do so; the latter is available for compatibility with previous versions of
1265: .Nm ,
1.1 deraadt 1266: and because it can be used to switch input files in the middle of scanning.
1.16 jmc 1267: It can also be used to throw away the current input buffer,
1268: by calling it with an argument of
1269: .Em yyin ;
1.1 deraadt 1270: but better is to use
1.16 jmc 1271: .Dv YY_FLUSH_BUFFER
1272: .Pq see above .
1.1 deraadt 1273: Note that
1.16 jmc 1274: .Fn yyrestart
1275: does not reset the start condition to
1276: .Em INITIAL
1277: (see
1278: .Sx START CONDITIONS ,
1279: below).
1280: .Pp
1.1 deraadt 1281: If
1.16 jmc 1282: .Fn yylex
1.1 deraadt 1283: stops scanning due to executing a
1.16 jmc 1284: .Em return
1.1 deraadt 1285: statement in one of the actions, the scanner may then be called again and it
1286: will resume scanning where it left off.
1.16 jmc 1287: .Pp
1288: By default
1289: .Pq and for purposes of efficiency ,
1290: the scanner uses block-reads rather than simple
1291: .Xr getc 3
1.1 deraadt 1292: calls to read characters from
1.16 jmc 1293: .Em yyin .
1.1 deraadt 1294: The nature of how it gets its input can be controlled by defining the
1.16 jmc 1295: .Dv YY_INPUT
1.1 deraadt 1296: macro.
1.16 jmc 1297: .Dv YY_INPUT Ns 's
1298: calling sequence is
1299: .Qq YY_INPUT(buf,result,max_size) .
1300: Its action is to place up to
1301: .Dv max_size
1.1 deraadt 1302: characters in the character array
1.16 jmc 1303: .Em buf
1.1 deraadt 1304: and return in the integer variable
1.16 jmc 1305: .Em result
1306: either the number of characters read or the constant
1307: .Dv YY_NULL
1308: (0 on
1309: .Ux
1310: systems)
1311: to indicate
1312: .Dv EOF .
1313: The default
1314: .Dv YY_INPUT
1315: reads from the global file-pointer
1316: .Qq yyin .
1317: .Pp
1318: A sample definition of
1319: .Dv YY_INPUT
1320: .Pq in the definitions section of the input file :
1321: .Bd -unfilled -offset indent
1322: %{
1323: #define YY_INPUT(buf,result,max_size) \e
1324: { \e
1325: int c = getchar(); \e
1326: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1327: }
1328: %}
1329: .Ed
1330: .Pp
1.1 deraadt 1331: This definition will change the input processing to occur
1332: one character at a time.
1.16 jmc 1333: .Pp
1334: When the scanner receives an end-of-file indication from
1335: .Dv YY_INPUT ,
1.1 deraadt 1336: it then checks the
1.16 jmc 1337: .Fn yywrap
1338: function.
1339: If
1340: .Fn yywrap
1341: returns false
1342: .Pq zero ,
1343: then it is assumed that the function has gone ahead and set up
1344: .Em yyin
1345: to point to another input file, and scanning continues.
1346: If it returns true
1347: .Pq non-zero ,
1348: then the scanner terminates, returning 0 to its caller.
1349: Note that in either case, the start condition remains unchanged;
1350: it does not revert to
1351: .Em INITIAL .
1352: .Pp
1.1 deraadt 1353: If you do not supply your own version of
1.16 jmc 1354: .Fn yywrap ,
1.1 deraadt 1355: then you must either use
1.16 jmc 1356: .Dq %option noyywrap
1.1 deraadt 1357: (in which case the scanner behaves as though
1.16 jmc 1358: .Fn yywrap
1.1 deraadt 1359: returned 1), or you must link with
1.16 jmc 1360: .Fl lfl
1.1 deraadt 1361: to obtain the default version of the routine, which always returns 1.
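.Pp
The following is a minimal sketch of a user-supplied
.Fn yywrap
that steps through a list of input files.
The
.Va next_file
pointer and the
.Fn main
routine shown here are assumptions made for the example;
they are not provided by
.Nm .
.Bd -literal -offset indent
%{
#include <stdio.h>

/* Hypothetical: NULL-terminated list of files still to be scanned. */
static char **next_file;
%}
%%
\&.|\en    ECHO;
%%
int
yywrap(void)
{
    FILE *fp;

    while (next_file != NULL && *next_file != NULL) {
        if ((fp = fopen(*next_file++, "r")) != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = fp;
            return 0;    /* more input: keep scanning */
        }
    }
    return 1;            /* no more files: yylex() returns 0 */
}

int
main(int argc, char *argv[])
{
    next_file = argv + 1;
    if (argc > 1 && yywrap())
        return 1;        /* none of the files could be opened */
    yylex();
    return 0;
}
.Ed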
1.16 jmc 1362: .Pp
1.1 deraadt 1363: Three routines are available for scanning from in-memory buffers rather
1364: than files:
1.16 jmc 1365: .Fn yy_scan_string ,
1366: .Fn yy_scan_bytes ,
1.1 deraadt 1367: and
1.16 jmc 1368: .Fn yy_scan_buffer .
1369: See the discussion of them below in the section
1370: .Sx MULTIPLE INPUT BUFFERS .
1371: .Pp
1.1 deraadt 1372: The scanner writes its
1.16 jmc 1373: .Em ECHO
1.1 deraadt 1374: output to the
1.16 jmc 1375: .Em yyout
1376: global
1377: .Pq default, stdout ,
1378: which may be redefined by the user simply by assigning it to some other
1379: .Va FILE
1.1 deraadt 1380: pointer.
1.16 jmc 1381: .Sh START CONDITIONS
1382: .Nm
1383: provides a mechanism for conditionally activating rules.
1384: Any rule whose pattern is prefixed with
1385: .Qq Aq sc
1386: will only be active when the scanner is in the start condition named
1387: .Qq sc .
1388: For example,
1389: .Bd -literal -offset indent
1390: <STRING>[^"]* { /* eat up the string body ... */
1391: ...
1392: }
1393: .Ed
1394: .Pp
1395: will be active only when the scanner is in the
1396: .Qq STRING
1397: start condition, and
1398: .Bd -literal -offset indent
1399: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1400: ...
1401: }
1402: .Ed
1403: .Pp
1404: will be active only when the current start condition is either
1405: .Qq INITIAL ,
1406: .Qq STRING ,
1407: or
1408: .Qq QUOTE .
1409: .Pp
1410: Start conditions are declared in the definitions
1411: .Pq first
1412: section of the input using unindented lines beginning with either
1413: .Sq %s
1.1 deraadt 1414: or
1.16 jmc 1415: .Sq %x
1.1 deraadt 1416: followed by a list of names.
1417: The former declares
1.16 jmc 1418: .Em inclusive
1.1 deraadt 1419: start conditions, the latter
1.16 jmc 1420: .Em exclusive
1421: start conditions.
1422: A start condition is activated using the
1423: .Em BEGIN
1424: action.
1425: Until the next
1426: .Em BEGIN
1427: action is executed, rules with the given start condition will be active and
1.1 deraadt 1428: rules with other start conditions will be inactive.
1.16 jmc 1429: If the start condition is inclusive,
1.1 deraadt 1430: then rules with no start conditions at all will also be active.
1.16 jmc 1431: If it is exclusive,
1432: then only rules qualified with the start condition will be active.
1.1 deraadt 1433: A set of rules contingent on the same exclusive start condition
1434: describe a scanner which is independent of any of the other rules in the
1.16 jmc 1435: .Nm
1436: input.
1437: Because of this, exclusive start conditions make it easy to specify
1438: .Qq mini-scanners
1.1 deraadt 1439: which scan portions of the input that are syntactically different
1.16 jmc 1440: from the rest
1441: .Pq e.g., comments .
1442: .Pp
1.1 deraadt 1443: If the distinction between inclusive and exclusive start conditions
1444: is still a little vague, here's a simple example illustrating the
1.16 jmc 1445: connection between the two.
1446: The set of rules:
1447: .Bd -literal -offset indent
1448: %s example
1449: %%
1450:
1451: <example>foo do_something();
1452:
1453: bar something_else();
1454: .Ed
1455: .Pp
1.1 deraadt 1456: is equivalent to
1.16 jmc 1457: .Bd -literal -offset indent
1458: %x example
1459: %%
1460:
1461: <example>foo do_something();
1462:
1463: <INITIAL,example>bar something_else();
1464: .Ed
1465: .Pp
1.1 deraadt 1466: Without the
1.16 jmc 1467: .Aq INITIAL,example
1.1 deraadt 1468: qualifier, the
1.16 jmc 1469: .Dq bar
1470: pattern in the second example wouldn't be active
1471: .Pq i.e., couldn't match
1.1 deraadt 1472: when in start condition
1.16 jmc 1473: .Dq example .
1.1 deraadt 1474: If we just used
1.16 jmc 1475: .Aq example
1.1 deraadt 1476: to qualify
1.16 jmc 1477: .Dq bar ,
1.1 deraadt 1478: though, then it would only be active in
1.16 jmc 1479: .Dq example
1.1 deraadt 1480: and not in
1.16 jmc 1481: .Em INITIAL ,
1482: while in the first example it's active in both,
1483: because in the first example the
1484: .Dq example
1485: start condition is an inclusive
1486: .Pq Sq %s
1.1 deraadt 1487: start condition.
1.16 jmc 1488: .Pp
1.1 deraadt 1489: Also note that the special start-condition specifier
1.16 jmc 1490: .Sq Aq *
1491: matches every start condition.
1492: Thus, the above example could also have been written:
1493: .Bd -literal -offset indent
1494: %x example
1495: %%
1496:
1497: <example>foo do_something();
1498:
1499: <*>bar something_else();
1500: .Ed
1501: .Pp
1.1 deraadt 1502: The default rule (to
1.16 jmc 1503: .Em ECHO
1504: any unmatched character) remains active in start conditions.
1505: It is equivalent to:
1506: .Bd -literal -offset indent
1507: <*>.|\en ECHO;
1508: .Ed
1509: .Pp
1510: .Dq BEGIN(0)
1.1 deraadt 1511: returns to the original state where only the rules with
1.16 jmc 1512: no start conditions are active.
1513: This state can also be referred to as the start-condition
1514: .Em INITIAL ,
1515: so
1516: .Dq BEGIN(INITIAL)
1.1 deraadt 1517: is equivalent to
1.16 jmc 1518: .Dq BEGIN(0) .
1.1 deraadt 1519: (The parentheses around the start condition name are not required but
1520: are considered good style.)
1.16 jmc 1521: .Pp
1522: .Em BEGIN
1.1 deraadt 1523: actions can also be given as indented code at the beginning
1.16 jmc 1524: of the rules section.
1525: For example, the following will cause the scanner to enter the
1526: .Qq SPECIAL
1527: start condition whenever
1528: .Fn yylex
1.1 deraadt 1529: is called and the global variable
1.16 jmc 1530: .Fa enter_special
1.1 deraadt 1531: is true:
1.16 jmc 1532: .Bd -literal -offset indent
1533: int enter_special;
1.1 deraadt 1534:
1.16 jmc 1535: %x SPECIAL
1536: %%
1537: if (enter_special)
1.1 deraadt 1538: BEGIN(SPECIAL);
1539:
1.16 jmc 1540: <SPECIAL>blahblahblah
1541: \&...more rules follow...
1542: .Ed
1543: .Pp
1.1 deraadt 1544: To illustrate the uses of start conditions,
1545: here is a scanner which provides two different interpretations
1.16 jmc 1546: of a string like
1547: .Qq 123.456 .
1548: By default it will treat it as three tokens: the integer
1549: .Qq 123 ,
1550: a dot
1551: .Pq Sq .\& ,
1552: and the integer
1553: .Qq 456 .
1.1 deraadt 1554: But if the string is preceded earlier in the line by the string
1.16 jmc 1555: .Qq expect-floats
1556: it will treat it as a single token, the floating-point number 123.456:
1557: .Bd -literal -offset indent
1558: %{
1559: #include <stdlib.h>
1560: %}
1561: %s expect
1562:
1563: %%
1564: expect-floats BEGIN(expect);
1565:
1566: <expect>[0-9]+"."[0-9]+ {
1567: printf("found a float, = %f\en",
1568: atof(yytext));
1569: }
1570: <expect>\en {
1571: /*
1572: * That's the end of the line, so
1573: * we need another "expect-floats"
1574: * before we'll recognize any more
1575: * numbers.
1576: */
1577: BEGIN(INITIAL);
1578: }
1579:
1580: [0-9]+ {
1581: printf("found an integer, = %d\en",
1582: atoi(yytext));
1583: }
1584:
1585: "." printf("found a dot\en");
1586: .Ed
1587: .Pp
1588: Here is a scanner which recognizes
1589: .Pq and discards
1590: C comments while maintaining a count of the current input line:
1591: .Bd -literal -offset indent
1592: %x comment
1593: %%
1594: int line_num = 1;
1595:
1596: "/*" BEGIN(comment);
1597:
1598: <comment>[^*\en]* /* eat anything that's not a '*' */
1599: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1600: <comment>\en ++line_num;
1601: <comment>"*"+"/" BEGIN(INITIAL);
1602: .Ed
1603: .Pp
1.1 deraadt 1604: This scanner goes to a bit of trouble to match as much
1.16 jmc 1605: text as possible with each rule.
1606: In general, when attempting to write a high-speed scanner
1607: try to match as much as possible in each rule, as it's a big win.
1608: .Pp
1.10 deraadt 1609: Note that start-condition names are really integer values and
1.16 jmc 1610: can be stored as such.
1611: Thus, the above could be extended in the following fashion:
1612: .Bd -literal -offset indent
1613: %x comment foo
1614: %%
1615: int line_num = 1;
1616: int comment_caller;
1617:
1618: "/*" {
1619: comment_caller = INITIAL;
1620: BEGIN(comment);
1621: }
1622:
1623: \&...
1624:
1625: <foo>"/*" {
1626: comment_caller = foo;
1627: BEGIN(comment);
1628: }
1629:
1630: <comment>[^*\en]* /* eat anything that's not a '*' */
1631: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1632: <comment>\en ++line_num;
1633: <comment>"*"+"/" BEGIN(comment_caller);
1634: .Ed
1635: .Pp
1636: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1637: the integer-valued
1.16 jmc 1638: .Dv YY_START
1639: macro.
1640: For example, the above assignments to
1641: .Em comment_caller
1.1 deraadt 1642: could instead be written
1.16 jmc 1643: .Pp
1644: .Dl comment_caller = YY_START;
1645: .Pp
1.1 deraadt 1646: Flex provides
1.16 jmc 1647: .Dv YYSTATE
1.1 deraadt 1648: as an alias for
1.16 jmc 1649: .Dv YY_START
1.36 schwarze 1650: (since that is what's used by
1651: .At
1.16 jmc 1652: .Nm lex ) .
1653: .Pp
1654: Note that start conditions do not have their own name-space;
1655: %s's and %x's declare names in the same fashion as #define's.
1656: .Pp
1.1 deraadt 1657: Finally, here's an example of how to match C-style quoted strings using
1.16 jmc 1658: exclusive start conditions, including expanded escape sequences
1659: (but not including checking for a string that's too long):
1660: .Bd -literal -offset indent
1661: %x str
1662:
1663: %%
1664: #define MAX_STR_CONST 1024
1665: char string_buf[MAX_STR_CONST];
1666: char *string_buf_ptr;
1667:
1668: \e" string_buf_ptr = string_buf; BEGIN(str);
1669:
1670: <str>\e" { /* saw closing quote - all done */
1671: BEGIN(INITIAL);
1672: *string_buf_ptr = '\e0';
1673: /*
1674: * return string constant token type and
1675: * value to parser
1676: */
1677: }
1678:
1679: <str>\en {
1680: /* error - unterminated string constant */
1681: /* generate error message */
1682: }
1683:
1684: <str>\e\e[0-7]{1,3} {
1685: /* octal escape sequence */
1686: unsigned int result;
1687:
1688: (void) sscanf(yytext + 1, "%o", &result);
1689:
1690: if (result > 0xff) {
1691: /* error, constant is out-of-bounds */
1692: } else
1693: *string_buf_ptr++ = result;
1694: }
1695:
1696: <str>\e\e[0-9]+ {
1697: /*
1698: * generate error - bad escape sequence; something
1699: * like '\e48' or '\e0777777'
1700: */
1701: }
1702:
1703: <str>\e\en *string_buf_ptr++ = '\en';
1704: <str>\e\et *string_buf_ptr++ = '\et';
1705: <str>\e\er *string_buf_ptr++ = '\er';
1706: <str>\e\eb *string_buf_ptr++ = '\eb';
1707: <str>\e\ef *string_buf_ptr++ = '\ef';
1708:
1709: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1710:
1711: <str>[^\e\e\en\e"]+ {
1712: char *yptr = yytext;
1713:
1714: while (*yptr)
1715: *string_buf_ptr++ = *yptr++;
1716: }
1717: .Ed
1718: .Pp
1719: Often, such as in some of the examples above,
1720: a whole bunch of rules are all preceded by the same start condition(s).
1721: .Nm
1.1 deraadt 1722: makes this a little easier and cleaner by introducing a notion of
1723: start condition
1.16 jmc 1724: .Em scope .
1.1 deraadt 1725: A start condition scope is begun with:
1.16 jmc 1726: .Pp
1727: .Dl <SCs>{
1728: .Pp
1.1 deraadt 1729: where
1.16 jmc 1730: .Dq SCs
1731: is a list of one or more start conditions.
1732: Inside the start condition scope, every rule automatically has the prefix
1733: .Aq SCs
1.1 deraadt 1734: applied to it, until a
1.16 jmc 1735: .Sq }
1.1 deraadt 1736: which matches the initial
1.16 jmc 1737: .Sq { .
1.1 deraadt 1738: So, for example,
1.16 jmc 1739: .Bd -literal -offset indent
1740: <ESC>{
1741: "\e\en" return '\en';
1742: "\e\er" return '\er';
1743: "\e\ef" return '\ef';
1744: "\e\e0" return '\e0';
1745: }
1746: .Ed
1747: .Pp
1.1 deraadt 1748: is equivalent to:
1.16 jmc 1749: .Bd -literal -offset indent
1750: <ESC>"\e\en" return '\en';
1751: <ESC>"\e\er" return '\er';
1752: <ESC>"\e\ef" return '\ef';
1753: <ESC>"\e\e0" return '\e0';
1754: .Ed
1755: .Pp
1.1 deraadt 1756: Start condition scopes may be nested.
1.16 jmc 1757: .Pp
1.1 deraadt 1758: Three routines are available for manipulating stacks of start conditions:
1.16 jmc 1759: .Bl -tag -width Ds
1760: .It void yy_push_state(int new_state)
1761: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1762: stack and switches to
1.16 jmc 1763: .Fa new_state
1764: as though
1765: .Dq BEGIN new_state
1766: had been used
1767: .Pq recall that start condition names are also integers .
1768: .It void yy_pop_state()
1769: Pops the top of the stack and switches to it via
1770: .Em BEGIN .
1771: .It int yy_top_state()
1772: Returns the top of the stack without altering the stack's contents.
1773: .El
1774: .Pp
1.1 deraadt 1775: The start condition stack grows dynamically and so has no built-in
1.16 jmc 1776: size limitation.
1777: If memory is exhausted, program execution aborts.
1778: .Pp
1779: To use start condition stacks, scanners must include a
1780: .Dq %option stack
1781: directive (see
1782: .Sx OPTIONS
1783: below).
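.Pp
As an illustration of the stack routines,
the sketch below discards comments that are allowed to nest,
which a single
.Em BEGIN
cannot keep track of.
Nesting of
.Dq /*
is an assumption of this example, not a property of C.
.Bd -literal -offset indent
%option stack noyywrap
%x comment
%%
"/*"         yy_push_state(comment);
<comment>{
"/*"         yy_push_state(comment);  /* one level deeper */
"*/"         yy_pop_state();          /* back to the enclosing state */
[^*/\en]+     /* eat the comment body */
\&.|\en         /* eat a lone '*', '/' or newline */
}
.Ed
.Pp
Text outside comments is handled by the default rule and echoed.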
1784: .Sh MULTIPLE INPUT BUFFERS
1785: Some scanners
1786: (such as those which support
1787: .Qq include
1788: files)
1789: require reading from several input streams.
1790: As
1791: .Nm
1.1 deraadt 1792: scanners do a large amount of buffering, one cannot control
1793: where the next input will be read from by simply writing a
1.16 jmc 1794: .Dv YY_INPUT
1.1 deraadt 1795: which is sensitive to the scanning context.
1.16 jmc 1796: .Dv YY_INPUT
1.1 deraadt 1797: is only called when the scanner reaches the end of its buffer, which
1.16 jmc 1798: may be a long time after scanning a statement such as an
1799: .Qq include
1.1 deraadt 1800: which requires switching the input source.
1.16 jmc 1801: .Pp
1.1 deraadt 1802: To negotiate these sorts of problems,
1.16 jmc 1803: .Nm
1.1 deraadt 1804: provides a mechanism for creating and switching between multiple
1.16 jmc 1805: input buffers.
1806: An input buffer is created by using:
1807: .Pp
1808: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1809: .Pp
1.1 deraadt 1810: which takes a
1.16 jmc 1811: .Fa FILE
1812: pointer and a
1813: .Fa size
1814: and creates a buffer associated with the given file and large enough to hold
1815: .Fa size
1.1 deraadt 1816: characters (when in doubt, use
1.16 jmc 1817: .Dv YY_BUF_SIZE
1818: for the size).
1819: It returns a
1820: .Dv YY_BUFFER_STATE
1821: handle, which may then be passed to other routines
1822: .Pq see below .
1823: The
1824: .Dv YY_BUFFER_STATE
1.1 deraadt 1825: type is a pointer to an opaque
1.16 jmc 1826: .Dq struct yy_buffer_state
1827: structure, so
1828: .Dv YY_BUFFER_STATE
1829: variables may be safely initialized to
1830: .Dq ((YY_BUFFER_STATE) 0)
1831: if desired, and the opaque structure can also be referred to in order to
1832: correctly declare input buffers in source files other than that of scanners.
1833: Note that the
1834: .Fa FILE
1.1 deraadt 1835: pointer in the call to
1.16 jmc 1836: .Fn yy_create_buffer
1.1 deraadt 1837: is only used as the value of
1.16 jmc 1838: .Fa yyin
1.1 deraadt 1839: seen by
1.16 jmc 1840: .Dv YY_INPUT ;
1841: if
1842: .Dv YY_INPUT
1843: is redefined so that it no longer uses
1844: .Fa yyin ,
1845: then a nil
1846: .Fa FILE
1847: pointer can safely be passed to
1848: .Fn yy_create_buffer .
1849: To select a particular buffer to scan:
1850: .Pp
1851: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1852: .Pp
1853: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1854: come from
1.16 jmc 1855: .Fa new_buffer .
1.1 deraadt 1856: Note that
1.16 jmc 1857: .Fn yy_switch_to_buffer
1858: may be used by
1859: .Fn yywrap
1860: to set things up for continued scanning,
1861: instead of opening a new file and pointing
1862: .Fa yyin
1863: at it.
1864: Note also that switching input sources via either
1865: .Fn yy_switch_to_buffer
1866: or
1867: .Fn yywrap
1868: does not change the start condition.
1869: .Pp
1870: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1871: .Pp
1872: is used to reclaim the storage associated with a buffer.
1873: .Pf ( Fa buffer
1.1 deraadt 1874: can be nil, in which case the routine does nothing.)
1.16 jmc 1875: To clear the current contents of a buffer:
1876: .Pp
1877: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1878: .Pp
1.1 deraadt 1879: This function discards the buffer's contents,
1.16 jmc 1880: so the next time the scanner attempts to match a token from the buffer,
1881: it will first fill the buffer anew using
1882: .Dv YY_INPUT .
1883: .Pp
1884: .Fn yy_new_buffer
1.1 deraadt 1885: is an alias for
1.16 jmc 1886: .Fn yy_create_buffer ,
1.1 deraadt 1887: provided for compatibility with the C++ use of
1.16 jmc 1888: .Em new
1.1 deraadt 1889: and
1.16 jmc 1890: .Em delete
1.1 deraadt 1891: for creating and destroying dynamic objects.
1.16 jmc 1892: .Pp
1.1 deraadt 1893: Finally, the
1.16 jmc 1894: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1895: macro returns a
1.16 jmc 1896: .Dv YY_BUFFER_STATE
1.1 deraadt 1897: handle to the current buffer.
1.16 jmc 1898: .Pp
1.1 deraadt 1899: Here is an example of using these features for writing a scanner
1900: which expands include files (the
1.16 jmc 1901: .Aq Aq EOF
1.1 deraadt 1902: feature is discussed below):
1.16 jmc 1903: .Bd -literal -offset indent
1904: /*
1905: * the "incl" state is used for picking up the name
1906: * of an include file
1907: */
1908: %x incl
1909:
1910: %{
1911: #define MAX_INCLUDE_DEPTH 10
1912: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1913: int include_stack_ptr = 0;
1914: %}
1915:
1916: %%
1917: include BEGIN(incl);
1918:
1919: [a-z]+ ECHO;
1920: [^a-z\en]*\en? ECHO;
1921:
1922: <incl>[ \et]* /* eat the whitespace */
1923: <incl>[^ \et\en]+ { /* got the include file name */
1924: if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1925: errx(1, "Includes nested too deeply");
1926:
1927: include_stack[include_stack_ptr++] =
1928: YY_CURRENT_BUFFER;
1929:
1930: yyin = fopen(yytext, "r");
1931:
1932: if (yyin == NULL)
1933: err(1, NULL);
1.1 deraadt 1934:
1.16 jmc 1935: yy_switch_to_buffer(
1936: yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1937:
1.16 jmc 1938: BEGIN(INITIAL);
1939: }
1.1 deraadt 1940:
1.16 jmc 1941: <<EOF>> {
1942: if (--include_stack_ptr < 0)
1.1 deraadt 1943: yyterminate();
1.16 jmc 1944: else {
1945: yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1946: yy_switch_to_buffer(
1.16 jmc 1947: include_stack[include_stack_ptr]);
1948: }
1949: }
1950: .Ed
1951: .Pp
1.1 deraadt 1952: Three routines are available for setting up input buffers for
1.16 jmc 1953: scanning in-memory strings instead of files.
1954: All of them create a new input buffer for scanning the string,
1955: and return a corresponding
1956: .Dv YY_BUFFER_STATE
1957: handle (which should be deleted afterwards using
1958: .Fn yy_delete_buffer ) .
1959: They also switch to the new buffer using
1960: .Fn yy_switch_to_buffer ,
1.1 deraadt 1961: so the next call to
1.16 jmc 1962: .Fn yylex
1.1 deraadt 1963: will start scanning the string.
1.16 jmc 1964: .Bl -tag -width Ds
1965: .It yy_scan_string(const char *str)
1966: Scans a NUL-terminated string.
1967: .It yy_scan_bytes(const char *bytes, int len)
1968: Scans
1969: .Fa len
1970: bytes
1971: .Pq including possibly NUL's
1.1 deraadt 1972: starting at location
1.16 jmc 1973: .Fa bytes .
1974: .El
1975: .Pp
1976: Note that both of these functions create and scan a copy
1977: of the string or bytes.
1978: (This may be desirable, since
1979: .Fn yylex
1980: modifies the contents of the buffer it is scanning.)
1981: The copy can be avoided by using:
1982: .Bl -tag -width Ds
1983: .It yy_scan_buffer(char *base, yy_size_t size)
1984: Which scans the buffer starting at
1985: .Fa base ,
1.1 deraadt 1986: consisting of
1.16 jmc 1987: .Fa size
1988: bytes, the last two bytes of which must be
1989: .Dv YY_END_OF_BUFFER_CHAR
1990: .Pq ASCII NUL .
1991: These last two bytes are not scanned; thus, scanning consists of
1992: base[0] through base[size-3], inclusive.
1993: .Pp
1994: If
1995: .Fa base
1996: is not set up in this manner
1997: (i.e., if the final two
1998: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 1999: bytes are missing), then
1.16 jmc 2000: .Fn yy_scan_buffer
1.1 deraadt 2001: returns a nil pointer instead of creating a new input buffer.
1.16 jmc 2002: .Pp
1.1 deraadt 2003: The type
1.16 jmc 2004: .Fa yy_size_t
2005: is an integral type to which one can cast an integer expression
1.1 deraadt 2006: reflecting the size of the buffer.
1.16 jmc 2007: .El
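.Pp
A minimal sketch of scanning an in-memory string follows;
the rules and the input text are invented for the example.
.Bd -literal -offset indent
%{
#include <stdio.h>
%}
%option noyywrap
%%
[0-9]+       printf("number: %s\en", yytext);
[ \et\en]+     /* skip whitespace */
\&.            printf("other: %s\en", yytext);
%%
int
main(void)
{
    YY_BUFFER_STATE buf;

    /* Copies the string and switches to the new buffer. */
    buf = yy_scan_string("12 + 34\en");
    yylex();
    /* The handle must be released by the caller. */
    yy_delete_buffer(buf);
    return 0;
}
.Ed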
2008: .Sh END-OF-FILE RULES
2009: The special rule
2010: .Qq Aq Aq EOF
2011: indicates actions which are to be taken when an end-of-file is encountered and
2012: .Fn yywrap
2013: returns non-zero
2014: .Pq i.e., indicates no further files to process .
2015: The action must finish by doing one of four things:
2016: .Bl -dash
2017: .It
2018: Assigning
2019: .Em yyin
2020: to a new input file
2021: (in previous versions of
2022: .Nm ,
2023: after doing the assignment, it was necessary to call the special action
2024: .Dv YY_NEW_FILE ;
2025: this is no longer necessary).
2026: .It
2027: Executing a
2028: .Em return
2029: statement.
2030: .It
2031: Executing the special
2032: .Fn yyterminate
2033: action.
2034: .It
2035: Switching to a new buffer using
2036: .Fn yy_switch_to_buffer
1.1 deraadt 2037: as shown in the example above.
1.16 jmc 2038: .El
2039: .Pp
2040: .Aq Aq EOF
2041: rules may not be used with other patterns;
2042: they may only be qualified with a list of start conditions.
2043: If an unqualified
2044: .Aq Aq EOF
2045: rule is given, it applies to all start conditions which do not already have
2046: .Aq Aq EOF
2047: actions.
2048: To specify an
2049: .Aq Aq EOF
2050: rule for only the initial start condition, use
2051: .Pp
2052: .Dl <INITIAL><<EOF>>
2053: .Pp
1.1 deraadt 2054: These rules are useful for catching things like unclosed comments.
2055: An example:
1.16 jmc 2056: .Bd -literal -offset indent
2057: %x quote
2058: %%
2059:
2060: \&...other rules for dealing with quotes...
2061:
2062: <quote><<EOF>> {
2063: 	error("unterminated quote");
2064: 	yyterminate();
2065: }
2066: <<EOF>> {
2067: 	if (*++filelist)
2068: 		yyin = fopen(*filelist, "r");
2069: 	else
2070: 		yyterminate();
2071: }
2072: .Ed
2073: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2074: The macro
1.16 jmc 2075: .Dv YY_USER_ACTION
1.1 deraadt 2076: can be defined to provide an action
1.16 jmc 2077: which is always executed prior to the matched rule's action.
2078: For example,
1.1 deraadt 2079: it could be #define'd to call a routine to convert yytext to lower-case.
2080: When
1.16 jmc 2081: .Dv YY_USER_ACTION
1.1 deraadt 2082: is invoked, the variable
1.16 jmc 2083: .Fa yy_act
2084: gives the number of the matched rule
2085: .Pq rules are numbered starting with 1 .
2086: For example, to profile how often each rule is matched,
2087: the following would do the trick:
2088: .Pp
2089: .Dl #define YY_USER_ACTION ++ctr[yy_act]
2090: .Pp
1.1 deraadt 2091: where
1.16 jmc 2092: .Fa ctr
2093: is an array to hold the counts for the different rules.
2094: Note that the macro
2095: .Dv YY_NUM_RULES
2096: gives the total number of rules
2097: (including the default rule, even if
2098: .Fl s
2099: is used),
1.1 deraadt 2100: so a correct declaration for
1.16 jmc 2101: .Fa ctr
1.1 deraadt 2102: is:
1.16 jmc 2103: .Pp
2104: .Dl int ctr[YY_NUM_RULES];
2105: .Pp
1.1 deraadt 2106: The macro
1.16 jmc 2107: .Dv YY_USER_INIT
1.1 deraadt 2108: may be defined to provide an action which is always executed before
1.16 jmc 2109: the first scan
2110: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2111: For example, it could be used to call a routine to read
2112: in a data table or open a logging file.
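.Pp
As a minimal sketch of the logging case,
the definitions section below opens a log file the first time the scanner runs.
The variable
.Fa logfp
and the file name
.Pa scan.log
are hypothetical, and the result of
.Xr fopen 3
is deliberately left unchecked to keep the sketch short:
.Bd -literal -offset indent
%{
#include <stdio.h>

static FILE *logfp;	/* hypothetical log handle */

/* Expanded once, before the first scan. */
#define YY_USER_INIT	logfp = fopen("scan.log", "w")
%}
.Ed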
1.16 jmc 2113: .Pp
1.1 deraadt 2114: The macro
1.16 jmc 2115: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2116: can be used to control whether the current buffer is considered
1.16 jmc 2117: .Em interactive .
1.1 deraadt 2118: An interactive buffer is processed more slowly,
2119: but must be used when the scanner's input source is indeed
2120: interactive to avoid problems due to waiting to fill buffers
2121: (see the discussion of the
1.16 jmc 2122: .Fl I
2123: flag below).
2124: A non-zero value in the macro invocation marks the buffer as interactive,
2125: a zero value as non-interactive.
2126: Note that use of this macro overrides
2127: .Dq %option always-interactive
2128: or
2129: .Dq %option never-interactive
2130: (see
2131: .Sx OPTIONS
2132: below).
2133: .Fn yy_set_interactive
1.1 deraadt 2134: must be invoked prior to beginning to scan the buffer that is
1.16 jmc 2135: .Pq or is not
2136: to be considered interactive.
2137: .Pp
1.1 deraadt 2138: The macro
1.16 jmc 2139: .Dv yy_set_bol(at_bol)
1.1 deraadt 2140: can be used to control whether the current buffer's scanning
2141: context for the next token match is done as though at the
1.16 jmc 2142: beginning of a line.
2143: A non-zero macro argument makes rules anchored with
2144: .Sq ^
2145: active, while a zero argument makes
2146: .Sq ^
2147: rules inactive.
2148: .Pp
1.1 deraadt 2149: The macro
1.16 jmc 2150: .Dv YY_AT_BOL
2151: returns true if the next token scanned from the current buffer will have
2152: .Sq ^
2153: rules active, false otherwise.
2154: .Pp
1.1 deraadt 2155: In the generated scanner, the actions are all gathered in one large
2156: switch statement and separated using
1.16 jmc 2157: .Dv YY_BREAK ,
2158: which may be redefined.
2159: By default, it is simply a
2160: .Qq break ,
2161: to separate each rule's action from the following rules.
1.1 deraadt 2162: Redefining
1.16 jmc 2163: .Dv YY_BREAK
1.1 deraadt 2164: allows, for example, C++ users to
1.16 jmc 2165: .Dq #define YY_BREAK
2166: to do nothing
2167: (while being very careful that every rule ends with a
2168: .Qq break
2169: or a
2170: .Qq return ! )
2171: to avoid suffering from unreachable statement warnings where, because a rule's
2172: action ends with
2173: .Dq return ,
2174: the
2175: .Dv YY_BREAK
1.1 deraadt 2176: is inaccessible.
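.Pp
As a minimal sketch of this technique,
the following scanner defines
.Dv YY_BREAK
to nothing and ends every action explicitly.
The token code
.Dv TOK_NUMBER
is hypothetical,
and the catch-all rule is an assumption made here so that the default rule,
whose generated action would otherwise depend on
.Dv YY_BREAK ,
is never reached:
.Bd -literal -offset indent
%{
/* Every action below must end in "break" or "return". */
#define YY_BREAK
#define TOK_NUMBER 257	/* hypothetical token code */
%}
%%
[0-9]+	return TOK_NUMBER;
[ \et]+	break;
\&.|\en	{ ECHO; break; }
.Ed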
1.16 jmc 2177: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2178: This section summarizes the various values available to the user
2179: in the rule actions.
1.16 jmc 2180: .Bl -tag -width Ds
2181: .It char *yytext
2182: Holds the text of the current token.
2183: It may be modified but not lengthened
2184: .Pq characters cannot be appended to the end .
2185: .Pp
1.1 deraadt 2186: If the special directive
1.16 jmc 2187: .Dq %array
1.1 deraadt 2188: appears in the first section of the scanner description, then
1.16 jmc 2189: .Fa yytext
1.1 deraadt 2190: is instead declared
1.16 jmc 2191: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2192: where
1.16 jmc 2193: .Dv YYLMAX
2194: is a macro definition that can be redefined in the first section
2195: to change the default value
2196: .Pq generally 8KB .
2197: Using
2198: .Dq %array
1.1 deraadt 2199: results in somewhat slower scanners, but the value of
1.16 jmc 2200: .Fa yytext
1.1 deraadt 2201: becomes immune to calls to
1.16 jmc 2202: .Fn input
1.1 deraadt 2203: and
1.16 jmc 2204: .Fn unput ,
1.1 deraadt 2205: which potentially destroy its value when
1.16 jmc 2206: .Fa yytext
2207: is a character pointer.
2208: The opposite of
2209: .Dq %array
1.1 deraadt 2210: is
1.16 jmc 2211: .Dq %pointer ,
1.1 deraadt 2212: which is the default.
1.16 jmc 2213: .Pp
2214: .Dq %array
2215: cannot be used when generating C++ scanner classes
1.1 deraadt 2216: (the
1.16 jmc 2217: .Fl +
1.1 deraadt 2218: flag).
1.16 jmc 2219: .It int yyleng
2220: Holds the length of the current token.
2221: .It FILE *yyin
2222: Is the file which by default
2223: .Nm
2224: reads from.
2225: It may be redefined, but doing so only makes sense before
2226: scanning begins or after an
2227: .Dv EOF
2228: has been encountered.
2229: Changing it in the midst of scanning will have unexpected results since
2230: .Nm
1.1 deraadt 2231: buffers its input; use
1.16 jmc 2232: .Fn yyrestart
1.1 deraadt 2233: instead.
2234: Once scanning terminates because an end-of-file
1.16 jmc 2235: has been seen,
2236: .Fa yyin
2237: can be assigned as the new input file
2238: and the scanner can be called again to continue scanning.
2239: .It void yyrestart(FILE *new_file)
2240: May be called to point
2241: .Fa yyin
2242: at the new input file.
2243: The switch-over to the new file is immediate
2244: .Pq any previously buffered-up input is lost .
2245: Note that calling
2246: .Fn yyrestart
1.1 deraadt 2247: with
1.16 jmc 2248: .Fa yyin
1.1 deraadt 2249: as an argument thus throws away the current input buffer and continues
2250: scanning the same input file.
1.16 jmc 2251: .It FILE *yyout
2252: Is the file to which
2253: .Em ECHO
2254: actions are done.
2255: It can be reassigned by the user.
2256: .It YY_CURRENT_BUFFER
2257: Returns a
2258: .Dv YY_BUFFER_STATE
1.1 deraadt 2259: handle to the current buffer.
1.16 jmc 2260: .It YY_START
2261: Returns an integer value corresponding to the current start condition.
2262: This value can subsequently be used with
2263: .Em BEGIN
1.1 deraadt 2264: to return to that start condition.
1.16 jmc 2265: .El
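.Pp
As a sketch of how
.Dv YY_START
and
.Em BEGIN
can be combined,
the following fragment remembers the start condition that was active
when a comment began and returns to it when the comment ends.
The variable
.Fa comment_caller
is hypothetical and not part of
.Nm :
.Bd -literal -offset indent
%x comment
%{
static int comment_caller;	/* hypothetical: saved start condition */
%}
%%
"/*"		{ comment_caller = YY_START; BEGIN(comment); }
<comment>"*/"	BEGIN(comment_caller);
<comment>[^*]+	;	/* comment text, including newlines */
<comment>"*"	;	/* a lone star inside the comment */
.Ed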
2266: .Sh INTERFACING WITH YACC
1.1 deraadt 2267: One of the main uses of
1.16 jmc 2268: .Nm
1.1 deraadt 2269: is as a companion to the
1.16 jmc 2270: .Xr yacc 1
1.1 deraadt 2271: parser-generator.
1.16 jmc 2272: yacc parsers expect to call a routine named
2273: .Fn yylex
2274: to find the next input token.
2275: The routine is supposed to return the type of the next token
2276: as well as putting any associated value in the global
1.17 jmc 2277: .Fa yylval ,
2278: which is defined externally,
2279: and can be a union or any other complex data structure.
1.1 deraadt 2280: To use
1.16 jmc 2281: .Nm
2282: with yacc, one specifies the
2283: .Fl d
2284: option to yacc to instruct it to generate the file
2285: .Pa y.tab.h
1.1 deraadt 2286: containing definitions of all the
1.16 jmc 2287: .Dq %tokens
2288: appearing in the yacc input.
2289: This file is then included in the
2290: .Nm
2291: scanner.
2292: For example, if one of the tokens is
2293: .Qq TOK_NUMBER ,
1.1 deraadt 2294: part of the scanner might look like:
1.16 jmc 2295: .Bd -literal -offset indent
2296: %{
2297: #include "y.tab.h"
2298: %}
2299:
2300: %%
2301:
2302: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
2303: .Ed
2304: .Sh OPTIONS
2305: .Nm
1.1 deraadt 2306: has the following options:
1.16 jmc 2307: .Bl -tag -width Ds
2308: .It Fl 7
2309: Instructs
2310: .Nm
2311: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2312: characters in its input.
2313: The advantage of using
2314: .Fl 7
1.1 deraadt 2315: is that the scanner's tables can be up to half the size of those generated
2316: using the
1.16 jmc 2317: .Fl 8
2318: option
2319: .Pq see below .
2320: The disadvantage is that such scanners often hang
1.1 deraadt 2321: or crash if their input contains an 8-bit character.
1.16 jmc 2322: .Pp
2323: Note, however, that unless generating a scanner using the
2324: .Fl Cf
1.1 deraadt 2325: or
1.16 jmc 2326: .Fl CF
1.1 deraadt 2327: table compression options, use of
1.16 jmc 2328: .Fl 7
2329: will save only a small amount of table space,
2330: and make the scanner considerably less portable.
2331: .Nm flex Ns 's
2332: default behavior is to generate an 8-bit scanner unless
2333: .Fl Cf
2334: or
2335: .Fl CF
2336: is specified, in which case
2337: .Nm
2338: defaults to generating 7-bit scanners unless it was
2339: configured to generate 8-bit scanners
2340: (as will often be the case with non-USA sites).
2341: It is possible to tell whether
2342: .Nm
2343: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2344: .Fl v
2345: output as described below.
2346: .Pp
2347: Note that if
2348: .Fl Cfe
2349: or
2350: .Fl CFe
2351: is used
2352: (the table compression options, but also using equivalence classes as
2353: discussed below),
2354: .Nm
2355: still defaults to generating an 8-bit scanner,
2356: since usually with these compression options full 8-bit tables
1.1 deraadt 2357: are not much more expensive than 7-bit tables.
1.16 jmc 2358: .It Fl 8
2359: Instructs
2360: .Nm
1.1 deraadt 2361: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 jmc 2362: characters.
2363: This flag is only needed for scanners generated using
2364: .Fl Cf
1.1 deraadt 2365: or
1.16 jmc 2366: .Fl CF ,
2367: as otherwise
2368: .Nm
2369: defaults to generating an 8-bit scanner anyway.
2370: .Pp
1.1 deraadt 2371: See the discussion of
1.16 jmc 2372: .Fl 7
2373: above for
2374: .Nm flex Ns 's
2375: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2376: .It Fl B
2377: Instructs
2378: .Nm
2379: to generate a
2380: .Em batch
2381: scanner, the opposite of
2382: .Em interactive
2383: scanners generated by
2384: .Fl I
2385: .Pq see below .
2386: In general,
2387: .Fl B
2388: is used when the scanner will never be used interactively,
2389: and you want to squeeze a little more performance out of it.
2390: If the aim is instead to squeeze out a lot more performance,
2391: use the
2392: .Fl Cf
2393: or
2394: .Fl CF
2395: options
2396: .Pq discussed below ,
2397: which turn on
2398: .Fl B
2399: automatically anyway.
2400: .It Fl b
2401: Generate backing-up information to
2402: .Pa lex.backup .
2403: This is a list of scanner states which require backing up
2404: and the input characters on which they do so.
2405: By adding rules one can remove backing-up states.
2406: If all backing-up states are eliminated and
2407: .Fl Cf
2408: or
2409: .Fl CF
2410: is used, the generated scanner will run faster (see the
2411: .Fl p
2412: flag).
2413: Only users who wish to squeeze every last cycle out of their
2414: scanners need worry about this option.
2415: (See the section on
2416: .Sx PERFORMANCE CONSIDERATIONS
2417: below.)
2418: .It Fl C Ns Op Cm aeFfmr
2419: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2420: between small scanners and fast scanners.
1.16 jmc 2421: .Bl -tag -width Ds
2422: .It Fl Ca
2423: Instructs
2424: .Nm
2425: to trade off larger tables in the generated scanner for faster performance
2426: because the elements of the tables are better aligned for memory access
2427: and computation.
2428: On some
2429: .Tn RISC
2430: architectures, fetching and manipulating longwords is more efficient
2431: than with smaller-sized units such as shortwords.
2432: This option can double the size of the tables used by the scanner.
2433: .It Fl Ce
2434: Directs
2435: .Nm
1.1 deraadt 2436: to construct
1.16 jmc 2437: .Em equivalence classes ,
2438: i.e., sets of characters which have identical lexical properties
2439: (for example, if the only appearance of digits in the
2440: .Nm
1.1 deraadt 2441: input is in the character class
1.16 jmc 2442: .Qq [0-9]
2443: then the digits
2444: .Sq 0 ,
2445: .Sq 1 ,
2446: .Sq ... ,
2447: .Sq 9
2448: will all be put in the same equivalence class).
2449: Equivalence classes usually give dramatic reductions in the final
2450: table/object file sizes
2451: .Pq typically a factor of 2\-5
2452: and are pretty cheap performance-wise
2453: .Pq one array look-up per character scanned .
2454: .It Fl CF
2455: Specifies that the alternate fast scanner representation
2456: (described below under the
2457: .Fl F
2458: option)
2459: should be used.
2460: This option cannot be used with
2461: .Fl + .
2462: .It Fl Cf
2463: Specifies that the
2464: .Em full
2465: scanner tables should be generated \-
2466: .Nm
2467: should not compress the tables by taking advantage of
2468: similar transition functions for different states.
2469: .It Fl \&Cm
2470: Directs
2471: .Nm
1.1 deraadt 2472: to construct
1.16 jmc 2473: .Em meta-equivalence classes ,
2474: which are sets of equivalence classes
2475: (or characters, if equivalence classes are not being used)
2476: that are commonly used together.
2477: Meta-equivalence classes are often a big win when using compressed tables,
2478: but they have a moderate performance impact
2479: (one or two
2480: .Qq if
2481: tests and one array look-up per character scanned).
2482: .It Fl Cr
2483: Causes the generated scanner to
2484: .Em bypass
2485: use of the standard I/O library
2486: .Pq stdio
2487: for input.
2488: Instead of calling
2489: .Xr fread 3
1.1 deraadt 2490: or
1.16 jmc 2491: .Xr getc 3 ,
1.1 deraadt 2492: the scanner will use the
1.16 jmc 2493: .Xr read 2
2494: system call,
2495: resulting in a performance gain which varies from system to system,
2496: but in general is probably negligible unless
2497: .Fl Cf
1.1 deraadt 2498: or
1.16 jmc 2499: .Fl CF
2500: is being used.
1.1 deraadt 2501: Using
1.16 jmc 2502: .Fl Cr
2503: can cause strange behavior if, for example, reading from
2504: .Fa yyin
2505: using stdio prior to calling the scanner
2506: (because the scanner will miss whatever text previous reads left
2507: in the stdio input buffer).
2508: .Pp
2509: .Fl Cr
2510: has no effect if
2511: .Dv YY_INPUT
2512: is defined
2513: (see
2514: .Sx THE GENERATED SCANNER
2515: above).
2516: .El
2517: .Pp
1.1 deraadt 2518: A lone
1.16 jmc 2519: .Fl C
1.1 deraadt 2520: specifies that the scanner tables should be compressed but neither
2521: equivalence classes nor meta-equivalence classes should be used.
1.16 jmc 2522: .Pp
1.1 deraadt 2523: The options
1.16 jmc 2524: .Fl Cf
1.1 deraadt 2525: or
1.16 jmc 2526: .Fl CF
1.1 deraadt 2527: and
1.16 jmc 2528: .Fl \&Cm
2529: do not make sense together \- there is no opportunity for meta-equivalence
2530: classes if the table is not being compressed.
2531: Otherwise the options may be freely mixed, and are cumulative.
2532: .Pp
1.1 deraadt 2533: The default setting is
1.16 jmc 2534: .Fl Cem
1.1 deraadt 2535: which specifies that
1.16 jmc 2536: .Nm
2537: should generate equivalence classes and meta-equivalence classes.
2538: This setting provides the highest degree of table compression.
2539: It is possible to obtain faster-executing scanners at the cost of
2540: larger tables, with the following generally being true:
2541: .Bd -unfilled -offset indent
2542: slowest & smallest
2543:       -Cem
2544:       -Cm
2545:       -Ce
2546:       -C
2547:       -C{f,F}e
2548:       -C{f,F}
2549:       -C{f,F}a
2550: fastest & largest
2551: .Ed
2552: .Pp
1.1 deraadt 2553: Note that scanners with the smallest tables are usually generated and
1.16 jmc 2554: compiled the quickest,
2555: so during development the default,
2556: maximal compression, is usually best.
2557: .Pp
2558: .Fl Cfe
2559: is often a good compromise between speed and size for production scanners.
2560: .It Fl d
2561: Makes the generated scanner run in debug mode.
2562: Whenever a pattern is recognized and the global
2563: .Fa yy_flex_debug
2564: is non-zero
2565: .Pq which is the default ,
2566: the scanner will write to stderr a line of the form:
2567: .Pp
2568: .D1 --accepting rule at line 53 ("the matched text")
2569: .Pp
2570: The line number refers to the location of the rule in the file
2571: defining the scanner
2572: (i.e., the file that was fed to
2573: .Nm ) .
2574: Messages are also generated when the scanner backs up,
2575: accepts the default rule,
2576: reaches the end of its input buffer
2577: (or encounters a NUL;
2578: at this point, the two look the same as far as the scanner's concerned),
2579: or reaches an end-of-file.
2580: .It Fl F
2581: Specifies that the fast scanner table representation should be used
2582: .Pq and stdio bypassed .
2583: This representation is about as fast as the full table representation
2584: .Pq Fl f ,
2585: and for some sets of patterns will be considerably smaller
2586: .Pq and for others, larger .
2587: In general, if the pattern set contains both
2588: .Qq keywords
2589: and a catch-all,
2590: .Qq identifier
2591: rule, such as in the set:
2592: .Bd -unfilled -offset indent
2593: "case" return TOK_CASE;
2594: "switch" return TOK_SWITCH;
2595: \&...
2596: "default" return TOK_DEFAULT;
2597: [a-z]+ return TOK_ID;
2598: .Ed
2599: .Pp
2600: then it's better to use the full table representation.
2601: If only the
2602: .Qq identifier
2603: rule is present and a hash table or some such is used to detect the keywords,
2604: it's better to use
2605: .Fl F .
2606: .Pp
2607: This option is equivalent to
2608: .Fl CFr
2609: .Pq see above .
2610: It cannot be used with
2611: .Fl + .
2612: .It Fl f
2613: Specifies
2614: .Em fast scanner .
2615: No table compression is done and stdio is bypassed.
2616: The result is large but fast.
2617: This option is equivalent to
2618: .Fl Cfr
2619: .Pq see above .
2620: .It Fl h
2621: Generates a help summary of
2622: .Nm flex Ns 's
2623: options to stdout and then exits.
2624: .Fl ?\&
2625: and
2626: .Fl Fl help
2627: are synonyms for
2628: .Fl h .
2629: .It Fl I
2630: Instructs
2631: .Nm
2632: to generate an
2633: .Em interactive
2634: scanner.
2635: An interactive scanner is one that only looks ahead to decide
2636: what token has been matched if it absolutely must.
2637: It turns out that always looking one extra character ahead,
2638: even if the scanner has already seen enough text
2639: to disambiguate the current token, is a bit faster than
2640: only looking ahead when necessary.
2641: But scanners that always look ahead give dreadful interactive performance;
2642: for example, when a user types a newline,
2643: it is not recognized as a newline token until they enter
2644: .Em another
2645: token, which often means typing in another whole line.
2646: .Pp
2647: .Nm
2648: scanners default to
2649: .Em interactive
2650: unless
2651: .Fl Cf
2652: or
2653: .Fl CF
2654: table-compression options are specified
2655: .Pq see above .
2656: That's because if high performance is most important,
2657: one of these options should be used,
2658: so if they weren't,
2659: .Nm
1.24 sobrado 2660: assumes it is preferable to trade off a bit of run-time performance for
1.16 jmc 2661: intuitive interactive behavior.
2662: Note also that
2663: .Fl I
2664: cannot be used in conjunction with
2665: .Fl Cf
2666: or
2667: .Fl CF .
2668: Thus, this option is not really needed; it is on by default for all those
2669: cases in which it is allowed.
2670: .Pp
2671: A scanner can be forced to not be interactive by using
2672: .Fl B
2673: .Pq see above .
2674: .It Fl i
2675: Instructs
2676: .Nm
2677: to generate a case-insensitive scanner.
2678: The case of letters given in the
2679: .Nm
2680: input patterns will be ignored,
2681: and tokens in the input will be matched regardless of case.
2682: The matched text given in
2683: .Fa yytext
2684: will have the preserved case
2685: .Pq i.e., it will not be folded .
2686: .It Fl L
2687: Instructs
2688: .Nm
2689: not to generate
2690: .Dq #line
2691: directives.
2692: Without this option,
2693: .Nm
2694: peppers the generated scanner with #line directives so error messages
2695: in the actions will be correctly located with respect to either the original
2696: .Nm
2697: input file
2698: (if the errors are due to code in the input file),
2699: or
2700: .Pa lex.yy.c
2701: (if the errors are
2702: .Nm flex Ns 's
2703: fault \- these sorts of errors should be reported to the email address
2704: given below).
2705: .It Fl l
1.36 schwarze 2706: Turns on maximum compatibility with the original
2707: .At
1.16 jmc 2708: .Nm lex
2709: implementation.
2710: Note that this does not mean full compatibility.
2711: Use of this option costs a considerable amount of performance,
2712: and it cannot be used with the
2713: .Fl + , f , F , Cf ,
2714: or
2715: .Fl CF
2716: options.
2717: For details on the compatibilities it provides, see the section
2718: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
2719: below.
2720: This option also results in the name
2721: .Dv YY_FLEX_LEX_COMPAT
2722: being #define'd in the generated scanner.
2723: .It Fl n
2724: Another do-nothing, deprecated option included only for
2725: .Tn POSIX
2726: compliance.
2727: .It Fl o Ns Ar output
2728: Directs
2729: .Nm
2730: to write the scanner to the file
2731: .Ar output
1.1 deraadt 2732: instead of
1.16 jmc 2733: .Pa lex.yy.c .
2734: If
2735: .Fl o
2736: is combined with the
2737: .Fl t
2738: option, then the scanner is written to stdout but its
2739: .Dq #line
2740: directives
2741: (see the
2742: .Fl L
2743: option above)
2744: refer to the file
2745: .Ar output .
2746: .It Fl P Ns Ar prefix
2747: Changes the default
2748: .Qq yy
1.1 deraadt 2749: prefix used by
1.16 jmc 2750: .Nm
1.6 aaron 2751: for all globally visible variable and function names to instead be
1.16 jmc 2752: .Ar prefix .
1.1 deraadt 2753: For example,
1.16 jmc 2754: .Fl P Ns Ar foo
1.1 deraadt 2755: changes the name of
1.16 jmc 2756: .Fa yytext
1.1 deraadt 2757: to
1.16 jmc 2758: .Fa footext .
1.1 deraadt 2759: It also changes the name of the default output file from
1.16 jmc 2760: .Pa lex.yy.c
1.1 deraadt 2761: to
1.16 jmc 2762: .Pa lex.foo.c .
1.1 deraadt 2763: Here are all of the names affected:
1.16 jmc 2764: .Bd -unfilled -offset indent
2765: yy_create_buffer
2766: yy_delete_buffer
2767: yy_flex_debug
2768: yy_init_buffer
2769: yy_flush_buffer
2770: yy_load_buffer_state
2771: yy_switch_to_buffer
2772: yyin
2773: yyleng
2774: yylex
2775: yylineno
2776: yyout
2777: yyrestart
2778: yytext
2779: yywrap
2780: .Ed
2781: .Pp
2782: (If using a C++ scanner, then only
2783: .Fa yywrap
1.1 deraadt 2784: and
1.16 jmc 2785: .Fa yyFlexLexer
1.1 deraadt 2786: are affected.)
1.16 jmc 2787: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2788: and functions using either version of their name; but externally, they
2789: have the modified name.
1.16 jmc 2790: .Pp
2791: This option allows multiple
2792: .Nm
2793: programs to be easily linked together into the same executable.
2794: Note, though, that using this option also renames
2795: .Fn yywrap ,
2796: so now either an
2797: .Pq appropriately named
2798: version of the routine for the scanner must be supplied, or
2799: .Dq %option noyywrap
2800: must be used, as linking with
2801: .Fl lfl
2802: no longer provides one by default.
2803: .It Fl p
2804: Generates a performance report to stderr.
2805: The report consists of comments regarding features of the
2806: .Nm
2807: input file which will cause a serious loss of performance in the resulting
2808: scanner.
2809: If the flag is specified twice,
2810: comments regarding features that lead to minor performance losses
2811: will also be reported.
2812: .Pp
2813: Note that the use of
2814: .Em REJECT ,
2815: .Dq %option yylineno ,
2816: and variable trailing context
2817: (see the
2818: .Sx BUGS
2819: section below)
2820: entails a substantial performance penalty; use of
2821: .Fn yymore ,
2822: the
2823: .Sq ^
2824: operator, and the
2825: .Fl I
2826: flag entail minor performance penalties.
2827: .It Fl S Ns Ar skeleton
2828: Overrides the default skeleton file from which
2829: .Nm
2830: constructs its scanners.
2831: This option is needed only for
2832: .Nm
1.1 deraadt 2833: maintenance or development.
1.16 jmc 2834: .It Fl s
2835: Causes the default rule
2836: .Pq that unmatched scanner input is echoed to stdout
2837: to be suppressed.
2838: If the scanner encounters input that does not
2839: match any of its rules, it aborts with an error.
2840: This option is useful for finding holes in a scanner's rule set.
2841: .It Fl T
2842: Makes
2843: .Nm
2844: run in
2845: .Em trace
2846: mode.
2847: It will generate a lot of messages to stderr concerning
2848: the form of the input and the resultant non-deterministic and deterministic
2849: finite automata.
2850: This option is mostly for use in maintaining
2851: .Nm .
2852: .It Fl t
2853: Instructs
2854: .Nm
2855: to write the scanner it generates to standard output instead of
2856: .Pa lex.yy.c .
2857: .It Fl V
2858: Prints the version number to stdout and exits.
2859: .Fl Fl version
2860: is a synonym for
2861: .Fl V .
2862: .It Fl v
2863: Specifies that
2864: .Nm
2865: should write to stderr
2866: a summary of statistics regarding the scanner it generates.
2867: Most of the statistics are meaningless to the casual
2868: .Nm
2869: user, but the first line identifies the version of
2870: .Nm
2871: (same as reported by
2872: .Fl V ) ,
2873: and the next line the flags used when generating the scanner,
2874: including those that are on by default.
2875: .It Fl w
2876: Suppresses warning messages.
2877: .It Fl +
2878: Specifies that
2879: .Nm
2880: should generate a C++ scanner class.
2881: See the section on
2882: .Sx GENERATING C++ SCANNERS
2883: below for details.
2884: .El
2885: .Pp
2886: .Nm
1.1 deraadt 2887: also provides a mechanism for controlling options within the
1.16 jmc 2888: scanner specification itself, rather than from the
2889: .Nm
1.33 jmc 2890: command line.
1.1 deraadt 2891: This is done by including
1.16 jmc 2892: .Dq %option
1.1 deraadt 2893: directives in the first section of the scanner specification.
1.16 jmc 2894: Multiple options can be specified with a single
2895: .Dq %option
2896: directive, and multiple directives in the first section of the
2897: .Nm
2898: input file.
2899: .Pp
2900: Most options are given simply as names, optionally preceded by the word
2901: .Qq no
2902: .Pq with no intervening whitespace
2903: to negate their meaning.
2904: A number are equivalent to
2905: .Nm
2906: flags or their negation:
2907: .Bd -unfilled -offset indent
2908: 7bit            -7 option
2909: 8bit            -8 option
2910: align           -Ca option
2911: backup          -b option
2912: batch           -B option
2913: c++             -+ option
2914:
2915: caseful or
2916: case-sensitive  opposite of -i (default)
2917:
2918: case-insensitive or
2919: caseless        -i option
2920:
2921: debug           -d option
2922: default         opposite of -s option
2923: ecs             -Ce option
2924: fast            -F option
2925: full            -f option
2926: interactive     -I option
2927: lex-compat      -l option
2928: meta-ecs        -Cm option
2929: perf-report     -p option
2930: read            -Cr option
2931: stdout          -t option
2932: verbose         -v option
2933: warn            opposite of -w option
2934:                 (use "%option nowarn" for -w)
2935:
2936: array           equivalent to "%array"
2937: pointer         equivalent to "%pointer" (default)
2938: .Ed
2939: .Pp
2940: Some %option's provide features otherwise not available:
2941: .Bl -tag -width Ds
2942: .It always-interactive
2943: Instructs
2944: .Nm
2945: to generate a scanner which always considers its input
2946: .Qq interactive .
2947: Normally, on each new input file the scanner calls
2948: .Fn isatty
2949: in an attempt to determine whether the scanner's input source is interactive
2950: and thus should be read a character at a time.
2951: When this option is used, however, no such call is made.
2952: .It main
2953: Directs
2954: .Nm
2955: to provide a default
2956: .Fn main
1.1 deraadt 2957: program for the scanner, which simply calls
1.16 jmc 2958: .Fn yylex .
1.1 deraadt 2959: This option implies
1.16 jmc 2960: .Dq noyywrap
2961: .Pq see below .
2962: .It never-interactive
2963: Instructs
2964: .Nm
2965: to generate a scanner which never considers its input
2966: .Qq interactive
2967: (again, no call made to
2968: .Fn isatty ) .
1.1 deraadt 2969: This is the opposite of
1.16 jmc 2970: .Dq always-interactive .
2971: .It stack
2972: Enables the use of start condition stacks
2973: (see
2974: .Sx START CONDITIONS
2975: above).
2976: .It stdinit
2977: If set (i.e.,
2978: .Dq %option stdinit ) ,
1.1 deraadt 2979: initializes
1.16 jmc 2980: .Fa yyin
1.1 deraadt 2981: and
1.16 jmc 2982: .Fa yyout
2983: to stdin and stdout, instead of the default of
2984: .Dq nil .
1.1 deraadt 2985: Some existing
1.16 jmc 2986: .Nm lex
2987: programs depend on this behavior, even though it is not compliant with ANSI C,
2988: which does not require stdin and stdout to be compile-time constant.
2989: .It yylineno
2990: Directs
2991: .Nm
1.1 deraadt 2992: to generate a scanner that maintains the number of the current line
2993: read from its input in the global variable
1.16 jmc 2994: .Fa yylineno .
1.1 deraadt 2995: This option is implied by
1.16 jmc 2996: .Dq %option lex-compat .
2997: .It yywrap
2998: If unset (i.e.,
2999: .Dq %option noyywrap ) ,
1.1 deraadt 3000: makes the scanner not call
1.16 jmc 3001: .Fn yywrap
3002: upon an end-of-file, but simply assume that there are no more files to scan
3003: (until the user points
3004: .Fa yyin
1.1 deraadt 3005: at a new file and calls
1.16 jmc 3006: .Fn yylex
1.1 deraadt 3007: again).
1.16 jmc 3008: .El
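.Pp
As an illustration of these directives,
the following is a sketch of a complete line-counting scanner that relies on
.Dq %option main
to supply
.Fn main
and, by implication,
.Dq noyywrap .
The counter
.Fa nlines
is a hypothetical name chosen for this example:
.Bd -literal -offset indent
%option main
%{
#include <stdio.h>

static int nlines;	/* hypothetical line counter */
%}
%%
\en	++nlines;
\&.	;	/* discard everything else */
<<EOF>>	{ printf("%d\en", nlines); yyterminate(); }
.Ed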
3009: .Pp
3010: .Nm
3011: scans rule actions to determine whether the
3012: .Em REJECT
3013: or
3014: .Fn yymore
3015: features are being used.
3016: The
3017: .Dq reject
1.1 deraadt 3018: and
1.16 jmc 3019: .Dq yymore
3020: options are available to override its decision as to whether to use the
1.1 deraadt 3021: options, either by setting them (e.g.,
1.16 jmc 3022: .Dq %option reject )
3023: to indicate the feature is indeed used,
3024: or unsetting them to indicate it actually is not used
1.1 deraadt 3025: (e.g.,
1.16 jmc 3026: .Dq %option noyymore ) .
3027: .Pp
3028: Three options take string-delimited values, offset with
3029: .Sq = :
3030: .Pp
3031: .D1 %option outfile="ABC"
3032: .Pp
1.1 deraadt 3033: is equivalent to
1.16 jmc 3034: .Fl o Ns Ar ABC ,
1.1 deraadt 3035: and
1.16 jmc 3036: .Pp
3037: .D1 %option prefix="XYZ"
3038: .Pp
1.1 deraadt 3039: is equivalent to
1.16 jmc 3040: .Fl P Ns Ar XYZ .
1.1 deraadt 3041: Finally,
1.16 jmc 3042: .Pp
3043: .D1 %option yyclass="foo"
3044: .Pp
3045: only applies when generating a C++ scanner
3046: .Pf ( Fl +
3047: option).
3048: It informs
3049: .Nm
3050: that
3051: .Dq foo
3052: has been derived as a subclass of yyFlexLexer, so
3053: .Nm
3054: will place actions in the member function
3055: .Dq foo::yylex()
1.1 deraadt 3056: instead of
1.16 jmc 3057: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3058: It also generates a
1.16 jmc 3059: .Dq yyFlexLexer::yylex()
1.1 deraadt 3060: member function that emits a run-time error (by invoking
1.16 jmc 3061: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3062: if called.
1.16 jmc 3063: See
3064: .Sx GENERATING C++ SCANNERS ,
3065: below, for additional information.
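.Pp
As a sketch of the string-valued options in use,
the following specification writes its scanner to the hypothetical file
.Pa scan.c
and renames the globally visible symbols with the hypothetical prefix
.Qq foo ,
so the entry point becomes
.Fn foolex .
Because the prefix also renames
.Fn yywrap ,
.Dq noyywrap
is given rather than supplying a renamed routine:
.Bd -literal -offset indent
%option outfile="scan.c" prefix="foo" noyywrap
%%
[0-9]+	return 1;	/* report each number to the caller */
\&.|\en	;		/* skip everything else */
%%
int
main(void)
{
	/* foolex() returns 0 once end-of-file is reached. */
	while (foolex() != 0)
		continue;
	return 0;
}
.Ed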
3066: .Pp
3067: A number of options are available for
1.32 jmc 3068: lint
1.16 jmc 3069: purists who want to suppress the appearance of unneeded routines
3070: in the generated scanner.
3071: Each of the following, if unset
1.1 deraadt 3072: (e.g.,
1.16 jmc 3073: .Dq %option nounput ) ,
3074: results in the corresponding routine not appearing in the generated scanner:
3075: .Bd -unfilled -offset indent
3076: input, unput
3077: yy_push_state, yy_pop_state, yy_top_state
3078: yy_scan_buffer, yy_scan_bytes, yy_scan_string
3079: .Ed
3080: .Pp
1.1 deraadt 3081: (though
1.16 jmc 3082: .Fn yy_push_state
3083: and friends won't appear anyway unless
3084: .Dq %option stack
3085: is being used).
3086: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3087: The main design goal of
1.16 jmc 3088: .Nm
3089: is that it generate high-performance scanners.
3090: It has been optimized for dealing well with large sets of rules.
3091: Aside from the effects on scanner speed of the table compression
3092: .Fl C
1.1 deraadt 3093: options outlined above,
1.16 jmc 3094: there are a number of options/actions which degrade performance.
3095: These are, from most expensive to least:
3096: .Bd -unfilled -offset indent
3097: REJECT
3098: %option yylineno
3099: arbitrary trailing context
3100:
3101: pattern sets that require backing up
3102: %array
3103: %option interactive
3104: %option always-interactive
3105:
3106: \&'^' beginning-of-line operator
3107: yymore()
3108: .Ed
3109: .Pp
3110: with the first three all being quite expensive
3111: and the last two being quite cheap.
3112: Note also that
3113: .Fn unput
3114: is implemented as a routine call that potentially does quite a bit of work,
3115: while
3116: .Fn yyless
3117: is a quite-cheap macro; so if just putting back some excess text,
3118: use
3119: .Fn yyless .
3120: .Pp
3121: .Em REJECT
1.1 deraadt 3122: should be avoided at all costs when performance is important.
3123: It is a particularly expensive option.
1.16 jmc 3124: .Pp
1.1 deraadt 3125: Getting rid of backing up is messy and often may be an enormous
1.16 jmc 3126: amount of work for a complicated scanner.
3127: In principle, one begins by using the
3128: .Fl b
1.1 deraadt 3129: flag to generate a
1.16 jmc 3130: .Pa lex.backup
3131: file.
3132: For example, on the input
3133: .Bd -literal -offset indent
3134: %%
3135: foo return TOK_KEYWORD;
3136: foobar return TOK_KEYWORD;
3137: .Ed
3138: .Pp
1.1 deraadt 3139: the file looks like:
1.16 jmc 3140: .Bd -literal -offset indent
3141: State #6 is non-accepting -
3142: associated rule line numbers:
3143: 2 3
3144: out-transitions: [ o ]
3145: jam-transitions: EOF [ \e001-n p-\e177 ]
3146:
3147: State #8 is non-accepting -
3148: associated rule line numbers:
3149: 3
3150: out-transitions: [ a ]
3151: jam-transitions: EOF [ \e001-` b-\e177 ]
3152:
3153: State #9 is non-accepting -
3154: associated rule line numbers:
3155: 3
3156: out-transitions: [ r ]
3157: jam-transitions: EOF [ \e001-q s-\e177 ]
3158:
3159: Compressed tables always back up.
3160: .Ed
3161: .Pp
1.1 deraadt 3162: The first few lines tell us that there's a scanner state in
1.16 jmc 3163: which it can make a transition on an
3164: .Sq o
3165: but not on any other character,
3166: and that in that state the currently scanned text does not match any rule.
3167: The state occurs when trying to match the rules found
1.1 deraadt 3168: at lines 2 and 3 in the input file.
1.16 jmc 3169: If the scanner is in that state and then reads something other than an
3170: .Sq o ,
3171: it will have to back up to find a rule which is matched.
3172: With a bit of headscratching one can see that this must be the
3173: state it's in when it has seen
3174: .Sq fo .
3175: When this has happened, if anything other than another
3176: .Sq o
3177: is seen, the scanner will have to back up to simply match the
3178: .Sq f
3179: .Pq by the default rule .
3180: .Pp
3181: The comment regarding State #8 indicates there's a problem when
3182: .Qq foob
3183: has been scanned.
3184: Indeed, on any character other than an
3185: .Sq a ,
3186: the scanner will have to back up to accept
3187: .Qq foo .
3188: Similarly, the comment for State #9 concerns when
3189: .Qq fooba
3190: has been scanned and an
3191: .Sq r
3192: does not follow.
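.Pp
The backing up described above can be sketched with a small C model.
This is an illustration only, not code that flex generates:
the names scan_once, backed_up and struct match are hypothetical,
and the model hard-codes the two keyword rules plus the default rule.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one simulated match attempt (hypothetical model type). */
struct match {
    size_t accepted;  /* length of the longest text that matched a rule */
    size_t examined;  /* characters consumed by successful transitions */
};

/*
 * Model the rules "foo", "foobar" and the default rule (any single
 * character).  The scanner keeps reading while the text is still a
 * prefix of "foobar" and remembers the last accepting position.
 */
static struct match scan_once(const char *input)
{
    static const char longest[] = "foobar";
    struct match m = { 0, 0 };
    size_t i;

    assert(input != NULL);
    if (input[0] == '\0')
        return m;                 /* end of input: nothing to match */

    /* The default rule accepts any single character. */
    m.accepted = 1;
    m.examined = 1;
    if (input[0] != longest[0])
        return m;

    for (i = 1; longest[i] != '\0' && input[i] == longest[i]; i++) {
        m.examined = i + 1;
        if (i + 1 == 3 || i + 1 == 6)   /* "foo" and "foobar" accept */
            m.accepted = i + 1;
    }
    return m;
}

/* Number of characters the scanner must give back after jamming. */
static size_t backed_up(const char *input)
{
    struct match m = scan_once(input);

    return m.examined - m.accepted;
}
```

In this model the input foobaz is followed for five characters,
the last accepting position is after foo, and two characters are given back;
the input fox gives back one character, leaving the f for the default rule.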
3193: .Pp
1.1 deraadt 3194: The final comment reminds us that there's no point going to
1.16 jmc 3195: all the trouble of removing backing up from the rules unless we're using
3196: .Fl Cf
1.1 deraadt 3197: or
1.16 jmc 3198: .Fl CF ,
1.1 deraadt 3199: since there's no performance gain doing so with compressed scanners.
1.16 jmc 3200: .Pp
3201: The way to remove the backing up is to add
3202: .Qq error
3203: rules:
3204: .Bd -literal -offset indent
3205: %%
3206: foo return TOK_KEYWORD;
3207: foobar return TOK_KEYWORD;
3208:
3209: fooba |
3210: foob |
3211: fo {
3212: /* false alarm, not really a keyword */
3213: return TOK_ID;
3214: }
3215: .Ed
3216: .Pp
3217: Eliminating backing up among a list of keywords can also be done using a
3218: .Qq catch-all
3219: rule:
3220: .Bd -literal -offset indent
3221: %%
3222: foo return TOK_KEYWORD;
3223: foobar return TOK_KEYWORD;
3224:
3225: [a-z]+ return TOK_ID;
3226: .Ed
3227: .Pp
1.1 deraadt 3228: This is usually the best solution when appropriate.
1.16 jmc 3229: .Pp
1.1 deraadt 3230: Backing up messages tend to cascade.
1.16 jmc 3231: With a complicated set of rules it's not uncommon to get hundreds of messages.
3232: If one can decipher them, though,
3233: it often only takes a dozen or so rules to eliminate the backing up
3234: (though it's easy to make a mistake and have an error rule accidentally match
3235: a valid token; a possible future
3236: .Nm
1.1 deraadt 3237: feature will be to automatically add rules to eliminate backing up).
1.16 jmc 3238: .Pp
3239: It's important to keep in mind that the benefits of eliminating
3240: backing up are gained only if
3241: .Em every
3242: instance of backing up is eliminated.
3243: Leaving just one gains nothing.
3244: .Pp
3245: .Em Variable
3246: trailing context
3247: (where both the leading and trailing parts do not have a fixed length)
3248: entails almost the same performance loss as
3249: .Em REJECT
3250: .Pq i.e., substantial .
3251: So when possible a rule like:
3252: .Bd -literal -offset indent
3253: %%
3254: mouse|rat/(cat|dog) run();
3255: .Ed
3256: .Pp
1.1 deraadt 3257: is better written:
1.16 jmc 3258: .Bd -literal -offset indent
3259: %%
3260: mouse/cat|dog run();
3261: rat/cat|dog run();
3262: .Ed
3263: .Pp
1.1 deraadt 3264: or as
1.16 jmc 3265: .Bd -literal -offset indent
3266: %%
3267: mouse|rat/cat run();
3268: mouse|rat/dog run();
3269: .Ed
3270: .Pp
3271: Note that here the special
3272: .Sq |\&
3273: action does not provide any savings, and can even make things worse (see
3274: .Sx BUGS
3275: below).
3276: .Pp
1.1 deraadt 3277: Another area where the user can increase a scanner's performance
1.16 jmc 3278: .Pq and one that's easier to implement
3279: arises from the fact that the longer the tokens matched,
3280: the faster the scanner will run.
1.1 deraadt 3281: This is because with long tokens the processing of most input
1.16 jmc 3282: characters takes place in the
3283: .Pq short
3284: inner scanning loop, and does not often have to go through the additional work
3285: of setting up the scanning environment (e.g.,
3286: .Fa yytext )
3287: for the action.
3288: Recall the scanner for C comments:
3289: .Bd -literal -offset indent
3290: %x comment
3291: %%
3292: int line_num = 1;
3293:
3294: "/*" BEGIN(comment);
3295:
3296: <comment>[^*\en]*
3297: <comment>"*"+[^*/\en]*
3298: <comment>\en ++line_num;
3299: <comment>"*"+"/" BEGIN(INITIAL);
3300: .Ed
3301: .Pp
1.1 deraadt 3302: This could be sped up by writing it as:
1.16 jmc 3303: .Bd -literal -offset indent
3304: %x comment
3305: %%
3306: int line_num = 1;
3307:
3308: "/*" BEGIN(comment);
3309:
3310: <comment>[^*\en]*
3311: <comment>[^*\en]*\en ++line_num;
3312: <comment>"*"+[^*/\en]*
3313: <comment>"*"+[^*/\en]*\en ++line_num;
3314: <comment>"*"+"/" BEGIN(INITIAL);
3315: .Ed
3316: .Pp
3317: Now instead of each newline requiring the processing of another action,
3318: recognizing the newlines is
3319: .Qq distributed
3320: over the other rules to keep the matched text as long as possible.
3321: Note that adding rules does
3322: .Em not
3323: slow down the scanner!
3324: The speed of the scanner is independent of the number of rules or
3325: (modulo the considerations given at the beginning of this section)
3326: how complicated the rules are with regard to operators such as
3327: .Sq *
3328: and
3329: .Sq |\& .
3330: .Pp
3331: A final example in speeding up a scanner:
3332: scan through a file containing identifiers and keywords, one per line
3333: and with no other extraneous characters, and recognize all the keywords.
3334: A natural first approach is:
3335: .Bd -literal -offset indent
3336: %%
3337: asm |
3338: auto |
3339: break |
3340: \&... etc ...
3341: volatile |
3342: while /* it's a keyword */
3343:
3344: \&.|\en /* it's not a keyword */
3345: .Ed
3346: .Pp
1.1 deraadt 3347: To eliminate the backing up, introduce a catch-all rule:
1.16 jmc 3348: .Bd -literal -offset indent
3349: %%
3350: asm |
3351: auto |
3352: break |
3353: \&... etc ...
3354: volatile |
3355: while /* it's a keyword */
3356:
3357: [a-z]+ |
3358: \&.|\en /* it's not a keyword */
3359: .Ed
3360: .Pp
1.1 deraadt 3361: Now, if it's guaranteed that there's exactly one word per line,
3362: then we can reduce the total number of matches by a half by
1.16 jmc 3363: merging in the recognition of newlines with that of the other tokens:
3364: .Bd -literal -offset indent
3365: %%
3366: asm\en |
3367: auto\en |
3368: break\en |
3369: \&... etc ...
3370: volatile\en |
3371: while\en /* it's a keyword */
3372:
3373: [a-z]+\en |
3374: \&.|\en /* it's not a keyword */
3375: .Ed
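.Pp
The saving in the number of matches can be illustrated by counting tokens
in a small C model.
This is a sketch under stated assumptions, not flex output:
count_matches is a hypothetical helper, and the input is assumed to consist
of lowercase words, each followed by a newline.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Count the matches a scanner would perform on the input.
 * With merge_newline == 0 a word and the newline after it are separate
 * tokens; otherwise the newline is matched as part of the word's token.
 */
static size_t count_matches(const char *input, int merge_newline)
{
    size_t n = 0;

    assert(input != NULL);
    while (*input != '\0') {
        if (islower((unsigned char)*input)) {
            while (islower((unsigned char)*input))
                input++;              /* the [a-z]+ part */
            if (merge_newline && *input == '\n')
                input++;              /* newline merged into the token */
        } else {
            input++;                  /* any other single character */
        }
        n++;
    }
    return n;
}
```

For three words, each on its own line, the model performs six matches
when newlines are separate tokens and three when they are merged.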
3376: .Pp
3377: One has to be careful here,
3378: as we have now reintroduced backing up into the scanner.
3379: In particular, while we know that there will never be any characters
3380: in the input stream other than letters or newlines,
3381: .Nm
1.1 deraadt 3382: can't figure this out, and it will plan for possibly needing to back up
1.16 jmc 3383: when it has scanned a token like
3384: .Qq auto
3385: and then the next character is something other than a newline or a letter.
3386: Previously it would then just match the
3387: .Qq auto
3388: rule and be done, but now it has no
3389: .Qq auto
3390: rule, only an
3391: .Qq auto\en
3392: rule.
3393: To eliminate the possibility of backing up,
1.40 jmc 3394: we could either duplicate all rules but without final newlines or,
1.1 deraadt 3395: since we never expect to encounter such an input and therefore don't
1.16 jmc 3396: care how it's classified, we can introduce one more catch-all rule,
3397: this one not including a newline:
3398: .Bd -literal -offset indent
3399: %%
3400: asm\en |
3401: auto\en |
3402: break\en |
3403: \&... etc ...
3404: volatile\en |
3405: while\en /* it's a keyword */
3406:
3407: [a-z]+\en |
3408: [a-z]+ |
3409: \&.|\en /* it's not a keyword */
3410: .Ed
3411: .Pp
1.1 deraadt 3412: Compiled with
1.16 jmc 3413: .Fl Cf ,
1.1 deraadt 3414: this is about as fast as one can get a
1.16 jmc 3415: .Nm
1.1 deraadt 3416: scanner to go for this particular problem.
1.16 jmc 3417: .Pp
1.1 deraadt 3418: A final note:
1.16 jmc 3419: .Nm
3420: is slow when matching NUL's,
3421: particularly when a token contains multiple NUL's.
3422: It's best to write rules which match short
1.1 deraadt 3423: amounts of text if it's anticipated that the text will often include NUL's.
1.16 jmc 3424: .Pp
1.1 deraadt 3425: One more note regarding performance: as mentioned above in the section
1.16 jmc 3426: .Sx HOW THE INPUT IS MATCHED ,
3427: dynamically resizing
3428: .Fa yytext
1.1 deraadt 3429: to accommodate huge tokens is a slow process because it presently requires that
1.16 jmc 3430: the
3431: .Pq huge
3432: token be rescanned from the beginning.
3433: Thus if performance is vital, it is better to attempt to match
3434: .Qq large
3435: quantities of text but not
3436: .Qq huge
3437: quantities, where the cutoff between the two is at about 8K characters/token.
3438: .Sh GENERATING C++ SCANNERS
3439: .Nm
3440: provides two different ways to generate scanners for use with C++.
3441: The first way is to simply compile a scanner generated by
3442: .Nm
3443: using a C++ compiler instead of a C compiler.
3444: This should not generate any compilation errors
3445: (please report any found to the email address given in the
3446: .Sx AUTHORS
3447: section below).
3448: C++ code can then be used in rule actions instead of C code.
3449: Note that the default input source for scanners remains
3450: .Fa yyin ,
1.1 deraadt 3451: and default echoing is still done to
1.16 jmc 3452: .Fa yyout .
1.1 deraadt 3453: Both of these remain
1.16 jmc 3454: .Fa FILE *
3455: variables and not C++ streams.
3456: .Pp
3457: .Nm
3458: can also be used to generate a C++ scanner class, using the
3459: .Fl +
1.1 deraadt 3460: option (or, equivalently,
1.16 jmc 3461: .Dq %option c++ ) ,
3462: which is automatically specified if the name of the flex executable ends in a
3463: .Sq + ,
3464: such as
3465: .Nm flex++ .
3466: When using this option,
3467: .Nm
3468: defaults to generating the scanner to the file
3469: .Pa lex.yy.cc
1.1 deraadt 3470: instead of
1.16 jmc 3471: .Pa lex.yy.c .
1.1 deraadt 3472: The generated scanner includes the header file
1.38 bentley 3473: .In g++/FlexLexer.h ,
1.1 deraadt 3474: which defines the interface to two C++ classes.
1.16 jmc 3475: .Pp
1.1 deraadt 3476: The first class,
1.16 jmc 3477: .Em FlexLexer ,
3478: provides an abstract base class defining the general scanner class interface.
3479: It provides the following member functions:
3480: .Bl -tag -width Ds
3481: .It const char* YYText()
3482: Returns the text of the most recently matched token, the equivalent of
3483: .Fa yytext .
3484: .It int YYLeng()
3485: Returns the length of the most recently matched token, the equivalent of
3486: .Fa yyleng .
3487: .It int lineno() const
3488: Returns the current input line number
1.1 deraadt 3489: (see
1.16 jmc 3490: .Dq %option yylineno ) ,
3491: or 1 if
3492: .Dq %option yylineno
1.1 deraadt 3493: was not used.
1.16 jmc 3494: .It void set_debug(int flag)
3495: Sets the debugging flag for the scanner, equivalent to assigning to
3496: .Fa yy_flex_debug
3497: (see the
3498: .Sx OPTIONS
3499: section above).
3500: Note that the scanner must be built using
3501: .Dq %option debug
1.1 deraadt 3502: to include debugging information in it.
1.16 jmc 3503: .It int debug() const
3504: Returns the current setting of the debugging flag.
3505: .El
3506: .Pp
1.1 deraadt 3507: Also provided are member functions equivalent to
1.16 jmc 3508: .Fn yy_switch_to_buffer ,
3509: .Fn yy_create_buffer
1.1 deraadt 3510: (though the first argument is an
1.18 espie 3511: .Fa std::istream*
1.1 deraadt 3512: object pointer and not a
1.16 jmc 3513: .Fa FILE* ) ,
3514: .Fn yy_flush_buffer ,
3515: .Fn yy_delete_buffer ,
1.1 deraadt 3516: and
1.16 jmc 3517: .Fn yyrestart
1.10 deraadt 3518: (again, the first argument is an
1.18 espie 3519: .Fa std::istream*
1.1 deraadt 3520: object pointer).
1.16 jmc 3521: .Pp
1.1 deraadt 3522: The second class defined in
1.38 bentley 3523: .In g++/FlexLexer.h
1.1 deraadt 3524: is
1.16 jmc 3525: .Fa yyFlexLexer ,
1.1 deraadt 3526: which is derived from
1.16 jmc 3527: .Fa FlexLexer .
1.1 deraadt 3528: It defines the following additional member functions:
1.16 jmc 3529: .Bl -tag -width Ds
1.18 espie 3530: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16 jmc 3531: Constructs a
3532: .Fa yyFlexLexer
3533: object using the given streams for input and output.
3534: If not specified, the streams default to
3535: .Fa cin
1.1 deraadt 3536: and
1.16 jmc 3537: .Fa cout ,
1.1 deraadt 3538: respectively.
1.16 jmc 3539: .It virtual int yylex()
3540: Performs the same role as
3541: .Fn yylex
1.1 deraadt 3542: does for ordinary flex scanners: it scans the input stream, consuming
1.16 jmc 3543: tokens, until a rule's action returns a value.
3544: If subclass
3545: .Sq S
3546: is derived from
3547: .Fa yyFlexLexer ,
3548: in order to access the member functions and variables of
3549: .Sq S
1.1 deraadt 3550: inside
1.16 jmc 3551: .Fn yylex ,
3552: use
3553: .Dq %option yyclass="S"
1.1 deraadt 3554: to inform
1.16 jmc 3555: .Nm
3556: that the
3557: .Sq S
3558: subclass will be used instead of
3559: .Fa yyFlexLexer .
1.1 deraadt 3560: In this case, rather than generating
1.16 jmc 3561: .Dq yyFlexLexer::yylex() ,
3562: .Nm
1.1 deraadt 3563: generates
1.16 jmc 3564: .Dq S::yylex()
1.1 deraadt 3565: (and also generates a dummy
1.16 jmc 3566: .Dq yyFlexLexer::yylex()
1.1 deraadt 3567: that calls
1.16 jmc 3568: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3569: if called).
1.18 espie 3570: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16 jmc 3571: Reassigns
3572: .Fa yyin
1.1 deraadt 3573: to
1.16 jmc 3574: .Fa new_in
3575: .Pq if non-null
1.1 deraadt 3576: and
1.16 jmc 3577: .Fa yyout
1.1 deraadt 3578: to
1.16 jmc 3579: .Fa new_out
3580: .Pq ditto ,
3581: deleting the previous input buffer if
3582: .Fa yyin
1.1 deraadt 3583: is reassigned.
1.18 espie 3584: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16 jmc 3585: First switches the input streams via
3586: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3587: and then returns the value of
1.16 jmc 3588: .Fn yylex .
3589: .El
3590: .Pp
1.1 deraadt 3591: In addition,
1.16 jmc 3592: .Fa yyFlexLexer
3593: defines the following protected virtual functions which can be redefined
1.1 deraadt 3594: in derived classes to tailor the scanner:
1.16 jmc 3595: .Bl -tag -width Ds
3596: .It virtual int LexerInput(char* buf, int max_size)
3597: Reads up to
3598: .Fa max_size
1.1 deraadt 3599: characters into
1.16 jmc 3600: .Fa buf
3601: and returns the number of characters read.
3602: To indicate end-of-input, return 0 characters.
3603: Note that
3604: .Qq interactive
3605: scanners (see the
3606: .Fl B
1.1 deraadt 3607: and
1.16 jmc 3608: .Fl I
1.1 deraadt 3609: flags) define the macro
1.16 jmc 3610: .Dv YY_INTERACTIVE .
3611: If
3612: .Fn LexerInput
3613: has been redefined, and it's necessary to take different actions depending on
3614: whether or not the scanner might be scanning an interactive input source,
3615: it's possible to test for the presence of this name via
3616: .Dq #ifdef .
3617: .It virtual void LexerOutput(const char* buf, int size)
3618: Writes out
3619: .Fa size
1.1 deraadt 3620: characters from the buffer
1.16 jmc 3621: .Fa buf ,
3622: which, while NUL-terminated, may also contain
3623: .Qq internal
3624: NUL's if the scanner's rules can match text with NUL's in them.
3625: .It virtual void LexerError(const char* msg)
3626: Reports a fatal error message.
3627: The default version of this function writes the message to the stream
3628: .Fa cerr
1.1 deraadt 3629: and exits.
1.16 jmc 3630: .El
3631: .Pp
1.1 deraadt 3632: Note that a
1.16 jmc 3633: .Fa yyFlexLexer
3634: object contains its entire scanning state.
3635: Thus such objects can be used to create reentrant scanners.
3636: Multiple instances of the same
3637: .Fa yyFlexLexer
3638: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3639: in the same program using the
1.16 jmc 3640: .Fl P
1.1 deraadt 3641: option discussed above.
1.16 jmc 3642: .Pp
1.1 deraadt 3643: Finally, note that the
1.16 jmc 3644: .Dq %array
3645: feature is not available to C++ scanner classes;
3646: .Dq %pointer
3647: must be used
3648: .Pq the default .
3649: .Pp
1.1 deraadt 3650: Here is an example of a simple C++ scanner:
1.16 jmc 3651: .Bd -literal -offset indent
3652: // An example of using the flex C++ scanner class.
1.1 deraadt 3653:
1.16 jmc 3654: %{
3655: #include <errno.h>
3656: int mylineno = 0;
3657: %}
1.1 deraadt 3658:
1.16 jmc 3659: string \e"[^\en"]+\e"
1.1 deraadt 3660:
1.16 jmc 3661: ws [ \et]+
1.1 deraadt 3662:
1.16 jmc 3663: alpha [A-Za-z]
3664: dig [0-9]
3665: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
3666: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
3667: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
3668: number {num1}|{num2}
1.1 deraadt 3669:
1.16 jmc 3670: %%
1.1 deraadt 3671:
1.16 jmc 3672: {ws} /* skip blanks and tabs */
1.1 deraadt 3673:
1.16 jmc 3674: "/*" {
3675: int c;
1.1 deraadt 3676:
1.16 jmc 3677: while ((c = yyinput()) != 0) {
3678: if(c == '\en')
1.1 deraadt 3679: ++mylineno;
1.16 jmc 3680: else if(c == '*') {
3681: if ((c = yyinput()) == '/')
1.1 deraadt 3682: break;
3683: else
3684: unput(c);
3685: }
1.16 jmc 3686: }
3687: }
1.1 deraadt 3688:
1.16 jmc 3689: {number} cout << "number " << YYText() << '\en';
1.1 deraadt 3690:
1.16 jmc 3691: \en mylineno++;
1.1 deraadt 3692:
1.16 jmc 3693: {name} cout << "name " << YYText() << '\en';
1.1 deraadt 3694:
1.16 jmc 3695: {string} cout << "string " << YYText() << '\en';
3696:
3697: %%
3698:
3699: int main(int /* argc */, char** /* argv */)
3700: {
3701: FlexLexer* lexer = new yyFlexLexer;
3702: while(lexer->yylex() != 0)
3703: ;
3704: return 0;
3705: }
3706: .Ed
3707: .Pp
3708: To create multiple
3709: .Pq different
3710: lexer classes, use the
3711: .Fl P
3712: flag
3713: (or the
3714: .Dq prefix=
3715: option)
3716: to rename each
3717: .Fa yyFlexLexer
1.1 deraadt 3718: to some other
1.16 jmc 3719: .Fa xxFlexLexer .
1.38 bentley 3720: .In g++/FlexLexer.h
1.16 jmc 3721: can then be included in other sources once per lexer class, first renaming
3722: .Fa yyFlexLexer
1.1 deraadt 3723: as follows:
1.16 jmc 3724: .Bd -literal -offset indent
3725: #undef yyFlexLexer
3726: #define yyFlexLexer xxFlexLexer
3727: #include <g++/FlexLexer.h>
3728:
3729: #undef yyFlexLexer
3730: #define yyFlexLexer zzFlexLexer
3731: #include <g++/FlexLexer.h>
3732: .Ed
3733: .Pp
3734: The renaming above applies if, for example,
3735: .Dq %option prefix="xx"
3736: is used for one scanner and
3737: .Dq %option prefix="zz"
3738: is used for the other.
3739: .Pp
3740: .Sy IMPORTANT :
3741: the present form of the scanning class is experimental
1.7 aaron 3742: and may change considerably between major releases.
1.16 jmc 3743: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
3744: .Nm
1.25 sobrado 3745: is a rewrite of the
3746: .At
1.16 jmc 3747: .Nm lex
3748: tool
3749: (the two implementations do not share any code, though),
3750: with some extensions and incompatibilities, both of which are of concern
3751: to those who wish to write scanners acceptable to either implementation.
3752: .Nm
3753: is fully compliant with the
3754: .Tn POSIX
3755: .Nm lex
1.1 deraadt 3756: specification, except that when using
1.16 jmc 3757: .Dq %pointer
3758: .Pq the default ,
3759: a call to
3760: .Fn unput
1.1 deraadt 3761: destroys the contents of
1.16 jmc 3762: .Fa yytext ,
3763: which is counter to the
3764: .Tn POSIX
3765: specification.
3766: .Pp
3767: In this section we discuss all of the known areas of incompatibility between
3768: .Nm ,
1.36 schwarze 3769: .At
1.16 jmc 3770: .Nm lex ,
3771: and the
3772: .Tn POSIX
3773: specification.
3774: .Pp
3775: .Nm flex Ns 's
3776: .Fl l
1.36 schwarze 3777: option turns on maximum compatibility with the original
3778: .At
1.16 jmc 3779: .Nm lex
1.1 deraadt 3780: implementation, at the cost of a major loss in the generated scanner's
1.16 jmc 3781: performance.
3782: We note below which incompatibilities can be overcome using the
3783: .Fl l
1.1 deraadt 3784: option.
1.16 jmc 3785: .Pp
3786: .Nm
1.1 deraadt 3787: is fully compatible with
1.16 jmc 3788: .Nm lex
1.1 deraadt 3789: with the following exceptions:
1.16 jmc 3790: .Bl -dash
3791: .It
1.1 deraadt 3792: The undocumented
1.16 jmc 3793: .Nm lex
1.1 deraadt 3794: scanner internal variable
1.16 jmc 3795: .Fa yylineno
1.1 deraadt 3796: is not supported unless
1.16 jmc 3797: .Fl l
1.1 deraadt 3798: or
1.16 jmc 3799: .Dq %option yylineno
1.1 deraadt 3800: is used.
1.16 jmc 3801: .Pp
3802: .Fa yylineno
1.1 deraadt 3803: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 jmc 3804: .Pq single global variable
3805: basis.
3806: .Pp
3807: .Fa yylineno
3808: is not part of the
3809: .Tn POSIX
3810: specification.
3811: .It
1.1 deraadt 3812: The
1.16 jmc 3813: .Fn input
1.1 deraadt 3814: routine is not redefinable, though it may be called to read characters
1.16 jmc 3815: following whatever has been matched by a rule.
3816: If
3817: .Fn input
3818: encounters an end-of-file, the normal
3819: .Fn yywrap
3820: processing is done.
3821: A
3822: .Dq real
3823: end-of-file is returned by
3824: .Fn input
1.1 deraadt 3825: as
1.16 jmc 3826: .Dv EOF .
3827: .Pp
1.1 deraadt 3828: Input is instead controlled by defining the
1.16 jmc 3829: .Dv YY_INPUT
1.1 deraadt 3830: macro.
1.16 jmc 3831: .Pp
1.1 deraadt 3832: The
1.16 jmc 3833: .Nm
1.1 deraadt 3834: restriction that
1.16 jmc 3835: .Fn input
3836: cannot be redefined is in accordance with the
3837: .Tn POSIX
3838: specification, which simply does not specify any way of controlling the
1.1 deraadt 3839: scanner's input other than by making an initial assignment to
1.16 jmc 3840: .Fa yyin .
3841: .It
1.1 deraadt 3842: The
1.16 jmc 3843: .Fn unput
3844: routine is not redefinable.
3845: This restriction is in accordance with
3846: .Tn POSIX .
3847: .It
3848: .Nm
1.1 deraadt 3849: scanners are not as reentrant as
1.16 jmc 3850: .Nm lex
3851: scanners.
3852: In particular, if a scanner is interactive and
3853: an interrupt handler long-jumps out of the scanner,
3854: and the scanner is subsequently called again,
3855: the following error message may be displayed:
3856: .Pp
3857: .D1 fatal flex scanner internal error--end of buffer missed
3858: .Pp
1.1 deraadt 3859: To reenter the scanner, first use
1.16 jmc 3860: .Pp
3861: .Dl yyrestart(yyin);
3862: .Pp
3863: Note that this call will throw away any buffered input;
3864: usually this isn't a problem with an interactive scanner.
3865: .Pp
3866: Also note that flex C++ scanner classes are reentrant,
3867: so if using C++ is an option, they should be used instead.
3868: See
3869: .Sx GENERATING C++ SCANNERS
3870: above for details.
3871: .It
3872: .Fn output
1.1 deraadt 3873: is not supported.
3874: Output from the
1.16 jmc 3875: .Em ECHO
1.1 deraadt 3876: macro is done to the file-pointer
1.16 jmc 3877: .Fa yyout
3878: .Pq default stdout .
3879: .Pp
3880: .Fn output
3881: is not part of the
3882: .Tn POSIX
3883: specification.
3884: .It
3885: .Nm lex
3886: does not support exclusive start conditions
3887: .Pq %x ,
3888: though they are in the
3889: .Tn POSIX
3890: specification.
3891: .It
1.1 deraadt 3892: When definitions are expanded,
1.16 jmc 3893: .Nm
1.1 deraadt 3894: encloses them in parentheses.
1.16 jmc 3895: With
3896: .Nm lex ,
3897: the following:
3898: .Bd -literal -offset indent
3899: NAME [A-Z][A-Z0-9]*
3900: %%
3901: foo{NAME}? printf("Found it\en");
3902: %%
3903: .Ed
3904: .Pp
3905: will not match the string
3906: .Qq foo
3907: because when the macro is expanded the rule is equivalent to
3908: .Qq foo[A-Z][A-Z0-9]*?
3909: and the precedence is such that the
3910: .Sq ?\&
3911: is associated with
3912: .Qq [A-Z0-9]* .
3913: With
3914: .Nm ,
1.1 deraadt 3915: the rule will be expanded to
1.16 jmc 3916: .Qq foo([A-Z][A-Z0-9]*)?
3917: and so the string
3918: .Qq foo
3919: will match.
3920: .Pp
1.1 deraadt 3921: Note that if the definition begins with
1.16 jmc 3922: .Sq ^
1.1 deraadt 3923: or ends with
1.16 jmc 3924: .Sq $
3925: then it is not expanded with parentheses, to allow these operators to appear in
3926: definitions without losing their special meanings.
3927: But the
3928: .Sq Aq s ,
3929: .Sq / ,
1.1 deraadt 3930: and
1.16 jmc 3931: .Aq Aq EOF
1.1 deraadt 3932: operators cannot be used in a
1.16 jmc 3933: .Nm
1.1 deraadt 3934: definition.
1.16 jmc 3935: .Pp
1.1 deraadt 3936: Using
1.16 jmc 3937: .Fl l
1.1 deraadt 3938: results in the
1.16 jmc 3939: .Nm lex
1.1 deraadt 3940: behavior of no parentheses around the definition.
1.16 jmc 3941: .Pp
3942: The
3943: .Tn POSIX
3944: specification is that the definition be enclosed in parentheses.
3945: .It
1.1 deraadt 3946: Some implementations of
1.16 jmc 3947: .Nm lex
3948: allow a rule's action to begin on a separate line,
3949: if the rule's pattern has trailing whitespace:
3950: .Bd -literal -offset indent
3951: %%
3952: foo|bar<space here>
3953: { foobar_action(); }
3954: .Ed
3955: .Pp
3956: .Nm
1.1 deraadt 3957: does not support this feature.
1.16 jmc 3958: .It
1.1 deraadt 3959: The
1.16 jmc 3960: .Nm lex
3961: .Sq %r
3962: .Pq generate a Ratfor scanner
3963: option is not supported.
3964: It is not part of the
3965: .Tn POSIX
3966: specification.
3967: .It
1.1 deraadt 3968: After a call to
1.16 jmc 3969: .Fn unput ,
3970: .Fa yytext
3971: is undefined until the next token is matched,
3972: unless the scanner was built using
3973: .Dq %array .
1.1 deraadt 3974: This is not the case with
1.16 jmc 3975: .Nm lex
3976: or the
3977: .Tn POSIX
3978: specification.
3979: The
3980: .Fl l
1.1 deraadt 3981: option does away with this incompatibility.
1.16 jmc 3982: .It
1.1 deraadt 3983: The precedence of the
1.16 jmc 3984: .Sq {}
3985: .Pq numeric range
3986: operator is different.
3987: .Nm lex
3988: interprets
3989: .Qq abc{1,3}
3990: as match one, two, or three occurrences of
3991: .Sq abc ,
3992: whereas
3993: .Nm
3994: interprets it as match
3995: .Sq ab
3996: followed by one, two, or three occurrences of
3997: .Sq c .
3998: The latter is in agreement with the
3999: .Tn POSIX
4000: specification.
4001: .It
1.1 deraadt 4002: The precedence of the
1.16 jmc 4003: .Sq ^
1.1 deraadt 4004: operator is different.
1.16 jmc 4005: .Nm lex
4006: interprets
4007: .Qq ^foo|bar
4008: as match either
4009: .Sq foo
4010: at the beginning of a line, or
4011: .Sq bar
4012: anywhere, whereas
4013: .Nm
4014: interprets it as match either
4015: .Sq foo
4016: or
4017: .Sq bar
4018: if they come at the beginning of a line.
4019: The latter is in agreement with the
4020: .Tn POSIX
4021: specification.
4022: .It
1.1 deraadt 4023: The special table-size declarations such as
1.16 jmc 4024: .Sq %a
1.1 deraadt 4025: supported by
1.16 jmc 4026: .Nm lex
1.1 deraadt 4027: are not required by
1.16 jmc 4028: .Nm
1.1 deraadt 4029: scanners;
1.16 jmc 4030: .Nm
1.1 deraadt 4031: ignores them.
1.16 jmc 4032: .It
1.1 deraadt 4033: The name
1.16 jmc 4034: .Dv FLEX_SCANNER
1.1 deraadt 4035: is #define'd so scanners may be written for use with either
1.16 jmc 4036: .Nm
1.1 deraadt 4037: or
1.16 jmc 4038: .Nm lex .
1.1 deraadt 4039: Scanners also include
1.16 jmc 4040: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4041: and
1.16 jmc 4042: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4043: indicating which version of
1.16 jmc 4044: .Nm
1.1 deraadt 4045: generated the scanner
1.16 jmc 4046: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4047: respectively).
1.16 jmc 4048: .El
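.Pp
Two of the precedence differences in the list above,
the parentheses added around expanded definitions and the binding of the
interval operator, can be checked with a short C sketch.
It assumes that POSIX extended regular expressions, compiled with regcomp,
are an adequate stand-in for these particular operators;
the pattern language of flex differs from them in other respects.
The helper matches and the four pattern names are hypothetical,
and the lex readings are written out with explicit parentheses.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if text contains a match for the POSIX extended regular
 * expression pattern, 0 if not, and -1 if pattern fails to compile.
 * Callers anchor patterns with ^ and $ to test the whole string.
 */
static int matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Rule foo{NAME}? with NAME defined as [A-Z][A-Z0-9]* */
/* flex reading: the whole definition is optional. */
static const char flex_expansion[] = "^foo([A-Z][A-Z0-9]*)?$";
/* lex reading: only the trailing [A-Z0-9]* is optional. */
static const char lex_expansion[] = "^foo[A-Z]([A-Z0-9]*)?$";

/* Pattern abc{1,3} */
/* flex reading: the interval applies to the final c only. */
static const char flex_interval[] = "^abc{1,3}$";
/* lex reading: the interval applies to all of abc. */
static const char lex_interval[] = "^(abc){1,3}$";
```

Under these assumptions, matches(flex_expansion, "foo") reports a match
while matches(lex_expansion, "foo") does not,
and matches(flex_interval, "abccc") reports a match
while matches(flex_interval, "abcabc") does not.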
4049: .Pp
1.1 deraadt 4050: The following
1.16 jmc 4051: .Nm
1.1 deraadt 4052: features are not included in
1.16 jmc 4053: .Nm lex
4054: or the
4055: .Tn POSIX
4056: specification:
4057: .Bd -unfilled -offset indent
4058: C++ scanners
4059: %option
4060: start condition scopes
4061: start condition stacks
4062: interactive/non-interactive scanners
4063: yy_scan_string() and friends
4064: yyterminate()
4065: yy_set_interactive()
4066: yy_set_bol()
4067: YY_AT_BOL()
4068: <<EOF>>
4069: <*>
4070: YY_DECL
4071: YY_START
4072: YY_USER_ACTION
4073: YY_USER_INIT
4074: #line directives
4075: %{}'s around actions
4076: multiple actions on a line
4077: .Ed
4078: .Pp
4079: plus almost all of the
4080: .Nm
4081: flags.
1.1 deraadt 4082: The last feature in the list refers to the fact that with
1.16 jmc 4083: .Nm
1.37 jmc 4084: multiple actions can be placed on the same line,
1.16 jmc 4085: separated with semi-colons, while with
4086: .Nm lex ,
1.1 deraadt 4087: the following
1.16 jmc 4088: .Pp
4089: .Dl foo handle_foo(); ++num_foos_seen;
4090: .Pp
4091: is
4092: .Pq rather surprisingly
4093: truncated to
4094: .Pp
4095: .Dl foo handle_foo();
4096: .Pp
4097: .Nm
4098: does not truncate the action.
4099: Actions that are not enclosed in braces
4100: are simply terminated at the end of the line.
4101: .Sh FILES
4102: .Bl -tag -width "<g++/FlexLexer.h>"
1.41 sobrado 4103: .It Pa flex.skl
1.16 jmc 4104: Skeleton scanner.
4105: This file is only used when building flex, not when
4106: .Nm
4107: executes.
1.41 sobrado 4108: .It Pa lex.backup
1.16 jmc 4109: Backing-up information for the
4110: .Fl b
4111: flag (called
4112: .Pa lex.bck
4113: on some systems).
1.41 sobrado 4114: .It Pa lex.yy.c
1.16 jmc 4115: Generated scanner
4116: (called
4117: .Pa lexyy.c
4118: on some systems).
1.41 sobrado 4119: .It Pa lex.yy.cc
1.16 jmc 4120: Generated C++ scanner class, when using
4121: .Fl + .
1.38 bentley 4122: .It In g++/FlexLexer.h
1.16 jmc 4123: Header file defining the C++ scanner base class,
4124: .Fa FlexLexer ,
4125: and its derived class,
4126: .Fa yyFlexLexer .
1.41 sobrado 4127: .It Pa /usr/lib/libl.*
1.16 jmc 4128: .Nm
4129: libraries.
4130: The
4131: .Pa /usr/lib/libfl.*\&
4132: libraries are links to these.
4133: Scanners must be linked using either
4134: .Fl \&ll
4135: or
4136: .Fl lfl .
4137: .El
1.29 jmc 4138: .Sh EXIT STATUS
4139: .Ex -std flex
1.16 jmc 4140: .Sh DIAGNOSTICS
4141: .Bl -diag
4142: .It warning, rule cannot be matched
4143: Indicates that the given rule cannot be matched because it follows other rules
4144: that always match the same text.
4145: For example, in the following
4146: .Dq foo
4147: cannot be matched because it comes after an identifier
4148: .Qq catch-all
4149: rule:
4150: .Bd -literal -offset indent
4151: [a-z]+ got_identifier();
4152: foo got_foo();
4153: .Ed
4154: .Pp
1.1 deraadt 4155: Using
1.16 jmc 4156: .Em REJECT
1.1 deraadt 4157: in a scanner suppresses this warning.
1.16 jmc 4158: .It "warning, \-s option given but default rule can be matched"
4159: Means that it is possible
4160: .Pq perhaps only in a particular start condition
4161: that the default rule
4162: .Pq match any single character
4163: is the only one that will match a particular input.
4164: Since
4165: .Fl s
1.1 deraadt 4166: was given, presumably this is not intended.
1.16 jmc 4167: .It reject_used_but_not_detected undefined
4168: .It yymore_used_but_not_detected undefined
4169: These errors can occur at compile time.
4170: They indicate that the scanner uses
4171: .Em REJECT
1.1 deraadt 4172: or
1.16 jmc 4173: .Fn yymore
1.1 deraadt 4174: but that
1.16 jmc 4175: .Nm
1.1 deraadt 4176: failed to notice the fact, meaning that
1.16 jmc 4177: .Nm
1.1 deraadt 4178: scanned the first two sections looking for occurrences of these actions
1.16 jmc 4179: and failed to find any, but somehow they snuck in
4180: .Pq via an #include file, for example .
4181: Use
4182: .Dq %option reject
4183: or
4184: .Dq %option yymore
4185: to indicate to
4186: .Nm
4187: that these features are really needed.
4188: .It flex scanner jammed
4189: A scanner compiled with
4190: .Fl s
4191: has encountered an input string which wasn't matched by any of its rules.
4192: This error can also occur due to internal problems.
4193: .It token too large, exceeds YYLMAX
4194: The scanner uses
4195: .Dq %array
1.1 deraadt 4196: and one of its rules matched a string longer than the
1.16 jmc 4197: .Dv YYLMAX
4198: constant
4199: .Pq 8K bytes by default .
4200: The value can be increased by #define'ing
4201: .Dv YYLMAX
4202: in the definitions section of
4203: .Nm
1.1 deraadt 4204: input.
1.16 jmc 4205: .It "scanner requires \-8 flag to use the character 'x'"
4206: The scanner specification includes recognizing the 8-bit character
4207: .Sq x
4208: but the
4209: .Fl 8
4210: flag was not specified, so the scanner defaulted to 7-bit because the
4211: .Fl Cf
4212: or
4213: .Fl CF
4214: table compression options were used.
4215: See the discussion of the
4216: .Fl 7
1.1 deraadt 4217: flag for details.
1.16 jmc 4218: .It flex scanner push-back overflow
4219: unput() was used to push back so much text that the scanner's buffer
4220: could not hold both the pushed-back text and the current token in
4221: .Fa yytext .
4222: Ideally the scanner should dynamically resize the buffer in this case,
4223: but at present it does not.
4224: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4225: The scanner was working on matching an extremely large token and needed
4226: to expand the input buffer.
4227: This doesn't work with scanners that use
4228: .Em REJECT .
4229: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4230: This can occur in a scanner which is reentered after a long-jump
1.16 jmc 4231: has jumped out
4232: .Pq or over
4233: the scanner's activation frame.
4234: Before reentering the scanner, use:
4235: .Pp
4236: .Dl yyrestart(yyin);
4237: .Pp
1.1 deraadt 4238: or, as noted above, switch to using the C++ scanner class.
1.16 jmc 4239: .It "too many start conditions in <> construct!"
4240: More start conditions than exist were listed in a <> construct
4241: (so at least one of them must have been listed twice).
4242: .El
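.Pp
A minimal sketch of the recovery pattern described for the
end of buffer missed diagnostic, assuming stand-in functions:
fake_yylex() and fake_yyrestart() are hypothetical substitutes for the
generated yylex() and yyrestart(), so that the fragment compiles without
a scanner.
In a real program the generated functions would be called instead.

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf recover;		/* where error handling resumes */
int restart_count;		/* calls made to the stand-in below */

/* Stand-in for the generated yyrestart(); it only counts calls. */
static void
fake_yyrestart(FILE *fp)
{
	(void)fp;
	restart_count++;
}

/* Stand-in for yylex(): the first call bails out through longjmp(). */
static int
fake_yylex(void)
{
	static int calls;

	if (calls++ == 0)
		longjmp(recover, 1);
	return 0;		/* 0 signals end of input, as with yylex() */
}

/* Hypothetical driver: scan, resetting the scanner after a long jump. */
int
scan_with_recovery(FILE *in)
{
	if (setjmp(recover) != 0) {
		/* The scanner was abandoned mid-call: reset before reentry. */
		fake_yyrestart(in);
	}
	return fake_yylex();
}
```

The point of the sketch is only the ordering: the restart call sits on the
path taken after the long jump and runs before the scanner is entered again.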
4243: .Sh SEE ALSO
4244: .Xr awk 1 ,
4245: .Xr sed 1 ,
4246: .Xr yacc 1
4247: .Rs
4248: .%A John Levine
4249: .%A Tony Mason
4250: .%A Doug Brown
4251: .%B Lex & Yacc
4252: .%I O'Reilly and Associates
4253: .%N 2nd edition
4254: .Re
4255: .Rs
4256: .%A Alfred Aho
4257: .%A Ravi Sethi
4258: .%A Jeffrey Ullman
4259: .%B Compilers: Principles, Techniques and Tools
4260: .%I Addison-Wesley
4261: .%D 1986
4262: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4263: .Re
1.23 jmc 4264: .Sh STANDARDS
4265: The
4266: .Nm lex
4267: utility is compliant with the
4268: .St -p1003.1-2008
4269: specification,
4270: though its presence is optional.
4271: .Pp
4272: The flags
1.31 jmc 4273: .Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
1.23 jmc 4274: .Op Fl -help ,
4275: and
4276: .Op Fl -version
4277: are extensions to that specification.
1.37 jmc 4278: .Pp
4279: See also the
4280: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
4281: section, above.
1.16 jmc 4282: .Sh AUTHORS
1.1 deraadt 4283: Vern Paxson, with the help of many ideas and much inspiration from
1.16 jmc 4284: Van Jacobson.
4285: Original version by Jef Poskanzer.
4286: The fast table representation is a partial implementation of a design done by
4287: Van Jacobson.
4288: The implementation was done by Kevin Gong and Vern Paxson.
4289: .Pp
1.1 deraadt 4290: Thanks to the many
1.16 jmc 4291: .Nm
1.1 deraadt 4292: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4293: Casey Leedom,
4294: Robert Abramovitz,
4295: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
1.39 bentley 4296: Neal Becker, Nelson H.F. Beebe,
4297: .Mt benson@odi.com ,
1.1 deraadt 4298: Karl Berry, Peter A. Bigot, Simon Blanchard,
4299: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4300: Brian Clapper, J.T. Conklin,
4301: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4302: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4303: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4304: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4305: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4306: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4307: Jan Hajic, Charles Hemphill, NORO Hideo,
4308: Jarkko Hietaniemi, Scott Hofmann,
4309: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4310: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4311: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
1.39 bentley 4312: Amir Katz,
4313: .Mt ken@ken.hilco.com ,
4314: Kevin B. Kenny,
1.1 deraadt 4315: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4316: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4317: David Loffredo, Mike Long,
4318: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4319: Bengt Martensson, Chris Metcalf,
4320: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4321: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4322: Richard Ohnemus, Karsten Pahnke,
1.16 jmc 4323: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4324: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4325: Frederic Raimbault, Pat Rankin, Rick Richardson,
4326: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4327: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4328: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4329: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4330: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4331: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 jmc 4332: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4333: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4334: and those whose names have slipped my marginal mail-archiving skills
4335: but whose contributions are appreciated all the
1.1 deraadt 4336: same.
1.16 jmc 4337: .Pp
1.1 deraadt 4338: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4339: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4340: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4341: distribution headaches.
1.16 jmc 4342: .Pp
4343: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4344: to Benson Margulies and Fred Burke for C++ support;
4345: to Kent Williams and Tom Epperly for C++ class support;
4346: to Ove Ewerlid for support of NUL's;
4347: and to Eric Hughes for support of multiple buffers.
4348: .Pp
1.1 deraadt 4349: This work was primarily done when I was with the Real Time Systems Group
1.16 jmc 4350: at the Lawrence Berkeley Laboratory in Berkeley, CA.
4351: Many thanks to all there for the support I received.
4352: .Pp
4353: Send comments to
1.34 schwarze 4354: .Aq Mt vern@ee.lbl.gov .
1.16 jmc 4355: .Sh BUGS
4356: Some trailing context patterns cannot be properly matched and generate
4357: warning messages
4358: .Pq "dangerous trailing context" .
4359: These are patterns where the ending of the first part of the rule
4360: matches the beginning of the second part, such as
4361: .Qq zx*/xy* ,
4362: where the
4363: .Sq x*
4364: matches the
4365: .Sq x
4366: at the beginning of the trailing context.
4367: (Note that the POSIX draft states that the text matched by such patterns
4368: is undefined.)
4369: .Pp
4370: For some trailing context rules, parts which are actually fixed-length are
4371: not recognized as such, leading to the above-mentioned performance loss.
4372: In particular, parts using
4373: .Sq |\&
4374: or
4375: .Sq {n}
4376: (such as
4377: .Qq foo{3} )
4378: are always considered variable-length.
4379: .Pp
4380: Combining trailing context with the special
4381: .Sq |\&
4382: action can result in fixed trailing context being turned into
4383: the more expensive variable trailing context.
4384: This happens, for example, in the following:
4385: .Bd -literal -offset indent
4386: %%
4387: abc |
4388: xyz/def
4389: .Ed
4390: .Pp
4391: Use of
4392: .Fn unput
4393: invalidates yytext and yyleng, unless the
4394: .Dq %array
4395: directive
4396: or the
4397: .Fl l
4398: option has been used.
4399: .Pp
4400: Pattern-matching of NUL's is substantially slower than matching other
4401: characters.
4402: .Pp
4403: Dynamic resizing of the input buffer is slow, as it entails rescanning
4404: all the text matched so far by the current
4405: .Pq generally huge
4406: token.
4407: .Pp
4408: Due to both buffering of input and read-ahead,
4409: it is not possible to intermix calls to
1.38 bentley 4410: .In stdio.h
1.16 jmc 4411: routines, such as
4412: .Fn getchar ,
4413: with
4414: .Nm
4415: rules and expect it to work.
4416: Call
4417: .Fn input
4418: instead.
4419: .Pp
4420: The total number of table entries listed by the
4421: .Fl v
4422: flag excludes the number of table entries needed to determine
4423: what rule has been matched.
4424: The number of entries is equal to the number of DFA states
4425: if the scanner does not use
4426: .Em REJECT ,
4427: and somewhat greater than the number of states if it does.
4428: .Pp
4429: .Em REJECT
4430: cannot be used with the
4431: .Fl f
4432: or
4433: .Fl F
4434: options.
4435: .Pp
4436: The
4437: .Nm
4438: internal algorithms need documentation.