Annotation of src/usr.bin/lex/flex.1, Revision 1.42
1.42 ! nicm 1: .\" $OpenBSD: flex.1,v 1.41 2015/09/07 15:28:06 sobrado Exp $
1.16 jmc 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 jmc 31: .\"
1.42 ! nicm 32: .Dd $Mdocdate: September 7 2015 $
1.16 jmc 33: .Dt FLEX 1
34: .Os
35: .Sh NAME
1.42 ! nicm 36: .Nm flex ,
! 37: .Nm flex++ ,
! 38: .Nm lex
1.16 jmc 39: .Nd fast lexical analyzer generator
40: .Sh SYNOPSIS
41: .Nm
1.28 jmc 42: .Bk -words
1.31 jmc 43: .Op Fl 78BbdFfhIiLlnpsTtVvw+?
1.16 jmc 44: .Op Fl C Ns Op Cm aeFfmr
45: .Op Fl Fl help
46: .Op Fl Fl version
1.28 jmc 47: .Op Fl o Ns Ar output
48: .Op Fl P Ns Ar prefix
49: .Op Fl S Ns Ar skeleton
50: .Op Ar
51: .Ek
1.21 jmc 52: .Sh DESCRIPTION
53: .Nm
54: is a tool for generating
55: .Em scanners :
56: programs which recognize lexical patterns in text.
57: .Nm
58: reads the given input files, or its standard input if no file names are given,
59: for a description of a scanner to generate.
60: The description is in the form of pairs of regular expressions and C code,
61: called
62: .Em rules .
63: .Nm
64: generates as output a C source file,
65: .Pa lex.yy.c ,
66: which defines a routine
67: .Fn yylex .
68: This file is compiled and linked with the
69: .Fl lfl
70: library to produce an executable.
71: When the executable is run, it analyzes its input for occurrences
72: of the regular expressions.
73: Whenever it finds one, it executes the corresponding C code.
1.42 ! nicm 74: .Pp
! 75: .Nm lex
! 76: is a synonym for
! 77: .Nm flex .
! 78: .Pp
! 79: .Nm flex++
! 80: is a synonym for
! 81: .Nm
! 82: .Fl + .
1.21 jmc 83: .Pp
1.16 jmc 84: The manual includes both tutorial and reference sections:
85: .Bl -ohang
86: .It Sy Some Simple Examples
87: .It Sy Format of the Input File
88: .It Sy Patterns
89: The extended regular expressions used by
90: .Nm .
91: .It Sy How the Input is Matched
92: The rules for determining what has been matched.
93: .It Sy Actions
94: How to specify what to do when a pattern is matched.
95: .It Sy The Generated Scanner
96: Details regarding the scanner that
97: .Nm
98: produces;
99: how to control the input source.
100: .It Sy Start Conditions
101: Introducing context into scanners, and managing
102: .Qq mini-scanners .
103: .It Sy Multiple Input Buffers
104: How to manipulate multiple input sources;
105: how to scan from strings instead of files.
106: .It Sy End-of-File Rules
107: Special rules for matching the end of the input.
108: .It Sy Miscellaneous Macros
109: A summary of macros available to the actions.
110: .It Sy Values Available to the User
111: A summary of values available to the actions.
112: .It Sy Interfacing with Yacc
113: Connecting flex scanners together with
114: .Xr yacc 1
115: parsers.
116: .It Sy Options
117: .Nm
118: command-line options, and the
119: .Dq %option
120: directive.
121: .It Sy Performance Considerations
122: How to make scanners go as fast as possible.
123: .It Sy Generating C++ Scanners
124: The
125: .Pq experimental
126: facility for generating C++ scanner classes.
127: .It Sy Incompatibilities with Lex and POSIX
128: How
129: .Nm
1.36 schwarze 130: differs from
131: .At
132: .Nm lex
133: and the
1.16 jmc 134: .Tn POSIX
1.36 schwarze 135: .Nm lex
136: standard.
1.16 jmc 137: .It Sy Files
138: Files used by
139: .Nm .
140: .It Sy Diagnostics
141: Those error messages produced by
142: .Nm
143: .Pq or scanners it generates
144: whose meanings might not be apparent.
145: .It Sy See Also
146: Other documentation, related tools.
147: .It Sy Authors
148: Includes contact information.
149: .It Sy Bugs
150: Known problems with
151: .Nm .
152: .El
153: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 154: First some simple examples to get the flavor of how one uses
1.16 jmc 155: .Nm .
1.1 deraadt 156: The following
1.16 jmc 157: .Nm
1.1 deraadt 158: input specifies a scanner which, whenever it encounters the string
1.16 jmc 159: .Qq username ,
160: will replace it with the user's login name:
161: .Bd -literal -offset indent
162: %%
163: username printf("%s", getlogin());
164: .Ed
165: .Pp
1.1 deraadt 166: By default, any text not matched by a
1.16 jmc 167: .Nm
168: scanner is copied to the output, so the net effect of this scanner is
169: to copy its input file to its output with each occurrence of
170: .Qq username
171: expanded.
172: In this input, there is just one rule.
173: .Qq username
174: is the
175: .Em pattern
176: and the
177: .Qq printf
178: is the
179: .Em action .
180: The
181: .Qq %%
182: marks the beginning of the rules.
183: .Pp
1.1 deraadt 184: Here's another simple example:
1.16 jmc 185: .Bd -literal -offset indent
1.20 pvalchev 186: %{
1.16 jmc 187: int num_lines = 0, num_chars = 0;
1.20 pvalchev 188: %}
1.1 deraadt 189:
1.16 jmc 190: %%
191: \en ++num_lines; ++num_chars;
192: \&. ++num_chars;
193:
194: %%
195: int main(void)
196: {
197: yylex();
198: printf("# of lines = %d, # of chars = %d\en",
199: num_lines, num_chars);
200: }
201: .Ed
202: .Pp
1.1 deraadt 203: This scanner counts the number of characters and the number
1.16 jmc 204: of lines in its input
205: (it produces no output other than the final report on the counts).
206: The first line declares two globals,
207: .Qq num_lines
208: and
209: .Qq num_chars ,
210: which are accessible both inside
211: .Fn yylex
1.1 deraadt 212: and in the
1.16 jmc 213: .Fn main
214: routine declared after the second
215: .Qq %% .
216: There are two rules, one which matches a newline
217: .Pq \&"\en\&"
218: and increments both the line count and the character count,
219: and one which matches any character other than a newline
220: (indicated by the
221: .Qq \&.
222: regular expression).
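.Pp
The counting logic of these two rules can be checked by hand with a rough plain C sketch.
It is an illustration only, not the code that flex generates;
the names struct counts and count_text are hypothetical and do not appear in flex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; mirrors the num_lines and num_chars globals. */
struct counts {
	int lines;
	int chars;
};

/*
 * Count the way the two rules do: a newline bumps both counters,
 * any other character bumps only the character counter.
 */
struct counts
count_text(const char *s)
{
	struct counts c = { 0, 0 };

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (*s == '\n')
			++c.lines;
		++c.chars;
	}
	return c;
}
```

Calling count_text() on a two-line string such as "ab\encd\en" should report two lines and six characters,
which is what the scanner above would print for the same input.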
223: .Pp
1.1 deraadt 224: A somewhat more complicated example:
1.16 jmc 225: .Bd -literal -offset indent
226: /* scanner for a toy Pascal-like language */
1.1 deraadt 227:
1.16 jmc 228: %{
229: /* need this for the call to atof() below */
230: #include <stdlib.h>
231: %}
1.1 deraadt 232:
1.16 jmc 233: DIGIT [0-9]
234: ID [a-z][a-z0-9]*
1.1 deraadt 235:
1.16 jmc 236: %%
1.1 deraadt 237:
1.16 jmc 238: {DIGIT}+ {
239: printf("An integer: %s (%d)\en", yytext,
240: atoi(yytext));
241: }
1.1 deraadt 242:
1.16 jmc 243: {DIGIT}+"."{DIGIT}* {
244: printf("A float: %s (%g)\en", yytext,
245: atof(yytext));
246: }
1.1 deraadt 247:
1.16 jmc 248: if|then|begin|end|procedure|function {
249: printf("A keyword: %s\en", yytext);
250: }
1.1 deraadt 251:
1.16 jmc 252: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 253:
1.16 jmc 254: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 255:
1.16 jmc 256: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 257:
1.16 jmc 258: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 259:
1.16 jmc 260: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 261:
1.16 jmc 262: %%
1.1 deraadt 263:
1.16 jmc 264: int main(int argc, char *argv[])
265: {
266: ++argv; --argc; /* skip over program name */
267: if (argc > 0)
268: yyin = fopen(argv[0], "r");
1.1 deraadt 269: else
270: yyin = stdin;
1.7 aaron 271:
1.1 deraadt 272: yylex();
1.16 jmc 273: }
274: .Ed
275: .Pp
276: This is the beginning of a simple scanner for a language like Pascal.
277: It identifies different types of
278: .Em tokens
1.1 deraadt 279: and reports on what it has seen.
1.16 jmc 280: .Pp
281: The details of this example will be explained in the following sections.
282: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 283: The
1.16 jmc 284: .Nm
1.1 deraadt 285: input file consists of three sections, separated by a line with just
1.16 jmc 286: .Qq %%
1.1 deraadt 287: in it:
1.16 jmc 288: .Bd -unfilled -offset indent
289: definitions
290: %%
291: rules
292: %%
293: user code
294: .Ed
295: .Pp
1.1 deraadt 296: The
1.16 jmc 297: .Em definitions
1.1 deraadt 298: section contains declarations of simple
1.16 jmc 299: .Em name
1.1 deraadt 300: definitions to simplify the scanner specification, and declarations of
1.16 jmc 301: .Em start conditions ,
1.1 deraadt 302: which are explained in a later section.
1.16 jmc 303: .Pp
1.1 deraadt 304: Name definitions have the form:
1.16 jmc 305: .Pp
306: .D1 name definition
307: .Pp
308: The
309: .Qq name
310: is a word beginning with a letter or an underscore
311: .Pq Sq _
312: followed by zero or more letters, digits,
313: .Sq _ ,
314: or
315: .Sq -
316: .Pq dash .
1.8 aaron 317: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 318: following the name and to continue to the end of the line.
1.16 jmc 319: The definition can subsequently be referred to using
320: .Qq {name} ,
321: which will expand to
322: .Qq (definition) .
323: For example:
324: .Bd -literal -offset indent
325: DIGIT [0-9]
326: ID [a-z][a-z0-9]*
327: .Ed
328: .Pp
329: This defines
330: .Qq DIGIT
331: to be a regular expression which matches a single digit, and
332: .Qq ID
333: to be a regular expression which matches a letter
1.1 deraadt 334: followed by zero-or-more letters-or-digits.
335: A subsequent reference to
1.16 jmc 336: .Pp
337: .Dl {DIGIT}+"."{DIGIT}*
338: .Pp
1.1 deraadt 339: is identical to
1.16 jmc 340: .Pp
341: .Dl ([0-9])+"."([0-9])*
342: .Pp
343: and matches one-or-more digits followed by a
344: .Sq .\&
345: followed by zero-or-more digits.
346: .Pp
1.1 deraadt 347: The
1.16 jmc 348: .Em rules
1.1 deraadt 349: section of the
1.16 jmc 350: .Nm
1.1 deraadt 351: input contains a series of rules of the form:
1.16 jmc 352: .Pp
1.35 schwarze 353: .Dl pattern action
1.16 jmc 354: .Pp
355: The pattern must be unindented and the action must begin
1.1 deraadt 356: on the same line.
1.16 jmc 357: .Pp
1.1 deraadt 358: See below for a further description of patterns and actions.
1.16 jmc 359: .Pp
1.1 deraadt 360: Finally, the user code section is simply copied to
1.16 jmc 361: .Pa lex.yy.c
1.1 deraadt 362: verbatim.
1.16 jmc 363: It is used for companion routines which call or are called by the scanner.
364: The presence of this section is optional;
1.1 deraadt 365: if it is missing, the second
1.16 jmc 366: .Qq %%
367: in the input file may be skipped too.
368: .Pp
369: In the definitions and rules sections, any indented text or text enclosed in
370: .Sq %{
1.1 deraadt 371: and
1.16 jmc 372: .Sq %}
373: is copied verbatim to the output
374: .Pq with the %{}'s removed .
1.1 deraadt 375: The %{}'s must appear unindented on lines by themselves.
1.16 jmc 376: .Pp
1.1 deraadt 377: In the rules section,
1.16 jmc 378: any indented or %{} text appearing before the first rule may be used to
379: declare variables which are local to the scanning routine and
380: .Pq after the declarations
1.1 deraadt 381: code which is to be executed whenever the scanning routine is entered.
382: Other indented or %{} text in the rules section is still copied to the output,
383: but its meaning is not well-defined and it may well cause compile-time
384: errors (this feature is present for
1.16 jmc 385: .Tn POSIX
1.1 deraadt 386: compliance; see below for other such features).
1.16 jmc 387: .Pp
388: In the definitions section
389: .Pq but not in the rules section ,
390: an unindented comment
391: (i.e., a line beginning with
392: .Qq /* )
393: is also copied verbatim to the output up to the next
394: .Qq */ .
395: .Sh PATTERNS
1.1 deraadt 396: The patterns in the input are written using an extended set of regular
1.16 jmc 397: expressions.
398: These are:
399: .Bl -tag -width "XXXXXXXX"
400: .It x
401: Match the character
402: .Sq x .
403: .It .\&
404: Any character
405: .Pq byte
406: except newline.
407: .It [xyz]
408: A
409: .Qq character class ;
410: in this case, the pattern matches either an
411: .Sq x ,
412: a
413: .Sq y ,
414: or a
415: .Sq z .
416: .It [abj-oZ]
417: A
418: .Qq character class
419: with a range in it; matches an
420: .Sq a ,
421: a
422: .Sq b ,
423: any letter from
424: .Sq j
425: through
426: .Sq o ,
427: or a
428: .Sq Z .
429: .It [^A-Z]
430: A
431: .Qq negated character class ,
432: i.e., any character but those in the class.
433: In this case, any character EXCEPT an uppercase letter.
434: .It [^A-Z\en]
435: Any character EXCEPT an uppercase letter or a newline.
436: .It r*
437: Zero or more r's, where
438: .Sq r
439: is any regular expression.
440: .It r+
441: One or more r's.
442: .It r?
443: Zero or one r's (that is,
444: .Qq an optional r ) .
445: .It r{2,5}
446: Anywhere from two to five r's.
447: .It r{2,}
448: Two or more r's.
449: .It r{4}
450: Exactly 4 r's.
451: .It {name}
452: The expansion of the
453: .Qq name
454: definition
455: .Pq see above .
456: .It \&"[xyz]\e\&"foo\&"
457: The literal string: [xyz]"foo.
458: .It \eX
459: If
460: .Sq X
461: is an
462: .Sq a ,
463: .Sq b ,
464: .Sq f ,
465: .Sq n ,
466: .Sq r ,
467: .Sq t ,
468: or
469: .Sq v ,
470: then the ANSI-C interpretation of
471: .Sq \eX .
472: Otherwise, a literal
473: .Sq X
474: (used to escape operators such as
475: .Sq * ) .
476: .It \e0
477: A NUL character
478: .Pq ASCII code 0 .
479: .It \e123
480: The character with octal value 123.
481: .It \ex2a
482: The character with hexadecimal value 2a.
483: .It (r)
484: Match an
485: .Sq r ;
486: parentheses are used to override precedence
487: .Pq see below .
488: .It rs
489: The regular expression
490: .Sq r
491: followed by the regular expression
492: .Sq s ;
493: called
494: .Qq concatenation .
495: .It r|s
496: Either an
497: .Sq r
498: or an
499: .Sq s .
500: .It r/s
501: An
502: .Sq r ,
503: but only if it is followed by an
504: .Sq s .
505: The text matched by
506: .Sq s
507: is included when determining whether this rule is the
508: .Qq longest match ,
509: but is then returned to the input before the action is executed.
510: So the action only sees the text matched by
511: .Sq r .
512: This type of pattern is called
513: .Qq trailing context .
514: (There are some combinations of r/s that
515: .Nm
516: cannot match correctly; see notes in the
517: .Sx BUGS
518: section below regarding
519: .Qq dangerous trailing context . )
520: .It ^r
521: An
522: .Sq r ,
523: but only at the beginning of a line
524: (i.e., just starting to scan, or right after a newline has been scanned).
525: .It r$
526: An
527: .Sq r ,
528: but only at the end of a line
529: .Pq i.e., just before a newline .
530: Equivalent to
531: .Qq r/\en .
532: .Pp
533: Note that
534: .Nm flex Ns 's
535: notion of
536: .Qq newline
537: is exactly whatever the C compiler used to compile
538: .Nm
539: interprets
540: .Sq \en
541: as.
542: .\" In particular, on some DOS systems you must either filter out \er's in the
543: .\" input yourself, or explicitly use r/\er\en for
544: .\" .Qq r$ .
545: .It <s>r
546: An
547: .Sq r ,
548: but only in start condition
549: .Sq s
550: .Pq see below for discussion of start conditions .
551: .It <s1,s2,s3>r
552: The same, but in any of start conditions s1, s2, or s3.
553: .It <*>r
554: An
555: .Sq r
556: in any start condition, even an exclusive one.
557: .It <<EOF>>
558: An end-of-file.
559: .It <s1,s2><<EOF>>
560: An end-of-file when in start condition s1 or s2.
561: .El
562: .Pp
1.1 deraadt 563: Note that inside of a character class, all regular expression operators
1.16 jmc 564: lose their special meaning except escape
565: .Pq Sq \e
566: and the character class operators,
567: .Sq - ,
568: .Sq ]\& ,
569: and, at the beginning of the class,
570: .Sq ^ .
571: .Pp
1.1 deraadt 572: The regular expressions listed above are grouped according to
573: precedence, from highest precedence at the top to lowest at the bottom.
1.16 jmc 574: Those grouped together have equal precedence.
575: For example,
576: .Pp
577: .D1 foo|bar*
578: .Pp
1.1 deraadt 579: is the same as
1.16 jmc 580: .Pp
581: .D1 (foo)|(ba(r*))
582: .Pp
583: since the
584: .Sq *
585: operator has higher precedence than concatenation,
586: and concatenation higher than alternation
587: .Pq Sq |\& .
588: This pattern therefore matches
589: .Em either
590: the string
591: .Qq foo
592: .Em or
593: the string
594: .Qq ba
595: followed by zero-or-more r's.
596: To match
597: .Qq foo
598: or zero-or-more "bar"'s,
599: use:
600: .Pp
601: .D1 foo|(bar)*
602: .Pp
1.1 deraadt 603: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 jmc 604: .Pp
605: .D1 (foo|bar)*
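.Pp
The three groupings can be tried outside of flex with the POSIX regcomp(3) interface,
whose extended regular expressions rank repetition, concatenation, and alternation in the same order.
The following is a minimal sketch under that assumption, not a model of the flex matcher;
the helper name whole_match is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if the whole of text matches the POSIX extended regular
 * expression ere, 0 if it does not, and -1 if ere fails to compile.
 * The caller supplies the ^( and )$ anchoring.
 */
int
whole_match(const char *ere, const char *text)
{
	regex_t re;
	int rc;

	assert(ere != NULL && text != NULL);
	if (regcomp(&re, ere, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With this helper, "^(foo|bar*)$" should accept "barrr" but reject "barbar",
"^(foo|(bar)*)$" should accept "barbar" but reject "foobar",
and "^((foo|bar)*)$" should accept "foobar".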
606: .Pp
1.1 deraadt 607: In addition to characters and ranges of characters, character classes
608: can also contain character class
1.16 jmc 609: .Em expressions .
1.1 deraadt 610: These are expressions enclosed inside
1.16 jmc 611: .Sq [:
612: and
613: .Sq :]
614: delimiters (which themselves must appear between the
1.26 schwarze 615: .Sq \&[
1.1 deraadt 616: and
1.16 jmc 617: .Sq ]\&
618: of the
1.1 deraadt 619: character class; other elements may occur inside the character class, too).
620: The valid expressions are:
1.16 jmc 621: .Bd -unfilled -offset indent
622: [:alnum:] [:alpha:] [:blank:]
623: [:cntrl:] [:digit:] [:graph:]
624: [:lower:] [:print:] [:punct:]
625: [:space:] [:upper:] [:xdigit:]
626: .Ed
627: .Pp
1.1 deraadt 628: These expressions all designate a set of characters equivalent to
629: the corresponding standard C
1.16 jmc 630: .Fn isXXX
631: function.
632: For example, [:alnum:] designates those characters for which
633: .Xr isalnum 3
634: returns true \- i.e., any alphabetic or numeric.
1.1 deraadt 635: Some systems don't provide
1.16 jmc 636: .Xr isblank 3 ,
637: so
638: .Nm
639: defines [:blank:] as a blank or a tab.
640: .Pp
1.1 deraadt 641: For example, the following character classes are all equivalent:
1.16 jmc 642: .Bd -unfilled -offset indent
643: [[:alnum:]]
644: [[:alpha:][:digit:]]
645: [[:alpha:]0-9]
646: [a-zA-Z0-9]
647: .Ed
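.Pp
Since the class expressions are defined in terms of the C isXXX functions,
the equivalence can be spot-checked with a short C sketch.
It assumes the default "C" locale and an ASCII-compatible character set;
the function name class_mismatches is hypothetical.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Count the byte values on which [[:alnum:]], [[:alpha:][:digit:]]
 * and [a-zA-Z0-9] disagree, as modelled by the <ctype.h> tests.
 * No setlocale() call is made, so the "C" locale applies.
 */
int
class_mismatches(void)
{
	int c, bad = 0;

	for (c = 0; c <= 255; c++) {
		int alnum = isalnum(c) != 0;
		int alpha_digit = isalpha(c) != 0 || isdigit(c) != 0;
		/* Assumption: letters and digits are contiguous, as in ASCII. */
		int ranges = (c >= 'a' && c <= 'z') ||
		    (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

		if (alnum != alpha_digit || alnum != ranges)
			bad++;
	}
	return bad;
}
```

Under the stated assumptions the function should return 0;
in other locales the explicit ranges may no longer agree with the class expressions.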
648: .Pp
649: If the scanner is case-insensitive (the
650: .Fl i
651: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
652: .Pp
1.1 deraadt 653: Some notes on patterns:
1.16 jmc 654: .Bl -dash
655: .It
656: A negated character class such as the example
657: .Qq [^A-Z]
658: above will match a newline unless "\en"
659: .Pq or an equivalent escape sequence
660: is one of the characters explicitly present in the negated character class
661: (e.g.,
662: .Qq [^A-Z\en] ) .
663: This is unlike how many other regular expression tools treat negated character
664: classes, but unfortunately the inconsistency is historically entrenched.
665: Matching newlines means that a pattern like
666: .Qq [^"]*
667: can match the entire input unless there's another quote in the input.
668: .It
669: A rule can have at most one instance of trailing context
670: (the
671: .Sq /
672: operator or the
673: .Sq $
674: operator).
675: The start condition,
676: .Sq ^ ,
677: and
678: .Qq <<EOF>>
1.40 jmc 679: patterns can only occur at the beginning of a pattern and, like
1.16 jmc 680: .Sq /
681: and
682: .Sq $ ,
683: cannot be grouped inside parentheses.
684: A
685: .Sq ^
686: which does not occur at the beginning of a rule or a
687: .Sq $
688: which does not occur at the end of a rule loses its special properties
689: and is treated as a normal character.
690: .It
1.1 deraadt 691: The following are illegal:
1.16 jmc 692: .Bd -unfilled -offset indent
693: foo/bar$
694: <sc1>foo<sc2>bar
695: .Ed
696: .Pp
697: Note that the first of these can be written
698: .Qq foo/bar\en .
699: .It
700: The following will result in
701: .Sq $
702: or
703: .Sq ^
704: being treated as a normal character:
705: .Bd -unfilled -offset indent
706: foo|(bar$)
707: foo|^bar
708: .Ed
709: .Pp
710: If what's wanted is a
711: .Qq foo
712: or a bar-followed-by-a-newline, the following could be used
713: (the special
714: .Sq |\&
715: action is explained below):
716: .Bd -unfilled -offset indent
717: foo |
718: bar$ /* action goes here */
719: .Ed
720: .Pp
1.1 deraadt 721: A similar trick will work for matching a foo or a
722: bar-at-the-beginning-of-a-line.
1.16 jmc 723: .El
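.Pp
The first note above, that a negated character class also matches newlines, can be pictured with the C library function strcspn(3).
This is only an analogy for how far a pattern like [^"]* reaches, not how flex matches;
the helper name span_to_quote is hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Length of the text that a pattern like [^"]* would cover at the
 * start of s: every byte up to, but not including, the first double
 * quote.  Newlines get no special treatment, as in the note above.
 */
size_t
span_to_quote(const char *s)
{
	assert(s != NULL);
	return strcspn(s, "\"");
}
```

For a string holding ab, a newline, cd, and then a double quote, the span is five bytes: the newline does not stop it.
If the string contains no double quote at all, the span covers the whole string.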
724: .Sh HOW THE INPUT IS MATCHED
725: When the generated scanner is run,
726: it analyzes its input looking for strings which match any of its patterns.
727: If it finds more than one match,
728: it takes the one matching the most text
729: (for trailing context rules, this includes the length of the trailing part,
730: even though it will then be returned to the input).
731: If it finds two or more matches of the same length,
732: the rule listed first in the
733: .Nm
1.1 deraadt 734: input file is chosen.
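.Pp
The selection rule just described (longest match wins, earliest rule breaks ties) can be sketched in plain C.
This is a simplified model that treats a length of zero as "no match" and is not taken from the generated scanner;
the name pick_rule is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * lens[i] is the number of characters rule i could match at the
 * current position (0 means no match); rules are in file order.
 * Return the index of the winning rule, or -1 if none matched.
 */
int
pick_rule(const size_t *lens, int nrules)
{
	int i, best = -1;

	assert(lens != NULL || nrules == 0);
	for (i = 0; i < nrules; i++) {
		if (lens[i] == 0)
			continue;
		/* Strictly longer wins; on a tie the earlier rule stays. */
		if (best == -1 || lens[i] > lens[best])
			best = i;
	}
	return best;
}
```

In the toy Pascal scanner above, the input "if" is matched with length 2 by both the keyword rule and the {ID} rule, so the keyword rule, listed first, is chosen;
an input such as "iffy" is matched with length 4 by {ID} alone, so {ID} wins.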
1.16 jmc 735: .Pp
1.1 deraadt 736: Once the match is determined, the text corresponding to the match
737: (called the
1.16 jmc 738: .Em token )
1.1 deraadt 739: is made available in the global character pointer
1.16 jmc 740: .Fa yytext ,
1.1 deraadt 741: and its length in the global integer
1.16 jmc 742: .Fa yyleng .
1.1 deraadt 743: The
1.16 jmc 744: .Em action
745: corresponding to the matched pattern is then executed
746: .Pq a more detailed description of actions follows ,
747: and then the remaining input is scanned for another match.
748: .Pp
749: If no match is found, then the default rule is executed:
750: the next character in the input is considered matched and
751: copied to the standard output.
752: Thus, the simplest legal
753: .Nm
1.1 deraadt 754: input is:
1.16 jmc 755: .Pp
756: .D1 %%
757: .Pp
758: which generates a scanner that simply copies its input
759: .Pq one character at a time
760: to its output.
761: .Pp
1.1 deraadt 762: Note that
1.16 jmc 763: .Fa yytext
764: can be defined in two different ways:
765: either as a character pointer or as a character array.
766: Which definition
767: .Nm
768: uses can be controlled by including one of the special directives
769: .Dq %pointer
770: or
771: .Dq %array
772: in the first
773: .Pq definitions
774: section of flex input.
775: The default is
776: .Dq %pointer ,
777: unless the
778: .Fl l
1.36 schwarze 779: .Nm lex
780: compatibility option is used, in which case
1.16 jmc 781: .Fa yytext
1.1 deraadt 782: will be an array.
783: The advantage of using
1.16 jmc 784: .Dq %pointer
1.1 deraadt 785: is substantially faster scanning and no buffer overflow when matching
1.16 jmc 786: very large tokens
787: .Pq unless not enough dynamic memory is available .
788: The disadvantage is that actions are restricted in how they can modify
789: .Fa yytext
790: .Pq see the next section ,
791: and calls to the
792: .Fn unput
1.10 deraadt 793: function destroy the present contents of
1.16 jmc 794: .Fa yytext ,
1.1 deraadt 795: which can be a considerable porting headache when moving between different
1.16 jmc 796: .Nm lex
1.1 deraadt 797: versions.
1.16 jmc 798: .Pp
1.1 deraadt 799: The advantage of
1.16 jmc 800: .Dq %array
801: is that
802: .Fa yytext
803: can be modified as much as wanted, and calls to
804: .Fn unput
1.1 deraadt 805: do not destroy
1.16 jmc 806: .Fa yytext
807: .Pq see below .
808: Furthermore, existing
809: .Nm lex
1.1 deraadt 810: programs sometimes access
1.16 jmc 811: .Fa yytext
1.1 deraadt 812: externally using declarations of the form:
1.16 jmc 813: .Pp
814: .D1 extern char yytext[];
815: .Pp
1.1 deraadt 816: This definition is erroneous when used with
1.16 jmc 817: .Dq %pointer ,
1.1 deraadt 818: but correct for
1.16 jmc 819: .Dq %array .
820: .Pp
821: .Dq %array
1.1 deraadt 822: defines
1.16 jmc 823: .Fa yytext
1.1 deraadt 824: to be an array of
1.16 jmc 825: .Dv YYLMAX
826: characters, which defaults to a fairly large value.
827: The size can be changed by simply #define'ing
828: .Dv YYLMAX
829: to a different value in the first section of
830: .Nm
831: input.
832: As mentioned above, with
833: .Dq %pointer
834: yytext grows dynamically to accommodate large tokens.
835: While this means a
836: .Dq %pointer
837: scanner can accommodate very large tokens
838: .Pq such as matching entire blocks of comments ,
839: bear in mind that each time the scanner must resize
840: .Fa yytext
1.1 deraadt 841: it also must rescan the entire token from the beginning, so matching such
842: tokens can prove slow.
1.16 jmc 843: .Fa yytext
844: presently does not dynamically grow if a call to
845: .Fn unput
1.1 deraadt 846: results in too much text being pushed back; instead, a run-time error results.
1.16 jmc 847: .Pp
848: Also note that
849: .Dq %array
850: cannot be used with C++ scanner classes
851: .Pq the c++ option; see below .
852: .Sh ACTIONS
853: Each pattern in a rule has a corresponding action,
854: which can be any arbitrary C statement.
855: The pattern ends at the first non-escaped whitespace character;
856: the remainder of the line is its action.
857: If the action is empty,
858: then when the pattern is matched the input token is simply discarded.
859: For example, here is the specification for a program
860: which deletes all occurrences of
861: .Qq zap me
862: from its input:
863: .Bd -literal -offset indent
864: %%
865: "zap me"
866: .Ed
867: .Pp
1.1 deraadt 868: (It will copy all other characters in the input to the output since
869: they will be matched by the default rule.)
1.16 jmc 870: .Pp
1.1 deraadt 871: Here is a program which compresses multiple blanks and tabs down to
872: a single blank, and throws away whitespace found at the end of a line:
1.16 jmc 873: .Bd -literal -offset indent
874: %%
875: [ \et]+ putchar(' ');
876: [ \et]+$ /* ignore this token */
877: .Ed
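.Pp
For comparison, the effect of these two rules can be approximated in plain C.
The sketch below is a hand-written equivalent under the assumption that the whole input is available as one string;
it is not what flex emits, and the name squeeze_blanks is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy in to out the way the two rules above would: a run of blanks
 * and tabs followed by a newline is dropped, any other run becomes a
 * single blank.  out must have room for strlen(in) + 1 bytes.
 */
void
squeeze_blanks(const char *in, char *out)
{
	assert(in != NULL && out != NULL);
	while (*in != '\0') {
		if (*in == ' ' || *in == '\t') {
			in += strspn(in, " \t");	/* skip the whole run */
			if (*in != '\n')
				*out++ = ' ';
		} else
			*out++ = *in++;
	}
	*out = '\0';
}
```

A run of blanks at the very end of the input, with no newline after it, is kept as a single blank here,
because the
.Sq $
operator is described above as equivalent to a trailing context of a newline.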
878: .Pp
879: If the action contains a
880: .Sq { ,
881: then the action spans until the balancing
882: .Sq }
1.1 deraadt 883: is found, and the action may cross multiple lines.
1.16 jmc 884: .Nm
1.1 deraadt 885: knows about C strings and comments and won't be fooled by braces found
886: within them, but also allows actions to begin with
1.16 jmc 887: .Sq %{
1.1 deraadt 888: and will consider the action to be all the text up to the next
1.16 jmc 889: .Sq %}
890: .Pq regardless of ordinary braces inside the action .
891: .Pp
892: An action consisting solely of a vertical bar
893: .Pq Sq |\&
894: means
895: .Qq same as the action for the next rule .
896: See below for an illustration.
897: .Pp
898: Actions can include arbitrary C code,
899: including return statements to return a value to whatever routine called
900: .Fn yylex .
1.1 deraadt 901: Each time
1.16 jmc 902: .Fn yylex
903: is called, it continues processing tokens from where it last left off
904: until it either reaches the end of the file or executes a return.
905: .Pp
1.1 deraadt 906: Actions are free to modify
1.16 jmc 907: .Fa yytext
908: except for lengthening it
909: (adding characters to its end \- these will overwrite later characters in the
910: input stream).
911: This, however, does not apply when using
912: .Dq %array
913: .Pq see above ;
914: in that case,
915: .Fa yytext
1.1 deraadt 916: may be freely modified in any way.
1.16 jmc 917: .Pp
1.1 deraadt 918: Actions are free to modify
1.16 jmc 919: .Fa yyleng
1.1 deraadt 920: except they should not do so if the action also includes use of
1.16 jmc 921: .Fn yymore
922: .Pq see below .
923: .Pp
1.1 deraadt 924: There are a number of special directives which can be included within
925: an action:
1.16 jmc 926: .Bl -tag -width Ds
927: .It ECHO
928: Copies
929: .Fa yytext
930: to the scanner's output.
931: .It BEGIN
932: Followed by the name of a start condition, places the scanner in the
933: corresponding start condition
934: .Pq see below .
935: .It REJECT
936: Directs the scanner to proceed on to the
937: .Qq second best
938: rule which matched the input
939: .Pq or a prefix of the input .
940: The rule is chosen as described above in
941: .Sx HOW THE INPUT IS MATCHED ,
942: and
943: .Fa yytext
1.1 deraadt 944: and
1.16 jmc 945: .Fa yyleng
1.1 deraadt 946: set up appropriately.
947: It may either be one which matched as much text
948: as the originally chosen rule but came later in the
1.16 jmc 949: .Nm
1.1 deraadt 950: input file, or one which matched less text.
951: For example, the following will both count the
1.16 jmc 952: words in the input and call the routine
953: .Fn special
954: whenever
955: .Qq frob
956: is seen:
957: .Bd -literal -offset indent
958: int word_count = 0;
959: %%
960:
961: frob special(); REJECT;
962: [^ \et\en]+ ++word_count;
963: .Ed
964: .Pp
1.1 deraadt 965: Without the
1.16 jmc 966: .Em REJECT ,
967: any "frob"'s in the input would not be counted as words,
968: since the scanner normally executes only one action per token.
1.1 deraadt 969: Multiple
1.16 jmc 970: .Em REJECT Ns 's
971: are allowed,
972: each one finding the next best choice to the currently active rule.
973: For example, when the following scanner scans the token
974: .Qq abcd ,
975: it will write
976: .Qq abcdabcaba
977: to the output:
978: .Bd -literal -offset indent
979: %%
980: a |
981: ab |
982: abc |
983: abcd ECHO; REJECT;
984: \&.|\en /* eat up any unmatched character */
985: .Ed
986: .Pp
1.1 deraadt 987: (The first three rules share the fourth's action since they use
1.16 jmc 988: the special
989: .Sq |\&
990: action.)
991: .Em REJECT
1.1 deraadt 992: is a particularly expensive feature in terms of scanner performance;
1.16 jmc 993: if it is used in any of the scanner's actions it will slow down
994: all of the scanner's matching.
995: Furthermore,
996: .Em REJECT
1.1 deraadt 997: cannot be used with the
1.16 jmc 998: .Fl Cf
1.1 deraadt 999: or
1.16 jmc 1000: .Fl CF
1001: options
1002: .Pq see below .
1003: .Pp
1.1 deraadt 1004: Note also that unlike the other special actions,
1.16 jmc 1005: .Em REJECT
1.1 deraadt 1006: is a
1.16 jmc 1007: .Em branch ;
1008: code immediately following it in the action will not be executed.
1009: .It yymore()
1010: Tells the scanner that the next time it matches a rule, the corresponding
1011: token should be appended onto the current value of
1012: .Fa yytext
1013: rather than replacing it.
1014: For example, given the input
1015: .Qq mega-kludge
1016: the following will write
1017: .Qq mega-mega-kludge
1018: to the output:
1019: .Bd -literal -offset indent
1020: %%
1021: mega- ECHO; yymore();
1022: kludge ECHO;
1023: .Ed
1024: .Pp
1025: First
1026: .Qq mega-
1027: is matched and echoed to the output.
1028: Then
1029: .Qq kludge
1030: is matched, but the previous
1031: .Qq mega-
1032: is still hanging around at the beginning of
1033: .Fa yytext
1.1 deraadt 1034: so the
1.16 jmc 1035: .Em ECHO
1036: for the
1037: .Qq kludge
1038: rule will actually write
1039: .Qq mega-kludge .
1040: .Pp
1.1 deraadt 1041: Two notes regarding use of
1.16 jmc 1042: .Fn yymore :
1.1 deraadt 1043: First,
1.16 jmc 1044: .Fn yymore
1.1 deraadt 1045: depends on the value of
1.16 jmc 1046: .Fa yyleng
1047: correctly reflecting the size of the current token, so
1048: .Fa yyleng
1049: must not be modified when using
1050: .Fn yymore .
1.1 deraadt 1051: Second, the presence of
1.16 jmc 1052: .Fn yymore
1.1 deraadt 1053: in the scanner's action entails a minor performance penalty in the
1054: scanner's matching speed.
1.16 jmc 1055: .It yyless(n)
1056: Returns all but the first
1057: .Ar n
1.1 deraadt 1058: characters of the current token back to the input stream, where they
1059: will be rescanned when the scanner looks for the next match.
1.16 jmc 1060: .Fa yytext
1.1 deraadt 1061: and
1.16 jmc 1062: .Fa yyleng
1.1 deraadt 1063: are adjusted appropriately (e.g.,
1.16 jmc 1064: .Fa yyleng
1.1 deraadt 1065: will now be equal to
1.16 jmc 1066: .Ar n ) .
1067: For example, on the input
1068: .Qq foobar
1069: the following will write out
1070: .Qq foobarbar :
1071: .Bd -literal -offset indent
1072: %%
1073: foobar ECHO; yyless(3);
1074: [a-z]+ ECHO;
1075: .Ed
1076: .Pp
1.1 deraadt 1077: An argument of 0 to
1.16 jmc 1078: .Fa yyless
1079: will cause the entire current input string to be scanned again.
1080: Unless the way the scanner subsequently processes its input has been changed
1081: (using
1082: .Em BEGIN ,
1083: for example),
1084: this will result in an endless loop.
1085: .Pp
1.1 deraadt 1086: Note that
1.16 jmc 1087: .Fa yyless
1088: is a macro and can only be used in the
1089: .Nm
1090: input file, not from other source files.
1091: .It unput(c)
1092: Puts the character
1093: .Ar c
1094: back into the input stream.
1095: It will be the next character scanned.
1.1 deraadt 1096: The following action will take the current token and cause it
1097: to be rescanned enclosed in parentheses.
1.16 jmc 1098: .Bd -literal -offset indent
1099: {
1100: int i;
1101: char *yycopy;
1102:
1103: /* Copy yytext because unput() trashes yytext */
1104: if ((yycopy = strdup(yytext)) == NULL)
1105: err(1, NULL);
1106: unput(')');
1107: for (i = yyleng - 1; i >= 0; --i)
1108: unput(yycopy[i]);
1109: unput('(');
1110: free(yycopy);
1111: }
1112: .Ed
1113: .Pp
1.1 deraadt 1114: Note that since each
1.16 jmc 1115: .Fn unput
1116: puts the given character back at the beginning of the input stream,
1117: pushing back strings must be done back-to-front.
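.Pp
The ordering rule can be illustrated outside the scanner with a small model.
The following is only a sketch: struct pushback and the pb_* functions are
hypothetical names invented for this example and are not part of the
generated scanner.
It models pushed-back input as a last-in, first-out store, which is why a
string has to be pushed starting from its final character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's pushed-back input. */
struct pushback {
	char buf[64];
	size_t len;		/* characters currently pushed back */
};

/* Model of unput(): c becomes the next character to be read. */
static int
pb_unput(struct pushback *pb, char c)
{
	if (pb->len >= sizeof(pb->buf))
		return -1;	/* model is full */
	pb->buf[pb->len++] = c;
	return 0;
}

/* Model of input(): returns the most recently pushed character. */
static int
pb_input(struct pushback *pb)
{
	if (pb->len == 0)
		return -1;	/* nothing pushed back */
	return (unsigned char)pb->buf[--pb->len];
}

/* Push back a whole string so it is re-read in its original order. */
static int
pb_unput_string(struct pushback *pb, const char *s)
{
	size_t i;

	/* Walk from the last character to the first. */
	for (i = strlen(s); i > 0; i--)
		if (pb_unput(pb, s[i - 1]) == -1)
			return -1;
	return 0;
}
```

After pb_unput_string() is given "abc", successive pb_input() calls in this
model yield 'a', 'b' and 'c' in that order.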
1118: .Pp
1.1 deraadt 1119: An important potential problem when using
1.16 jmc 1120: .Fn unput
1121: is that if using
1122: .Dq %pointer
1123: .Pq the default ,
1124: a call to
1125: .Fn unput
1126: destroys the contents of
1127: .Fa yytext ,
1.1 deraadt 1128: starting with its rightmost character and devouring one character to
1.16 jmc 1129: the left with each call.
1130: If the value of
1131: .Fa yytext
1132: should be preserved after a call to
1133: .Fn unput
1134: .Pq as in the above example ,
1135: it must either first be copied elsewhere, or the scanner must be built using
1136: .Dq %array
1137: instead (see
1138: .Sx HOW THE INPUT IS MATCHED ) .
1139: .Pp
1140: Finally, note that EOF cannot be put back
1.1 deraadt 1141: to attempt to mark the input stream with an end-of-file.
1.16 jmc 1142: .It input()
1143: Reads the next character from the input stream.
1144: For example, the following is one way to eat up C comments:
1145: .Bd -literal -offset indent
1146: %%
1147: "/*" {
1148: int c;
1149:
1150: for (;;) {
1151: while ((c = input()) != '*' && c != EOF)
1152: ; /* eat up text of comment */
1153:
1154: if (c == '*') {
1155: while ((c = input()) == '*')
1156: ;
1157: if (c == '/')
1158: break; /* found the end */
1159: }
1160:
1161: if (c == EOF) {
1162: errx(1, "EOF in comment");
1.1 deraadt 1163: break;
1164: }
1.16 jmc 1165: }
1166: }
1167: .Ed
1168: .Pp
1169: (Note that if the scanner is compiled using C++, then
1170: .Fn input
1.1 deraadt 1171: is instead referred to as
1.16 jmc 1172: .Fn yyinput ,
1173: in order to avoid a name clash with the C++ stream by the name of input.)
1174: .It YY_FLUSH_BUFFER
1175: Flushes the scanner's internal buffer
1176: so that the next time the scanner attempts to match a token,
1177: it will first refill the buffer using
1178: .Dv YY_INPUT
1179: (see
1180: .Sx THE GENERATED SCANNER ,
1181: below).
1182: This action is a special case of the more general
1183: .Fn yy_flush_buffer
1184: function, described below in the section
1185: .Sx MULTIPLE INPUT BUFFERS .
1186: .It yyterminate()
1187: Can be used in lieu of a return statement in an action.
1188: It terminates the scanner and returns a 0 to the scanner's caller, indicating
1189: .Qq all done .
1.1 deraadt 1190: By default,
1.16 jmc 1191: .Fn yyterminate
1192: is also called when an end-of-file is encountered.
1193: It is a macro and may be redefined.
1194: .El
1195: .Sh THE GENERATED SCANNER
1.1 deraadt 1196: The output of
1.16 jmc 1197: .Nm
1.1 deraadt 1198: is the file
1.16 jmc 1199: .Pa lex.yy.c ,
1.1 deraadt 1200: which contains the scanning routine
1.16 jmc 1201: .Fn yylex ,
1202: a number of tables used by it for matching tokens,
1203: and a number of auxiliary routines and macros.
1204: By default,
1205: .Fn yylex
1.1 deraadt 1206: is declared as follows:
1.16 jmc 1207: .Bd -unfilled -offset indent
1208: int yylex()
1209: {
1210: ... various definitions and the actions in here ...
1211: }
1212: .Ed
1213: .Pp
1214: (If the environment supports function prototypes, then it will
1215: be "int yylex(void)".)
1216: This definition may be changed by defining the
1217: .Dv YY_DECL
1218: macro.
1219: For example:
1220: .Bd -literal -offset indent
1221: #define YY_DECL float lexscan(a, b) float a, b;
1222: .Ed
1223: .Pp
1224: would give the scanning routine the name
1225: .Em lexscan ,
1226: returning a float, and taking two floats as arguments.
1227: Note that if arguments are given to the scanning routine using a
1228: K&R-style/non-prototyped function declaration,
1229: the definition must be terminated with a semi-colon
1230: .Pq Sq ;\& .
1231: .Pp
1.1 deraadt 1232: Whenever
1.16 jmc 1233: .Fn yylex
1.1 deraadt 1234: is called, it scans tokens from the global input file
1.16 jmc 1235: .Pa yyin
1236: .Pq which defaults to stdin .
1237: It continues until it either reaches an end-of-file
1238: .Pq at which point it returns the value 0
1239: or one of its actions executes a
1240: .Em return
1.1 deraadt 1241: statement.
1.16 jmc 1242: .Pp
1.1 deraadt 1243: If the scanner reaches an end-of-file, subsequent calls are undefined
1244: unless either
1.16 jmc 1245: .Em yyin
1246: is pointed at a new input file
1247: .Pq in which case scanning continues from that file ,
1248: or
1249: .Fn yyrestart
1.1 deraadt 1250: is called.
1.16 jmc 1251: .Fn yyrestart
1.1 deraadt 1252: takes one argument, a
1.16 jmc 1253: .Fa FILE *
1254: pointer (which can be nil, if
1255: .Dv YY_INPUT
1256: has been set up to scan from a source other than
1257: .Em yyin ) ,
1.1 deraadt 1258: and initializes
1.16 jmc 1259: .Em yyin
1260: for scanning from that file.
1261: Essentially there is no difference between just assigning
1262: .Em yyin
1.1 deraadt 1263: to a new input file or using
1.16 jmc 1264: .Fn yyrestart
1265: to do so; the latter is available for compatibility with previous versions of
1266: .Nm ,
1.1 deraadt 1267: and because it can be used to switch input files in the middle of scanning.
1.16 jmc 1268: It can also be used to throw away the current input buffer,
1269: by calling it with an argument of
1270: .Em yyin ;
1.1 deraadt 1271: but it is better to use
1.16 jmc 1272: .Dv YY_FLUSH_BUFFER
1273: .Pq see above .
1.1 deraadt 1274: Note that
1.16 jmc 1275: .Fn yyrestart
1276: does not reset the start condition to
1277: .Em INITIAL
1278: (see
1279: .Sx START CONDITIONS ,
1280: below).
1281: .Pp
1.1 deraadt 1282: If
1.16 jmc 1283: .Fn yylex
1.1 deraadt 1284: stops scanning due to executing a
1.16 jmc 1285: .Em return
1.1 deraadt 1286: statement in one of the actions, the scanner may then be called again and it
1287: will resume scanning where it left off.
1.16 jmc 1288: .Pp
1289: By default
1290: .Pq and for purposes of efficiency ,
1291: the scanner uses block-reads rather than simple
1292: .Xr getc 3
1.1 deraadt 1293: calls to read characters from
1.16 jmc 1294: .Em yyin .
1.1 deraadt 1295: The nature of how it gets its input can be controlled by defining the
1.16 jmc 1296: .Dv YY_INPUT
1.1 deraadt 1297: macro.
1.16 jmc 1298: .Dv YY_INPUT Ns 's
1299: calling sequence is
1300: .Qq YY_INPUT(buf,result,max_size) .
1301: Its action is to place up to
1302: .Dv max_size
1.1 deraadt 1303: characters in the character array
1.16 jmc 1304: .Em buf
1.1 deraadt 1305: and return in the integer variable
1.16 jmc 1306: .Em result
1307: either the number of characters read or the constant
1308: .Dv YY_NULL
1309: (0 on
1310: .Ux
1311: systems)
1312: to indicate
1313: .Dv EOF .
1314: The default
1315: .Dv YY_INPUT
1316: reads from the global file-pointer
1317: .Qq yyin .
1318: .Pp
1319: A sample definition of
1320: .Dv YY_INPUT
1321: .Pq in the definitions section of the input file :
1322: .Bd -unfilled -offset indent
1323: %{
1324: #define YY_INPUT(buf,result,max_size) \e
1325: { \e
1326: int c = getchar(); \e
1327: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1328: }
1329: %}
1330: .Ed
1331: .Pp
1.1 deraadt 1332: This definition will change the input processing to occur
1333: one character at a time.
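.Pp
The same contract (copy at most max_size characters, report how many were
copied, report 0 at end of input) can be met by a function that reads from
memory.
The following is a minimal sketch under that assumption;
struct mem_source and mem_read() are hypothetical names chosen for this
example, not part of the scanner.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not provided by the scanner. */
struct mem_source {
	const char *data;	/* start of the text */
	size_t len;		/* total number of bytes */
	size_t pos;		/* bytes handed out so far */
};

/*
 * Copy up to max_size bytes into buf and return the count,
 * or 0 (the value YY_NULL has on UNIX systems) once exhausted.
 *
 * A scanner could use it through something like (src is assumed
 * to be a struct mem_source defined by the user):
 *   #define YY_INPUT(buf,result,max_size) \
 *       { result = mem_read(&src, buf, max_size); }
 */
static size_t
mem_read(struct mem_source *src, char *buf, size_t max_size)
{
	size_t n = src->len - src->pos;

	if (n > max_size)
		n = max_size;
	memcpy(buf, src->data + src->pos, n);
	src->pos += n;
	return n;
}
```

Handing out whole blocks rather than single characters keeps the number of
YY_INPUT invocations low.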
1.16 jmc 1334: .Pp
1335: When the scanner receives an end-of-file indication from
1336: .Dv YY_INPUT ,
1.1 deraadt 1337: it then checks the
1.16 jmc 1338: .Fn yywrap
1339: function.
1340: If
1341: .Fn yywrap
1342: returns false
1343: .Pq zero ,
1344: then it is assumed that the function has gone ahead and set up
1345: .Em yyin
1346: to point to another input file, and scanning continues.
1347: If it returns true
1348: .Pq non-zero ,
1349: then the scanner terminates, returning 0 to its caller.
1350: Note that in either case, the start condition remains unchanged;
1351: it does not revert to
1352: .Em INITIAL .
1353: .Pp
1.1 deraadt 1354: If you do not supply your own version of
1.16 jmc 1355: .Fn yywrap ,
1.1 deraadt 1356: then you must either use
1.16 jmc 1357: .Dq %option noyywrap
1.1 deraadt 1358: (in which case the scanner behaves as though
1.16 jmc 1359: .Fn yywrap
1.1 deraadt 1360: returned 1), or you must link with
1.16 jmc 1361: .Fl lfl
1.1 deraadt 1362: to obtain the default version of the routine, which always returns 1.
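.Pp
As an illustration, a user-supplied yywrap() that steps through a list of
file names might look like the sketch below.
The next_file variable is a hypothetical name introduced for this example,
and yyin is declared here only so the fragment compiles on its own;
in a real scanner it is provided by lex.yy.c.
Skipping files that cannot be opened is a choice made for this sketch, not
something the scanner requires.

```c
#include <stdio.h>

/* Stand-in declaration; a real scanner gets yyin from lex.yy.c. */
FILE *yyin;

/* Hypothetical NULL-terminated list of the files still to be scanned. */
static char **next_file;

int
yywrap(void)
{
	/* Done with the previous file (never close stdin here). */
	if (yyin != NULL && yyin != stdin) {
		fclose(yyin);
		yyin = NULL;
	}

	/* Try the remaining names until one can be opened. */
	while (next_file != NULL && *next_file != NULL) {
		if ((yyin = fopen(*next_file++, "r")) != NULL)
			return 0;	/* more input: scanning continues */
	}
	return 1;			/* nothing left: scanner terminates */
}
```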
1.16 jmc 1363: .Pp
1.1 deraadt 1364: Three routines are available for scanning from in-memory buffers rather
1365: than files:
1.16 jmc 1366: .Fn yy_scan_string ,
1367: .Fn yy_scan_bytes ,
1.1 deraadt 1368: and
1.16 jmc 1369: .Fn yy_scan_buffer .
1370: See the discussion of them below in the section
1371: .Sx MULTIPLE INPUT BUFFERS .
1372: .Pp
1.1 deraadt 1373: The scanner writes its
1.16 jmc 1374: .Em ECHO
1.1 deraadt 1375: output to the
1.16 jmc 1376: .Em yyout
1377: global
1378: .Pq default, stdout ,
1379: which may be redefined by the user simply by assigning it to some other
1380: .Va FILE
1.1 deraadt 1381: pointer.
1.16 jmc 1382: .Sh START CONDITIONS
1383: .Nm
1384: provides a mechanism for conditionally activating rules.
1385: Any rule whose pattern is prefixed with
1386: .Qq Aq sc
1387: will only be active when the scanner is in the start condition named
1388: .Qq sc .
1389: For example,
1390: .Bd -literal -offset indent
1391: <STRING>[^"]* { /* eat up the string body ... */
1392: ...
1393: }
1394: .Ed
1395: .Pp
1396: will be active only when the scanner is in the
1397: .Qq STRING
1398: start condition, and
1399: .Bd -literal -offset indent
1400: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1401: ...
1402: }
1403: .Ed
1404: .Pp
1405: will be active only when the current start condition is either
1406: .Qq INITIAL ,
1407: .Qq STRING ,
1408: or
1409: .Qq QUOTE .
1410: .Pp
1411: Start conditions are declared in the definitions
1412: .Pq first
1413: section of the input using unindented lines beginning with either
1414: .Sq %s
1.1 deraadt 1415: or
1.16 jmc 1416: .Sq %x
1.1 deraadt 1417: followed by a list of names.
1418: The former declares
1.16 jmc 1419: .Em inclusive
1.1 deraadt 1420: start conditions, the latter
1.16 jmc 1421: .Em exclusive
1422: start conditions.
1423: A start condition is activated using the
1424: .Em BEGIN
1425: action.
1426: Until the next
1427: .Em BEGIN
1428: action is executed, rules with the given start condition will be active and
1.1 deraadt 1429: rules with other start conditions will be inactive.
1.16 jmc 1430: If the start condition is inclusive,
1.1 deraadt 1431: then rules with no start conditions at all will also be active.
1.16 jmc 1432: If it is exclusive,
1433: then only rules qualified with the start condition will be active.
1.1 deraadt 1434: A set of rules contingent on the same exclusive start condition
1435: describe a scanner which is independent of any of the other rules in the
1.16 jmc 1436: .Nm
1437: input.
1438: Because of this, exclusive start conditions make it easy to specify
1439: .Qq mini-scanners
1.1 deraadt 1440: which scan portions of the input that are syntactically different
1.16 jmc 1441: from the rest
1442: .Pq e.g., comments .
1443: .Pp
1.1 deraadt 1444: If the distinction between inclusive and exclusive start conditions
1445: is still a little vague, here's a simple example illustrating the
1.16 jmc 1446: connection between the two.
1447: The set of rules:
1448: .Bd -literal -offset indent
1449: %s example
1450: %%
1451:
1452: <example>foo do_something();
1453:
1454: bar something_else();
1455: .Ed
1456: .Pp
1.1 deraadt 1457: is equivalent to
1.16 jmc 1458: .Bd -literal -offset indent
1459: %x example
1460: %%
1461:
1462: <example>foo do_something();
1463:
1464: <INITIAL,example>bar something_else();
1465: .Ed
1466: .Pp
1.1 deraadt 1467: Without the
1.16 jmc 1468: .Aq INITIAL,example
1.1 deraadt 1469: qualifier, the
1.16 jmc 1470: .Dq bar
1471: pattern in the second example wouldn't be active
1472: .Pq i.e., couldn't match
1.1 deraadt 1473: when in start condition
1.16 jmc 1474: .Dq example .
1.1 deraadt 1475: If we just used
1.16 jmc 1476: .Aq example
1.1 deraadt 1477: to qualify
1.16 jmc 1478: .Dq bar ,
1.1 deraadt 1479: though, then it would only be active in
1.16 jmc 1480: .Dq example
1.1 deraadt 1481: and not in
1.16 jmc 1482: .Em INITIAL ,
1483: while in the first example it's active in both,
1484: because in the first example the
1485: .Dq example
1486: start condition is an inclusive
1487: .Pq Sq %s
1.1 deraadt 1488: start condition.
1.16 jmc 1489: .Pp
1.1 deraadt 1490: Also note that the special start-condition specifier
1.16 jmc 1491: .Sq Aq *
1492: matches every start condition.
1493: Thus, the above example could also have been written:
1494: .Bd -literal -offset indent
1495: %x example
1496: %%
1497:
1498: <example>foo do_something();
1499:
1500: <*>bar something_else();
1501: .Ed
1502: .Pp
1.1 deraadt 1503: The default rule (to
1.16 jmc 1504: .Em ECHO
1505: any unmatched character) remains active in start conditions.
1506: It is equivalent to:
1507: .Bd -literal -offset indent
1508: <*>.|\en ECHO;
1509: .Ed
1510: .Pp
1511: .Dq BEGIN(0)
1.1 deraadt 1512: returns to the original state where only the rules with
1.16 jmc 1513: no start conditions are active.
1514: This state can also be referred to as the start-condition
1515: .Em INITIAL ,
1516: so
1517: .Dq BEGIN(INITIAL)
1.1 deraadt 1518: is equivalent to
1.16 jmc 1519: .Dq BEGIN(0) .
1.1 deraadt 1520: (The parentheses around the start condition name are not required but
1521: are considered good style.)
1.16 jmc 1522: .Pp
1523: .Em BEGIN
1.1 deraadt 1524: actions can also be given as indented code at the beginning
1.16 jmc 1525: of the rules section.
1526: For example, the following will cause the scanner to enter the
1527: .Qq SPECIAL
1528: start condition whenever
1529: .Fn yylex
1.1 deraadt 1530: is called and the global variable
1.16 jmc 1531: .Fa enter_special
1.1 deraadt 1532: is true:
1.16 jmc 1533: .Bd -literal -offset indent
1534: int enter_special;
1.1 deraadt 1535:
1.16 jmc 1536: %x SPECIAL
1537: %%
1538: if (enter_special)
1.1 deraadt 1539: BEGIN(SPECIAL);
1540:
1.16 jmc 1541: <SPECIAL>blahblahblah
1542: \&...more rules follow...
1543: .Ed
1544: .Pp
1.1 deraadt 1545: To illustrate the uses of start conditions,
1546: here is a scanner which provides two different interpretations
1.16 jmc 1547: of a string like
1548: .Qq 123.456 .
1549: By default it will treat it as three tokens: the integer
1550: .Qq 123 ,
1551: a dot
1552: .Pq Sq .\& ,
1553: and the integer
1554: .Qq 456 .
1.1 deraadt 1555: But if the string is preceded earlier in the line by the string
1.16 jmc 1556: .Qq expect-floats
1557: it will treat it as a single token, the floating-point number 123.456:
1558: .Bd -literal -offset indent
1559: %{
1560: #include <stdlib.h>
1561: %}
1562: %s expect
1563:
1564: %%
1565: expect-floats BEGIN(expect);
1566:
1567: <expect>[0-9]+"."[0-9]+ {
1568: printf("found a float, = %f\en",
1569: atof(yytext));
1570: }
1571: <expect>\en {
1572: /*
1573: * That's the end of the line, so
1574: * we need another "expect-floats"
1575: * before we'll recognize any more
1576: * numbers.
1577: */
1578: BEGIN(INITIAL);
1579: }
1580:
1581: [0-9]+ {
1582: printf("found an integer, = %d\en",
1583: atoi(yytext));
1584: }
1585:
1586: "." printf("found a dot\en");
1587: .Ed
1588: .Pp
1589: Here is a scanner which recognizes
1590: .Pq and discards
1591: C comments while maintaining a count of the current input line:
1592: .Bd -literal -offset indent
1593: %x comment
1594: %%
1595: int line_num = 1;
1596:
1597: "/*" BEGIN(comment);
1598:
1599: <comment>[^*\en]* /* eat anything that's not a '*' */
1600: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1601: <comment>\en ++line_num;
1602: <comment>"*"+"/" BEGIN(INITIAL);
1603: .Ed
1604: .Pp
1.1 deraadt 1605: This scanner goes to a bit of trouble to match as much
1.16 jmc 1606: text as possible with each rule.
1607: In general, when attempting to write a high-speed scanner
1608: try to match as much as possible in each rule, as it's a big win.
1609: .Pp
1.10 deraadt 1610: Note that start-condition names are really integer values and
1.16 jmc 1611: can be stored as such.
1612: Thus, the above could be extended in the following fashion:
1613: .Bd -literal -offset indent
1614: %x comment foo
1615: %%
1616: int line_num = 1;
1617: int comment_caller;
1618:
1619: "/*" {
1620: comment_caller = INITIAL;
1621: BEGIN(comment);
1622: }
1623:
1624: \&...
1625:
1626: <foo>"/*" {
1627: comment_caller = foo;
1628: BEGIN(comment);
1629: }
1630:
1631: <comment>[^*\en]* /* eat anything that's not a '*' */
1632: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1633: <comment>\en ++line_num;
1634: <comment>"*"+"/" BEGIN(comment_caller);
1635: .Ed
1636: .Pp
1637: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1638: the integer-valued
1.16 jmc 1639: .Dv YY_START
1640: macro.
1641: For example, the above assignments to
1642: .Em comment_caller
1.1 deraadt 1643: could instead be written
1.16 jmc 1644: .Pp
1645: .Dl comment_caller = YY_START;
1646: .Pp
1.1 deraadt 1647: Flex provides
1.16 jmc 1648: .Dv YYSTATE
1.1 deraadt 1649: as an alias for
1.16 jmc 1650: .Dv YY_START
1.36 schwarze 1651: (since that is what's used by
1652: .At
1.16 jmc 1653: .Nm lex ) .
1654: .Pp
1655: Note that start conditions do not have their own name-space;
1656: %s's and %x's declare names in the same fashion as #define's.
1657: .Pp
1.1 deraadt 1658: Finally, here's an example of how to match C-style quoted strings using
1.16 jmc 1659: exclusive start conditions, including expanded escape sequences
1660: (but not including checking for a string that's too long):
1661: .Bd -literal -offset indent
1662: %x str
1663:
1664: %%
1665: #define MAX_STR_CONST 1024
1666: char string_buf[MAX_STR_CONST];
1667: char *string_buf_ptr;
1668:
1669: \e" string_buf_ptr = string_buf; BEGIN(str);
1670:
1671: <str>\e" { /* saw closing quote - all done */
1672: BEGIN(INITIAL);
1673: *string_buf_ptr = '\e0';
1674: /*
1675: * return string constant token type and
1676: * value to parser
1677: */
1678: }
1679:
1680: <str>\en {
1681: /* error - unterminated string constant */
1682: /* generate error message */
1683: }
1684:
1685: <str>\e\e[0-7]{1,3} {
1686: /* octal escape sequence */
1687: unsigned int result;
1688:
1689: (void) sscanf(yytext + 1, "%o", &result);
1690:
1691: if (result > 0xff) {
1692: /* error, constant is out-of-bounds */
1693: } else
1694: *string_buf_ptr++ = result;
1695: }
1696:
1697: <str>\e\e[0-9]+ {
1698: /*
1699: * generate error - bad escape sequence; something
1700: * like '\e48' or '\e0777777'
1701: */
1702: }
1703:
1704: <str>\e\en *string_buf_ptr++ = '\en';
1705: <str>\e\et *string_buf_ptr++ = '\et';
1706: <str>\e\er *string_buf_ptr++ = '\er';
1707: <str>\e\eb *string_buf_ptr++ = '\eb';
1708: <str>\e\ef *string_buf_ptr++ = '\ef';
1709:
1710: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1711:
1712: <str>[^\e\e\en\e"]+ {
1713: char *yptr = yytext;
1714:
1715: while (*yptr)
1716: *string_buf_ptr++ = *yptr++;
1717: }
1718: .Ed
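.Pp
The octal-escape handling in the example above can also be written as a
stand-alone helper, which makes the range check easy to test in isolation.
This is only a sketch: decode_octal_escape() is a hypothetical name, and it
uses strtol(3) instead of sscanf(3) so that trailing garbage can be detected.

```c
#include <stdlib.h>

/*
 * Decode the digits of an octal escape, e.g. "101" taken from "\101".
 * Returns the byte value, or -1 if the text is not one to three
 * octal digits or the value does not fit in a single byte.
 */
static int
decode_octal_escape(const char *digits)
{
	char *end;
	long v;

	/* Reject empty input, signs and leading white space up front. */
	if (digits[0] < '0' || digits[0] > '7')
		return -1;

	v = strtol(digits, &end, 8);
	if (*end != '\0' || end - digits > 3 || v > 0xff)
		return -1;
	return (int)v;
}
```

An action for the octal rule could then pass yytext + 1 to this helper and
report an error when it returns -1.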
1719: .Pp
1720: Often, such as in some of the examples above,
1721: a whole bunch of rules are all preceded by the same start condition(s).
1722: .Nm
1.1 deraadt 1723: makes this a little easier and cleaner by introducing a notion of
1724: start condition
1.16 jmc 1725: .Em scope .
1.1 deraadt 1726: A start condition scope is begun with:
1.16 jmc 1727: .Pp
1728: .Dl <SCs>{
1729: .Pp
1.1 deraadt 1730: where
1.16 jmc 1731: .Dq SCs
1732: is a list of one or more start conditions.
1733: Inside the start condition scope, every rule automatically has the prefix
1734: .Aq SCs
1.1 deraadt 1735: applied to it, until a
1.16 jmc 1736: .Sq }
1.1 deraadt 1737: which matches the initial
1.16 jmc 1738: .Sq { .
1.1 deraadt 1739: So, for example,
1.16 jmc 1740: .Bd -literal -offset indent
1741: <ESC>{
1742: "\e\en" return '\en';
1743: "\e\er" return '\er';
1744: "\e\ef" return '\ef';
1745: "\e\e0" return '\e0';
1746: }
1747: .Ed
1748: .Pp
1.1 deraadt 1749: is equivalent to:
1.16 jmc 1750: .Bd -literal -offset indent
1751: <ESC>"\e\en" return '\en';
1752: <ESC>"\e\er" return '\er';
1753: <ESC>"\e\ef" return '\ef';
1754: <ESC>"\e\e0" return '\e0';
1755: .Ed
1756: .Pp
1.1 deraadt 1757: Start condition scopes may be nested.
1.16 jmc 1758: .Pp
1.1 deraadt 1759: Three routines are available for manipulating stacks of start conditions:
1.16 jmc 1760: .Bl -tag -width Ds
1761: .It void yy_push_state(int new_state)
1762: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1763: stack and switches to
1.16 jmc 1764: .Fa new_state
1765: as though
1766: .Dq BEGIN new_state
1767: had been used
1768: .Pq recall that start condition names are also integers .
1769: .It void yy_pop_state()
1770: Pops the top of the stack and switches to it via
1771: .Em BEGIN .
1772: .It int yy_top_state()
1773: Returns the top of the stack without altering the stack's contents.
1774: .El
1775: .Pp
1.1 deraadt 1776: The start condition stack grows dynamically and so has no built-in
1.16 jmc 1777: size limitation.
1778: If memory is exhausted, program execution aborts.
1779: .Pp
1780: To use start condition stacks, scanners must include a
1781: .Dq %option stack
1782: directive (see
1783: .Sx OPTIONS
1784: below).
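.Pp
The behaviour of the three stack routines can be pictured with the following
model.
It is a sketch only: struct sc_state and the sc_* functions are hypothetical
names and do not reflect how the generated scanner stores its state.
Unlike the real routines, which abort when memory is exhausted, this model
reports failure through its return value.

```c
#include <stdlib.h>

/* Hypothetical model of the scanner's start condition state. */
struct sc_state {
	int current;	/* what YY_START would report */
	int *stack;	/* saved start conditions */
	size_t depth;	/* number of saved entries */
	size_t cap;	/* allocated entries */
};

/* Model of yy_push_state(): save the current condition, then switch. */
static int
sc_push(struct sc_state *s, int new_state)
{
	if (s->depth == s->cap) {
		size_t ncap = s->cap ? s->cap * 2 : 8;
		int *p = realloc(s->stack, ncap * sizeof(*p));

		if (p == NULL)
			return -1;
		s->stack = p;
		s->cap = ncap;
	}
	s->stack[s->depth++] = s->current;
	s->current = new_state;
	return 0;
}

/* Model of yy_pop_state(): switch back to the last saved condition. */
static int
sc_pop(struct sc_state *s)
{
	if (s->depth == 0)
		return -1;
	s->current = s->stack[--s->depth];
	return 0;
}

/* Model of yy_top_state(): peek at the saved condition. */
static int
sc_top(const struct sc_state *s)
{
	return s->depth ? s->stack[s->depth - 1] : -1;
}
```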
1785: .Sh MULTIPLE INPUT BUFFERS
1786: Some scanners
1787: (such as those which support
1788: .Qq include
1789: files)
1790: require reading from several input streams.
1791: As
1792: .Nm
1.1 deraadt 1793: scanners do a large amount of buffering, one cannot control
1794: where the next input will be read from by simply writing a
1.16 jmc 1795: .Dv YY_INPUT
1.1 deraadt 1796: which is sensitive to the scanning context.
1.16 jmc 1797: .Dv YY_INPUT
1.1 deraadt 1798: is only called when the scanner reaches the end of its buffer, which
1.16 jmc 1799: may be a long time after scanning a statement such as an
1800: .Qq include
1.1 deraadt 1801: which requires switching the input source.
1.16 jmc 1802: .Pp
1.1 deraadt 1803: To negotiate these sorts of problems,
1.16 jmc 1804: .Nm
1.1 deraadt 1805: provides a mechanism for creating and switching between multiple
1.16 jmc 1806: input buffers.
1807: An input buffer is created by using:
1808: .Pp
1809: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1810: .Pp
1.1 deraadt 1811: which takes a
1.16 jmc 1812: .Fa FILE
1813: pointer and a
1814: .Fa size
1815: and creates a buffer associated with the given file and large enough to hold
1816: .Fa size
1.1 deraadt 1817: characters (when in doubt, use
1.16 jmc 1818: .Dv YY_BUF_SIZE
1819: for the size).
1820: It returns a
1821: .Dv YY_BUFFER_STATE
1822: handle, which may then be passed to other routines
1823: .Pq see below .
1824: The
1825: .Dv YY_BUFFER_STATE
1.1 deraadt 1826: type is a pointer to an opaque
1.16 jmc 1827: .Dq struct yy_buffer_state
1828: structure, so
1829: .Dv YY_BUFFER_STATE
1830: variables may be safely initialized to
1831: .Dq ((YY_BUFFER_STATE) 0)
1832: if desired, and the opaque structure can also be referred to in order to
1833: correctly declare input buffers in source files other than that of scanners.
1834: Note that the
1835: .Fa FILE
1.1 deraadt 1836: pointer in the call to
1.16 jmc 1837: .Fn yy_create_buffer
1.1 deraadt 1838: is only used as the value of
1.16 jmc 1839: .Fa yyin
1.1 deraadt 1840: seen by
1.16 jmc 1841: .Dv YY_INPUT ;
1842: if
1843: .Dv YY_INPUT
1844: is redefined so that it no longer uses
1845: .Fa yyin ,
1846: then a nil
1847: .Fa FILE
1848: pointer can safely be passed to
1849: .Fn yy_create_buffer .
1850: To select a particular buffer to scan:
1851: .Pp
1852: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1853: .Pp
1854: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1855: come from
1.16 jmc 1856: .Fa new_buffer .
1.1 deraadt 1857: Note that
1.16 jmc 1858: .Fn yy_switch_to_buffer
1859: may be used by
1860: .Fn yywrap
1861: to set things up for continued scanning,
1862: instead of opening a new file and pointing
1863: .Fa yyin
1864: at it.
1865: Note also that switching input sources via either
1866: .Fn yy_switch_to_buffer
1867: or
1868: .Fn yywrap
1869: does not change the start condition.
1870: .Pp
1871: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1872: .Pp
1873: is used to reclaim the storage associated with a buffer.
1874: .Pf ( Fa buffer
1.1 deraadt 1875: can be nil, in which case the routine does nothing.)
1.16 jmc 1876: To clear the current contents of a buffer:
1877: .Pp
1878: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1879: .Pp
1.1 deraadt 1880: This function discards the buffer's contents,
1.16 jmc 1881: so the next time the scanner attempts to match a token from the buffer,
1882: it will first fill the buffer anew using
1883: .Dv YY_INPUT .
1884: .Pp
1885: .Fn yy_new_buffer
1.1 deraadt 1886: is an alias for
1.16 jmc 1887: .Fn yy_create_buffer ,
1.1 deraadt 1888: provided for compatibility with the C++ use of
1.16 jmc 1889: .Em new
1.1 deraadt 1890: and
1.16 jmc 1891: .Em delete
1.1 deraadt 1892: for creating and destroying dynamic objects.
1.16 jmc 1893: .Pp
1.1 deraadt 1894: Finally, the
1.16 jmc 1895: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1896: macro returns a
1.16 jmc 1897: .Dv YY_BUFFER_STATE
1.1 deraadt 1898: handle to the current buffer.
1.16 jmc 1899: .Pp
1.1 deraadt 1900: Here is an example of using these features for writing a scanner
1901: which expands include files (the
1.16 jmc 1902: .Aq Aq EOF
1.1 deraadt 1903: feature is discussed below):
1.16 jmc 1904: .Bd -literal -offset indent
1905: /*
1906: * the "incl" state is used for picking up the name
1907: * of an include file
1908: */
1909: %x incl
1910:
1911: %{
1912: #define MAX_INCLUDE_DEPTH 10
1913: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1914: int include_stack_ptr = 0;
1915: %}
1916:
1917: %%
1918: include BEGIN(incl);
1919:
1920: [a-z]+ ECHO;
1921: [^a-z\en]*\en? ECHO;
1922:
1923: <incl>[ \et]* /* eat the whitespace */
1924: <incl>[^ \et\en]+ { /* got the include file name */
1925: if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1926: errx(1, "Includes nested too deeply");
1927:
1928: include_stack[include_stack_ptr++] =
1929: YY_CURRENT_BUFFER;
1930:
1931: yyin = fopen(yytext, "r");
1932:
1933: if (yyin == NULL)
1934: err(1, NULL);
1.1 deraadt 1935:
1.16 jmc 1936: yy_switch_to_buffer(
1937: yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1938:
1.16 jmc 1939: BEGIN(INITIAL);
1940: }
1.1 deraadt 1941:
1.16 jmc 1942: <<EOF>> {
1943: if (--include_stack_ptr < 0)
1.1 deraadt 1944: yyterminate();
1.16 jmc 1945: else {
1946: yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1947: yy_switch_to_buffer(
1.16 jmc 1948: include_stack[include_stack_ptr]);
1949: }
1950: }
1951: .Ed
1952: .Pp
1.1 deraadt 1953: Three routines are available for setting up input buffers for
1.16 jmc 1954: scanning in-memory strings instead of files.
1955: All of them create a new input buffer for scanning the string,
1956: and return a corresponding
1957: .Dv YY_BUFFER_STATE
1958: handle (which should be deleted afterwards using
1959: .Fn yy_delete_buffer ) .
1960: They also switch to the new buffer using
1961: .Fn yy_switch_to_buffer ,
1.1 deraadt 1962: so the next call to
1.16 jmc 1963: .Fn yylex
1.1 deraadt 1964: will start scanning the string.
1.16 jmc 1965: .Bl -tag -width Ds
1966: .It yy_scan_string(const char *str)
1967: Scans a NUL-terminated string.
1968: .It yy_scan_bytes(const char *bytes, int len)
1969: Scans
1970: .Fa len
1971: bytes
1972: .Pq including possibly NUL's
1.1 deraadt 1973: starting at location
1.16 jmc 1974: .Fa bytes .
1975: .El
1976: .Pp
1977: Note that both of these functions create and scan a copy
1978: of the string or bytes.
1979: (This may be desirable, since
1980: .Fn yylex
1981: modifies the contents of the buffer it is scanning.)
1982: The copy can be avoided by using:
1983: .Bl -tag -width Ds
1984: .It yy_scan_buffer(char *base, yy_size_t size)
1985: Which scans the buffer starting at
1986: .Fa base ,
1.1 deraadt 1987: consisting of
1.16 jmc 1988: .Fa size
1989: bytes, the last two bytes of which must be
1990: .Dv YY_END_OF_BUFFER_CHAR
1991: .Pq ASCII NUL .
1992: These last two bytes are not scanned; thus, scanning consists of
1993: base[0] through base[size-3], inclusive.
1994: .Pp
1995: If
1996: .Fa base
1997: is not set up in this manner
1998: (i.e., it lacks the final two
1999: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 2000: bytes), then
1.16 jmc 2001: .Fn yy_scan_buffer
1.1 deraadt 2002: returns a nil pointer instead of creating a new input buffer.
1.16 jmc 2003: .Pp
1.1 deraadt 2004: The type
1.16 jmc 2005: .Fa yy_size_t
2006: is an integral type to which one can cast an integer expression
1.1 deraadt 2007: reflecting the size of the buffer.
1.16 jmc 2008: .El
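.Pp
A buffer laid out the way yy_scan_buffer() expects can be prepared with a
helper along the following lines.
This is a sketch under the assumption that the text is copied once into
freshly allocated memory; make_scan_buffer() is a hypothetical name, not a
routine supplied by the scanner.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes of text into a new buffer and append the two
 * NUL bytes required at the end.  The total size (len + 2) is
 * stored in *sizep.  Returns NULL on overflow or allocation failure.
 */
static char *
make_scan_buffer(const char *text, size_t len, size_t *sizep)
{
	char *base;

	if (len > SIZE_MAX - 2)
		return NULL;
	if ((base = malloc(len + 2)) == NULL)
		return NULL;
	memcpy(base, text, len);
	base[len] = '\0';
	base[len + 1] = '\0';
	*sizep = len + 2;
	return base;
}
```

The resulting pointer and size are what would be handed to yy_scan_buffer();
the caller remains responsible for freeing the memory once the buffer has
been deleted.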
2009: .Sh END-OF-FILE RULES
2010: The special rule
2011: .Qq Aq Aq EOF
2012: indicates actions which are to be taken when an end-of-file is encountered and
2013: .Fn yywrap
2014: returns non-zero
2015: .Pq i.e., indicates no further files to process .
2016: The action must finish by doing one of four things:
2017: .Bl -dash
2018: .It
2019: Assigning
2020: .Em yyin
2021: to a new input file
2022: (in previous versions of
2023: .Nm ,
2024: after doing the assignment, it was necessary to call the special action
2025: .Dv YY_NEW_FILE ;
2026: this is no longer necessary).
2027: .It
2028: Executing a
2029: .Em return
2030: statement.
2031: .It
2032: Executing the special
2033: .Fn yyterminate
2034: action.
2035: .It
2036: Switching to a new buffer using
2037: .Fn yy_switch_to_buffer
1.1 deraadt 2038: as shown in the example above.
1.16 jmc 2039: .El
2040: .Pp
2041: .Aq Aq EOF
2042: rules may not be used with other patterns;
2043: they may only be qualified with a list of start conditions.
2044: If an unqualified
2045: .Aq Aq EOF
2046: rule is given, it applies to all start conditions which do not already have
2047: .Aq Aq EOF
2048: actions.
2049: To specify an
2050: .Aq Aq EOF
2051: rule for only the initial start condition, use
2052: .Pp
2053: .Dl <INITIAL><<EOF>>
2054: .Pp
1.1 deraadt 2055: These rules are useful for catching things like unclosed comments.
2056: An example:
1.16 jmc 2057: .Bd -literal -offset indent
2058: %x quote
2059: %%
2060:
2061: \&...other rules for dealing with quotes...
2062:
2063: <quote><<EOF>> {
2064: error("unterminated quote");
2065: yyterminate();
2066: }
2067: <<EOF>> {
2068: if (*++filelist)
2069: yyin = fopen(*filelist, "r");
2070: else
2071: yyterminate();
2072: }
2073: .Ed
2074: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2075: The macro
1.16 jmc 2076: .Dv YY_USER_ACTION
1.1 deraadt 2077: can be defined to provide an action
1.16 jmc 2078: which is always executed prior to the matched rule's action.
2079: For example,
1.1 deraadt 2080: it could be #define'd to call a routine to convert yytext to lower-case.
2081: When
1.16 jmc 2082: .Dv YY_USER_ACTION
1.1 deraadt 2083: is invoked, the variable
1.16 jmc 2084: .Fa yy_act
2085: gives the number of the matched rule
2086: .Pq rules are numbered starting with 1 .
2087: For example, to profile how often each rule is matched,
2088: the following would do the trick:
2089: .Pp
2090: .Dl #define YY_USER_ACTION ++ctr[yy_act]
2091: .Pp
1.1 deraadt 2092: where
1.16 jmc 2093: .Fa ctr
2094: is an array to hold the counts for the different rules.
2095: Note that the macro
2096: .Dv YY_NUM_RULES
2097: gives the total number of rules
2098: (including the default rule, even if
2099: .Fl s
2100: is used),
1.1 deraadt 2101: so a correct declaration for
1.16 jmc 2102: .Fa ctr
1.1 deraadt 2103: is:
1.16 jmc 2104: .Pp
2105: .Dl int ctr[YY_NUM_RULES];
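.Pp
As a rough, untested sketch of how these pieces might fit together,
the scanner below counts matches per rule and prints the totals.
The three rules, the
.Fn main
routine and the output format are made up for illustration.
The array is given one spare element on the assumption that,
with rule numbers starting at 1,
.Fa yy_act
can reach
.Dv YY_NUM_RULES .
.Bd -literal -offset indent
%option noyywrap
%{
/* Indexed by yy_act; slot 0 is unused (assumption: yy_act is 1-based). */
static unsigned long ctr[YY_NUM_RULES + 1];
/* The trailing semicolon makes the macro a complete statement. */
#define YY_USER_ACTION ++ctr[yy_act];
%}
%%
[0-9]+		/* rule 1: numbers */
[a-z]+		/* rule 2: words */
\&.|\en		/* rule 3: anything else */
%%
int
main(void)
{
	int i;

	yylex();
	for (i = 1; i <= YY_NUM_RULES; i++)
		printf("rule %d: %lu\en", i, ctr[i]);
	return 0;
}
.Ed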
2106: .Pp
1.1 deraadt 2107: The macro
1.16 jmc 2108: .Dv YY_USER_INIT
1.1 deraadt 2109: may be defined to provide an action which is always executed before
1.16 jmc 2110: the first scan
2111: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2112: For example, it could be used to call a routine to read
2113: in a data table or open a logging file.
1.16 jmc 2114: .Pp
1.1 deraadt 2115: The macro
1.16 jmc 2116: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2117: can be used to control whether the current buffer is considered
1.16 jmc 2118: .Em interactive .
1.1 deraadt 2119: An interactive buffer is processed more slowly,
2120: but must be used when the scanner's input source is indeed
2121: interactive to avoid problems due to waiting to fill buffers
2122: (see the discussion of the
1.16 jmc 2123: .Fl I
2124: flag below).
2125: A non-zero value in the macro invocation marks the buffer as interactive,
2126: a zero value as non-interactive.
2127: Note that use of this macro overrides
2128: .Dq %option always-interactive
2129: or
2130: .Dq %option never-interactive
2131: (see
2132: .Sx OPTIONS
2133: below).
2134: .Fn yy_set_interactive
1.1 deraadt 2135: must be invoked prior to beginning to scan the buffer that is
1.16 jmc 2136: .Pq or is not
2137: to be considered interactive.
2138: .Pp
1.1 deraadt 2139: The macro
1.16 jmc 2140: .Dv yy_set_bol(at_bol)
1.1 deraadt 2141: can be used to control whether the current buffer's scanning
2142: context for the next token match is done as though at the
1.16 jmc 2143: beginning of a line.
2144: A non-zero macro argument makes rules anchored with
2145: .Sq ^
2146: active, while a zero argument makes
2147: .Sq ^
2148: rules inactive.
2149: .Pp
1.1 deraadt 2150: The macro
1.16 jmc 2151: .Dv YY_AT_BOL
2152: returns true if the next token scanned from the current buffer will have
2153: .Sq ^
2154: rules active, false otherwise.
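.Pp
For instance, a scanner for a hypothetical input format in which a
semicolon separates statements might call
.Fn yy_set_bol
so that a
.Sq #
comment is recognized after a separator as well as at the start of a line.
This is only a sketch; the input format and the rules are assumptions:
.Bd -literal -offset indent
%%
^"#".*		/* comment: discard to end of line */
";"[ \et]*	yy_set_bol(1);	/* match next token as if at line start */
\&.|\en		ECHO;
.Ed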
2155: .Pp
1.1 deraadt 2156: In the generated scanner, the actions are all gathered in one large
2157: switch statement and separated using
1.16 jmc 2158: .Dv YY_BREAK ,
2159: which may be redefined.
2160: By default, it is simply a
2161: .Qq break ,
2162: to separate each rule's action from the following rules.
1.1 deraadt 2163: Redefining
1.16 jmc 2164: .Dv YY_BREAK
1.1 deraadt 2165: allows, for example, C++ users to
1.16 jmc 2166: .Dq #define YY_BREAK
2167: to do nothing
2168: (while being very careful that every rule ends with a
2169: .Qq break
2170: or a
2171: .Qq return ! )
2172: to avoid unreachable statement warnings where, because a rule's
2173: action ends with
2174: .Dq return ,
2175: the
2176: .Dv YY_BREAK
1.1 deraadt 2177: is inaccessible.
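.Pp
A minimal sketch of a scanner written this way follows.
The token name
.Dv TOK_NUMBER
and the header
.Pa y.tab.h
are assumed to come from a yacc grammar;
the rules themselves are illustrative only.
.Bd -literal -offset indent
%{
#include "y.tab.h"
#define YY_BREAK	/* empty: each action supplies its own exit */
%}
%%
[0-9]+		return TOK_NUMBER;
[ \et\en]+	break;			/* skip whitespace */
\&.		{ ECHO; break; }	/* explicit catch-all */
.Ed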
1.16 jmc 2178: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2179: This section summarizes the various values available to the user
2180: in the rule actions.
1.16 jmc 2181: .Bl -tag -width Ds
2182: .It char *yytext
2183: Holds the text of the current token.
2184: It may be modified but not lengthened
2185: .Pq characters cannot be appended to the end .
2186: .Pp
1.1 deraadt 2187: If the special directive
1.16 jmc 2188: .Dq %array
1.1 deraadt 2189: appears in the first section of the scanner description, then
1.16 jmc 2190: .Fa yytext
1.1 deraadt 2191: is instead declared
1.16 jmc 2192: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2193: where
1.16 jmc 2194: .Dv YYLMAX
2195: is a macro definition that can be redefined in the first section
2196: to change the default value
2197: .Pq generally 8KB .
2198: Using
2199: .Dq %array
1.1 deraadt 2200: results in somewhat slower scanners, but the value of
1.16 jmc 2201: .Fa yytext
1.1 deraadt 2202: becomes immune to calls to
1.16 jmc 2203: .Fn input
1.1 deraadt 2204: and
1.16 jmc 2205: .Fn unput ,
1.1 deraadt 2206: which potentially destroy its value when
1.16 jmc 2207: .Fa yytext
2208: is a character pointer.
2209: The opposite of
2210: .Dq %array
1.1 deraadt 2211: is
1.16 jmc 2212: .Dq %pointer ,
1.1 deraadt 2213: which is the default.
1.16 jmc 2214: .Pp
2215: .Dq %array
2216: cannot be used when generating C++ scanner classes
1.1 deraadt 2217: (the
1.16 jmc 2218: .Fl +
1.1 deraadt 2219: flag).
1.16 jmc 2220: .It int yyleng
2221: Holds the length of the current token.
2222: .It FILE *yyin
2223: Is the file which by default
2224: .Nm
2225: reads from.
2226: It may be redefined, but doing so only makes sense before
2227: scanning begins or after an
2228: .Dv EOF
2229: has been encountered.
2230: Changing it in the midst of scanning will have unexpected results since
2231: .Nm
1.1 deraadt 2232: buffers its input; use
1.16 jmc 2233: .Fn yyrestart
1.1 deraadt 2234: instead.
2235: Once scanning terminates because an end-of-file
1.16 jmc 2236: has been seen,
2237: .Fa yyin
2238: can be assigned as the new input file
2239: and the scanner can be called again to continue scanning.
2240: .It void yyrestart(FILE *new_file)
2241: May be called to point
2242: .Fa yyin
2243: at the new input file.
2244: The switch-over to the new file is immediate
2245: .Pq any previously buffered-up input is lost .
2246: Note that calling
2247: .Fn yyrestart
1.1 deraadt 2248: with
1.16 jmc 2249: .Fa yyin
1.1 deraadt 2250: as an argument thus throws away the current input buffer and continues
2251: scanning the same input file.
1.16 jmc 2252: .It FILE *yyout
2253: Is the file to which
2254: .Em ECHO
2255: actions are done.
2256: It can be reassigned by the user.
2257: .It YY_CURRENT_BUFFER
2258: Returns a
2259: .Dv YY_BUFFER_STATE
1.1 deraadt 2260: handle to the current buffer.
1.16 jmc 2261: .It YY_START
2262: Returns an integer value corresponding to the current start condition.
2263: This value can subsequently be used with
2264: .Em BEGIN
1.1 deraadt 2265: to return to that start condition.
1.16 jmc 2266: .El
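.Pp
As an illustration of
.Dv YY_START ,
the following sketch remembers which start condition was active when a
comment began and returns to it when the comment ends.
The start condition
.Dq comment
and the variable
.Fa caller
are names chosen for this example only.
.Bd -literal -offset indent
%x comment
%{
static int caller;	/* start condition in effect before the comment */
%}
%%
"/*"		{ caller = YY_START; BEGIN(comment); }
<comment>"*/"	BEGIN(caller);
<comment>.|\en	/* discard comment text */
.Ed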
2267: .Sh INTERFACING WITH YACC
1.1 deraadt 2268: One of the main uses of
1.16 jmc 2269: .Nm
1.1 deraadt 2270: is as a companion to the
1.16 jmc 2271: .Xr yacc 1
1.1 deraadt 2272: parser-generator.
1.16 jmc 2273: yacc parsers expect to call a routine named
2274: .Fn yylex
2275: to find the next input token.
2276: The routine is supposed to return the type of the next token
2277: as well as putting any associated value in the global
1.17 jmc 2278: .Fa yylval ,
2279: which is defined externally,
2280: and can be a union or any other complex data structure.
1.1 deraadt 2281: To use
1.16 jmc 2282: .Nm
2283: with yacc, one specifies the
2284: .Fl d
2285: option to yacc to instruct it to generate the file
2286: .Pa y.tab.h
1.1 deraadt 2287: containing definitions of all the
1.16 jmc 2288: .Dq %tokens
2289: appearing in the yacc input.
2290: This file is then included in the
2291: .Nm
2292: scanner.
2293: For example, if one of the tokens is
2294: .Qq TOK_NUMBER ,
1.1 deraadt 2295: part of the scanner might look like:
1.16 jmc 2296: .Bd -literal -offset indent
2297: %{
2298: #include "y.tab.h"
2299: %}
2300:
2301: %%
2302:
2303: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
2304: .Ed
2305: .Sh OPTIONS
2306: .Nm
1.1 deraadt 2307: has the following options:
1.16 jmc 2308: .Bl -tag -width Ds
2309: .It Fl 7
2310: Instructs
2311: .Nm
2312: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2313: characters in its input.
2314: The advantage of using
2315: .Fl 7
1.1 deraadt 2316: is that the scanner's tables can be up to half the size of those generated
2317: using the
1.16 jmc 2318: .Fl 8
2319: option
2320: .Pq see below .
2321: The disadvantage is that such scanners often hang
1.1 deraadt 2322: or crash if their input contains an 8-bit character.
1.16 jmc 2323: .Pp
2324: Note, however, that unless generating a scanner using the
2325: .Fl Cf
1.1 deraadt 2326: or
1.16 jmc 2327: .Fl CF
1.1 deraadt 2328: table compression options, use of
1.16 jmc 2329: .Fl 7
2330: will save only a small amount of table space,
2331: and make the scanner considerably less portable.
2332: .Nm flex Ns 's
2333: default behavior is to generate an 8-bit scanner unless
2334: .Fl Cf
2335: or
2336: .Fl CF
2337: is specified, in which case
2338: .Nm
2339: defaults to generating 7-bit scanners unless it was
2340: configured to generate 8-bit scanners
2341: (as will often be the case with non-USA sites).
2342: It is possible to tell whether
2343: .Nm
2344: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2345: .Fl v
2346: output as described below.
2347: .Pp
2348: Note that if
2349: .Fl Cfe
2350: or
2351: .Fl CFe
2352: are used
2353: (the table compression options, but also using equivalence classes as
2354: discussed below),
2355: .Nm
2356: still defaults to generating an 8-bit scanner,
2357: since usually with these compression options full 8-bit tables
1.1 deraadt 2358: are not much more expensive than 7-bit tables.
1.16 jmc 2359: .It Fl 8
2360: Instructs
2361: .Nm
1.1 deraadt 2362: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 jmc 2363: characters.
2364: This flag is only needed for scanners generated using
2365: .Fl Cf
1.1 deraadt 2366: or
1.16 jmc 2367: .Fl CF ,
2368: as otherwise
2369: .Nm
2370: defaults to generating an 8-bit scanner anyway.
2371: .Pp
1.1 deraadt 2372: See the discussion of
1.16 jmc 2373: .Fl 7
2374: above for
2375: .Nm flex Ns 's
2376: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2377: .It Fl B
2378: Instructs
2379: .Nm
2380: to generate a
2381: .Em batch
2382: scanner, the opposite of
2383: .Em interactive
2384: scanners generated by
2385: .Fl I
2386: .Pq see below .
2387: In general,
2388: .Fl B
2389: is used when the scanner will never be used interactively,
2390: and you want to squeeze a little more performance out of it.
2391: If the aim is instead to squeeze out a lot more performance,
2392: use the
2393: .Fl Cf
2394: or
2395: .Fl CF
2396: options
2397: .Pq discussed below ,
2398: which turn on
2399: .Fl B
2400: automatically anyway.
2401: .It Fl b
2402: Generate backing-up information to
2403: .Pa lex.backup .
2404: This is a list of scanner states which require backing up
2405: and the input characters on which they do so.
2406: By adding rules one can remove backing-up states.
2407: If all backing-up states are eliminated and
2408: .Fl Cf
2409: or
2410: .Fl CF
2411: is used, the generated scanner will run faster (see the
2412: .Fl p
2413: flag).
2414: Only users who wish to squeeze every last cycle out of their
2415: scanners need worry about this option.
2416: (See the section on
2417: .Sx PERFORMANCE CONSIDERATIONS
2418: below.)
2419: .It Fl C Ns Op Cm aeFfmr
2420: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2421: between small scanners and fast scanners.
1.16 jmc 2422: .Bl -tag -width Ds
2423: .It Fl Ca
2424: Instructs
2425: .Nm
2426: to trade off larger tables in the generated scanner for faster performance
2427: because the elements of the tables are better aligned for memory access
2428: and computation.
2429: On some
2430: .Tn RISC
2431: architectures, fetching and manipulating longwords is more efficient
2432: than with smaller-sized units such as shortwords.
2433: This option can double the size of the tables used by the scanner.
2434: .It Fl Ce
2435: Directs
2436: .Nm
1.1 deraadt 2437: to construct
1.16 jmc 2438: .Em equivalence classes ,
2439: i.e., sets of characters which have identical lexical properties
2440: (for example, if the only appearance of digits in the
2441: .Nm
1.1 deraadt 2442: input is in the character class
1.16 jmc 2443: .Qq [0-9]
2444: then the digits
2445: .Sq 0 ,
2446: .Sq 1 ,
2447: .Sq ... ,
2448: .Sq 9
2449: will all be put in the same equivalence class).
2450: Equivalence classes usually give dramatic reductions in the final
2451: table/object file sizes
2452: .Pq typically a factor of 2\-5
2453: and are pretty cheap performance-wise
2454: .Pq one array look-up per character scanned .
2455: .It Fl CF
2456: Specifies that the alternate fast scanner representation
2457: (described below under the
2458: .Fl F
2459: option)
2460: should be used.
2461: This option cannot be used with
2462: .Fl + .
2463: .It Fl Cf
2464: Specifies that the
2465: .Em full
2466: scanner tables should be generated \-
2467: .Nm
2468: should not compress the tables by taking advantage of
2469: similar transition functions for different states.
2470: .It Fl \&Cm
2471: Directs
2472: .Nm
1.1 deraadt 2473: to construct
1.16 jmc 2474: .Em meta-equivalence classes ,
2475: which are sets of equivalence classes
2476: (or characters, if equivalence classes are not being used)
2477: that are commonly used together.
2478: Meta-equivalence classes are often a big win when using compressed tables,
2479: but they have a moderate performance impact
2480: (one or two
2481: .Qq if
2482: tests and one array look-up per character scanned).
2483: .It Fl Cr
2484: Causes the generated scanner to
2485: .Em bypass
2486: use of the standard I/O library
2487: .Pq stdio
2488: for input.
2489: Instead of calling
2490: .Xr fread 3
1.1 deraadt 2491: or
1.16 jmc 2492: .Xr getc 3 ,
1.1 deraadt 2493: the scanner will use the
1.16 jmc 2494: .Xr read 2
2495: system call,
2496: resulting in a performance gain which varies from system to system,
2497: but in general is probably negligible unless
2498: .Fl Cf
1.1 deraadt 2499: or
1.16 jmc 2500: .Fl CF
2501: are being used.
1.1 deraadt 2502: Using
1.16 jmc 2503: .Fl Cr
2504: can cause strange behavior if, for example, reading from
2505: .Fa yyin
2506: using stdio prior to calling the scanner
2507: (because the scanner will miss whatever text previous reads left
2508: in the stdio input buffer).
2509: .Pp
2510: .Fl Cr
2511: has no effect if
2512: .Dv YY_INPUT
2513: is defined
2514: (see
2515: .Sx THE GENERATED SCANNER
2516: above).
2517: .El
2518: .Pp
1.1 deraadt 2519: A lone
1.16 jmc 2520: .Fl C
1.1 deraadt 2521: specifies that the scanner tables should be compressed but neither
2522: equivalence classes nor meta-equivalence classes should be used.
1.16 jmc 2523: .Pp
1.1 deraadt 2524: The options
1.16 jmc 2525: .Fl Cf
1.1 deraadt 2526: or
1.16 jmc 2527: .Fl CF
1.1 deraadt 2528: and
1.16 jmc 2529: .Fl \&Cm
2530: do not make sense together \- there is no opportunity for meta-equivalence
2531: classes if the table is not being compressed.
2532: Otherwise the options may be freely mixed, and are cumulative.
2533: .Pp
1.1 deraadt 2534: The default setting is
1.16 jmc 2535: .Fl Cem
1.1 deraadt 2536: which specifies that
1.16 jmc 2537: .Nm
2538: should generate equivalence classes and meta-equivalence classes.
2539: This setting provides the highest degree of table compression.
2540: It is possible to trade off faster-executing scanners at the cost of
2541: larger tables with the following generally being true:
2542: .Bd -unfilled -offset indent
2543: slowest & smallest
2544: -Cem
2545: -Cm
2546: -Ce
2547: -C
2548: -C{f,F}e
2549: -C{f,F}
2550: -C{f,F}a
2551: fastest & largest
2552: .Ed
2553: .Pp
1.1 deraadt 2554: Note that scanners with the smallest tables are usually generated and
1.16 jmc 2555: compiled the quickest,
2556: so during development the default,
2557: maximal compression, is usually best.
2558: .Pp
2559: .Fl Cfe
2560: is often a good compromise between speed and size for production scanners.
2561: .It Fl d
2562: Makes the generated scanner run in debug mode.
2563: Whenever a pattern is recognized and the global
2564: .Fa yy_flex_debug
2565: is non-zero
2566: .Pq which is the default ,
2567: the scanner will write to stderr a line of the form:
2568: .Pp
2569: .D1 --accepting rule at line 53 ("the matched text")
2570: .Pp
2571: The line number refers to the location of the rule in the file
2572: defining the scanner
2573: (i.e., the file that was fed to
2574: .Nm ) .
2575: Messages are also generated when the scanner backs up,
2576: accepts the default rule,
2577: reaches the end of its input buffer
2578: (or encounters a NUL;
2579: at this point, the two look the same as far as the scanner's concerned),
2580: or reaches an end-of-file.
2581: .It Fl F
2582: Specifies that the fast scanner table representation should be used
2583: .Pq and stdio bypassed .
2584: This representation is about as fast as the full table representation
2585: .Pq Fl f ,
2586: and for some sets of patterns will be considerably smaller
2587: .Pq and for others, larger .
2588: In general, if the pattern set contains both
2589: .Qq keywords
2590: and a catch-all,
2591: .Qq identifier
2592: rule, such as in the set:
2593: .Bd -unfilled -offset indent
2594: "case" return TOK_CASE;
2595: "switch" return TOK_SWITCH;
2596: \&...
2597: "default" return TOK_DEFAULT;
2598: [a-z]+ return TOK_ID;
2599: .Ed
2600: .Pp
2601: then it's better to use the full table representation.
2602: If only the
2603: .Qq identifier
2604: rule is present and a hash table or some such is used to detect the keywords,
2605: it's better to use
2606: .Fl F .
2607: .Pp
2608: This option is equivalent to
2609: .Fl CFr
2610: .Pq see above .
2611: It cannot be used with
2612: .Fl + .
2613: .It Fl f
2614: Specifies
2615: .Em fast scanner .
2616: No table compression is done and stdio is bypassed.
2617: The result is large but fast.
2618: This option is equivalent to
2619: .Fl Cfr
2620: .Pq see above .
2621: .It Fl h
2622: Generates a help summary of
2623: .Nm flex Ns 's
2624: options to stdout and then exits.
2625: .Fl ?\&
2626: and
2627: .Fl Fl help
2628: are synonyms for
2629: .Fl h .
2630: .It Fl I
2631: Instructs
2632: .Nm
2633: to generate an
2634: .Em interactive
2635: scanner.
2636: An interactive scanner is one that only looks ahead to decide
2637: what token has been matched if it absolutely must.
2638: It turns out that always looking one extra character ahead,
2639: even if the scanner has already seen enough text
2640: to disambiguate the current token, is a bit faster than
2641: only looking ahead when necessary.
2642: But scanners that always look ahead give dreadful interactive performance;
2643: for example, when a user types a newline,
2644: it is not recognized as a newline token until they enter
2645: .Em another
2646: token, which often means typing in another whole line.
2647: .Pp
2648: .Nm
2649: scanners default to
2650: .Em interactive
2651: unless
2652: .Fl Cf
2653: or
2654: .Fl CF
2655: table-compression options are specified
2656: .Pq see above .
2657: That's because if high performance is most important,
2658: one of these options should be used,
2659: so if they weren't,
2660: .Nm
1.24 sobrado 2661: assumes it is preferable to trade off a bit of run-time performance for
1.16 jmc 2662: intuitive interactive behavior.
2663: Note also that
2664: .Fl I
2665: cannot be used in conjunction with
2666: .Fl Cf
2667: or
2668: .Fl CF .
2669: Thus, this option is not really needed; it is on by default for all those
2670: cases in which it is allowed.
2671: .Pp
2672: A scanner can be forced to not be interactive by using
2673: .Fl B
2674: .Pq see above .
2675: .It Fl i
2676: Instructs
2677: .Nm
2678: to generate a case-insensitive scanner.
2679: The case of letters given in the
2680: .Nm
2681: input patterns will be ignored,
2682: and tokens in the input will be matched regardless of case.
2683: The matched text given in
2684: .Fa yytext
2685: will preserve the case of the input
2686: .Pq i.e., it will not be folded .
2687: .It Fl L
2688: Instructs
2689: .Nm
2690: not to generate
2691: .Dq #line
2692: directives.
2693: Without this option,
2694: .Nm
2695: peppers the generated scanner with #line directives so error messages
2696: in the actions will be correctly located with respect to either the original
2697: .Nm
2698: input file
2699: (if the errors are due to code in the input file),
2700: or
2701: .Pa lex.yy.c
2702: (if the errors are
2703: .Nm flex Ns 's
2704: fault \- these sorts of errors should be reported to the email address
2705: given below).
2706: .It Fl l
1.36 schwarze 2707: Turns on maximum compatibility with the original
2708: .At
1.16 jmc 2709: .Nm lex
2710: implementation.
2711: Note that this does not mean full compatibility.
2712: Use of this option costs a considerable amount of performance,
2713: and it cannot be used with the
2714: .Fl + , f , F , Cf ,
2715: or
2716: .Fl CF
2717: options.
2718: For details on the compatibilities it provides, see the section
2719: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
2720: below.
2721: This option also results in the name
2722: .Dv YY_FLEX_LEX_COMPAT
2723: being #define'd in the generated scanner.
2724: .It Fl n
2725: Another do-nothing, deprecated option included only for
2726: .Tn POSIX
2727: compliance.
2728: .It Fl o Ns Ar output
2729: Directs
2730: .Nm
2731: to write the scanner to the file
2732: .Ar output
1.1 deraadt 2733: instead of
1.16 jmc 2734: .Pa lex.yy.c .
2735: If
2736: .Fl o
2737: is combined with the
2738: .Fl t
2739: option, then the scanner is written to stdout but its
2740: .Dq #line
2741: directives
2742: (see the
2743: .Fl L
2744: option above)
2745: refer to the file
2746: .Ar output .
2747: .It Fl P Ns Ar prefix
2748: Changes the default
2749: .Qq yy
1.1 deraadt 2750: prefix used by
1.16 jmc 2751: .Nm
1.6 aaron 2752: for all globally visible variable and function names to instead be
1.16 jmc 2753: .Ar prefix .
1.1 deraadt 2754: For example,
1.16 jmc 2755: .Fl P Ns Ar foo
1.1 deraadt 2756: changes the name of
1.16 jmc 2757: .Fa yytext
1.1 deraadt 2758: to
1.16 jmc 2759: .Fa footext .
1.1 deraadt 2760: It also changes the name of the default output file from
1.16 jmc 2761: .Pa lex.yy.c
1.1 deraadt 2762: to
1.16 jmc 2763: .Pa lex.foo.c .
1.1 deraadt 2764: Here are all of the names affected:
1.16 jmc 2765: .Bd -unfilled -offset indent
2766: yy_create_buffer
2767: yy_delete_buffer
2768: yy_flex_debug
2769: yy_init_buffer
2770: yy_flush_buffer
2771: yy_load_buffer_state
2772: yy_switch_to_buffer
2773: yyin
2774: yyleng
2775: yylex
2776: yylineno
2777: yyout
2778: yyrestart
2779: yytext
2780: yywrap
2781: .Ed
2782: .Pp
2783: (If using a C++ scanner, then only
2784: .Fa yywrap
1.1 deraadt 2785: and
1.16 jmc 2786: .Fa yyFlexLexer
1.1 deraadt 2787: are affected.)
1.16 jmc 2788: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2789: and functions using either version of their name; but externally, they
2790: have the modified name.
1.16 jmc 2791: .Pp
2792: This option allows multiple
2793: .Nm
2794: programs to be easily linked together into the same executable.
2795: Note, though, that using this option also renames
2796: .Fn yywrap ,
2797: so now either an
2798: .Pq appropriately named
2799: version of the routine for the scanner must be supplied, or
2800: .Dq %option noyywrap
2801: must be used, as linking with
2802: .Fl lfl
2803: no longer provides one by default.
2804: .It Fl p
2805: Generates a performance report to stderr.
2806: The report consists of comments regarding features of the
2807: .Nm
2808: input file which will cause a serious loss of performance in the resulting
2809: scanner.
2810: If the flag is specified twice,
2811: comments regarding features that lead to minor performance losses
2812: will also be reported.
2813: .Pp
2814: Note that the use of
2815: .Em REJECT ,
2816: .Dq %option yylineno ,
2817: and variable trailing context
2818: (see the
2819: .Sx BUGS
2820: section below)
2821: entails a substantial performance penalty; use of
2822: .Fn yymore ,
2823: the
2824: .Sq ^
2825: operator, and the
2826: .Fl I
2827: flag entail minor performance penalties.
2828: .It Fl S Ns Ar skeleton
2829: Overrides the default skeleton file from which
2830: .Nm
2831: constructs its scanners.
2832: This option is needed only for
2833: .Nm
1.1 deraadt 2834: maintenance or development.
1.16 jmc 2835: .It Fl s
2836: Causes the default rule
2837: .Pq that unmatched scanner input is echoed to stdout
2838: to be suppressed.
2839: If the scanner encounters input that does not
2840: match any of its rules, it aborts with an error.
2841: This option is useful for finding holes in a scanner's rule set.
2842: .It Fl T
2843: Makes
2844: .Nm
2845: run in
2846: .Em trace
2847: mode.
2848: It will generate a lot of messages to stderr concerning
2849: the form of the input and the resultant non-deterministic and deterministic
2850: finite automata.
2851: This option is mostly for use in maintaining
2852: .Nm .
2853: .It Fl t
2854: Instructs
2855: .Nm
2856: to write the scanner it generates to standard output instead of
2857: .Pa lex.yy.c .
2858: .It Fl V
2859: Prints the version number to stdout and exits.
2860: .Fl Fl version
2861: is a synonym for
2862: .Fl V .
2863: .It Fl v
2864: Specifies that
2865: .Nm
2866: should write to stderr
2867: a summary of statistics regarding the scanner it generates.
2868: Most of the statistics are meaningless to the casual
2869: .Nm
2870: user, but the first line identifies the version of
2871: .Nm
2872: (same as reported by
2873: .Fl V ) ,
2874: and the next line the flags used when generating the scanner,
2875: including those that are on by default.
2876: .It Fl w
2877: Suppresses warning messages.
2878: .It Fl +
2879: Specifies that
2880: .Nm
2881: should generate a C++ scanner class.
2882: See the section on
2883: .Sx GENERATING C++ SCANNERS
2884: below for details.
2885: .El
2886: .Pp
2887: .Nm
1.1 deraadt 2888: also provides a mechanism for controlling options within the
1.16 jmc 2889: scanner specification itself, rather than from the
2890: .Nm
1.33 jmc 2891: command line.
1.1 deraadt 2892: This is done by including
1.16 jmc 2893: .Dq %option
1.1 deraadt 2894: directives in the first section of the scanner specification.
1.16 jmc 2895: Multiple options can be specified with a single
2896: .Dq %option
2897: directive, and multiple directives in the first section of the
2898: .Nm
2899: input file.
2900: .Pp
2901: Most options are given simply as names, optionally preceded by the word
2902: .Qq no
2903: .Pq with no intervening whitespace
2904: to negate their meaning.
2905: A number are equivalent to
2906: .Nm
2907: flags or their negation:
2908: .Bd -unfilled -offset indent
2909: 7bit -7 option
2910: 8bit -8 option
2911: align -Ca option
2912: backup -b option
2913: batch -B option
2914: c++ -+ option
2915:
2916: caseful or
2917: case-sensitive opposite of -i (default)
2918:
2919: case-insensitive or
2920: caseless -i option
2921:
2922: debug -d option
2923: default opposite of -s option
2924: ecs -Ce option
2925: fast -F option
2926: full -f option
2927: interactive -I option
2928: lex-compat -l option
2929: meta-ecs -Cm option
2930: perf-report -p option
2931: read -Cr option
2932: stdout -t option
2933: verbose -v option
2934: warn opposite of -w option
2935: (use "%option nowarn" for -w)
2936:
2937: array equivalent to "%array"
2938: pointer equivalent to "%pointer" (default)
2939: .Ed
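.Pp
For example, the following two directives, placed in the first section of
a scanner specification, should have the same effect as giving the
.Fl B ,
.Fl s ,
and
.Fl p
flags on the command line.
The particular combination is arbitrary and only meant as a sketch:
.Bd -literal -offset indent
%option batch nodefault
%option perf-report
.Ed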
2940: .Pp
2941: Some %option directives provide features not otherwise available:
2942: .Bl -tag -width Ds
2943: .It always-interactive
2944: Instructs
2945: .Nm
2946: to generate a scanner which always considers its input
2947: .Qq interactive .
2948: Normally, on each new input file the scanner calls
2949: .Fn isatty
2950: in an attempt to determine whether the scanner's input source is interactive
2951: and thus should be read a character at a time.
2952: When this option is used, however, no such call is made.
2953: .It main
2954: Directs
2955: .Nm
2956: to provide a default
2957: .Fn main
1.1 deraadt 2958: program for the scanner, which simply calls
1.16 jmc 2959: .Fn yylex .
1.1 deraadt 2960: This option implies
1.16 jmc 2961: .Dq noyywrap
2962: .Pq see below .
2963: .It never-interactive
2964: Instructs
2965: .Nm
2966: to generate a scanner which never considers its input
2967: .Qq interactive
2968: (again, no call made to
2969: .Fn isatty ) .
1.1 deraadt 2970: This is the opposite of
1.16 jmc 2971: .Dq always-interactive .
2972: .It stack
2973: Enables the use of start condition stacks
2974: (see
2975: .Sx START CONDITIONS
2976: above).
2977: .It stdinit
2978: If set (i.e.,
2979: .Dq %option stdinit ) ,
1.1 deraadt 2980: initializes
1.16 jmc 2981: .Fa yyin
1.1 deraadt 2982: and
1.16 jmc 2983: .Fa yyout
2984: to stdin and stdout, instead of the default of
2985: .Dq nil .
1.1 deraadt 2986: Some existing
1.16 jmc 2987: .Nm lex
2988: programs depend on this behavior, even though it is not compliant with ANSI C,
2989: which does not require stdin and stdout to be compile-time constant.
2990: .It yylineno
2991: Directs
2992: .Nm
1.1 deraadt 2993: to generate a scanner that maintains the number of the current line
2994: read from its input in the global variable
1.16 jmc 2995: .Fa yylineno .
1.1 deraadt 2996: This option is implied by
1.16 jmc 2997: .Dq %option lex-compat .
2998: .It yywrap
2999: If unset (i.e.,
3000: .Dq %option noyywrap ) ,
1.1 deraadt 3001: makes the scanner not call
1.16 jmc 3002: .Fn yywrap
3003: upon an end-of-file, but simply assume that there are no more files to scan
3004: (until the user points
3005: .Fa yyin
1.1 deraadt 3006: at a new file and calls
1.16 jmc 3007: .Fn yylex
1.1 deraadt 3008: again).
1.16 jmc 3009: .El
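.Pp
As an illustration, the following complete specification
.Pq a sketch; the pattern and the message are arbitrary
combines
.Dq main
and
.Dq yylineno
to report the line number of each occurrence of the word TODO
without any hand-written support code:
.Bd -literal -offset indent
%option main yylineno
%%
TODO	printf("line %d: TODO\en", yylineno);
\&.|\en	/* discard everything else */
.Ed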
3010: .Pp
3011: .Nm
3012: scans rule actions to determine whether the
3013: .Em REJECT
3014: or
3015: .Fn yymore
3016: features are being used.
3017: The
3018: .Dq reject
1.1 deraadt 3019: and
1.16 jmc 3020: .Dq yymore
3021: options are available to override its decision as to whether to use the
1.1 deraadt 3022: options, either by setting them (e.g.,
1.16 jmc 3023: .Dq %option reject )
3024: to indicate the feature is indeed used,
3025: or unsetting them to indicate it actually is not used
1.1 deraadt 3026: (e.g.,
1.16 jmc 3027: .Dq %option noyymore ) .
3028: .Pp
3029: Three options take string-delimited values, offset with
3030: .Sq = :
3031: .Pp
3032: .D1 %option outfile="ABC"
3033: .Pp
1.1 deraadt 3034: is equivalent to
1.16 jmc 3035: .Fl o Ns Ar ABC ,
1.1 deraadt 3036: and
1.16 jmc 3037: .Pp
3038: .D1 %option prefix="XYZ"
3039: .Pp
1.1 deraadt 3040: is equivalent to
1.16 jmc 3041: .Fl P Ns Ar XYZ .
1.1 deraadt 3042: Finally,
1.16 jmc 3043: .Pp
3044: .D1 %option yyclass="foo"
3045: .Pp
3046: only applies when generating a C++ scanner
3047: .Pf ( Fl +
3048: option).
3049: It informs
3050: .Nm
3051: that
3052: .Dq foo
3053: has been derived as a subclass of yyFlexLexer, so
3054: .Nm
3055: will place actions in the member function
3056: .Dq foo::yylex()
1.1 deraadt 3057: instead of
1.16 jmc 3058: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3059: It also generates a
1.16 jmc 3060: .Dq yyFlexLexer::yylex()
1.1 deraadt 3061: member function that emits a run-time error (by invoking
1.16 jmc 3062: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3063: if called.
1.16 jmc 3064: See
3065: .Sx GENERATING C++ SCANNERS ,
3066: below, for additional information.
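.Pp
To illustrate the
.Dq outfile
and
.Dq prefix
options, the sketch below
.Pq the names calc and calc_scan.c are hypothetical
writes the scanner to
.Pa calc_scan.c
and renames its entry point to
.Fn calclex ;
.Dq noyywrap
is set because the prefix also renames
.Fn yywrap :
.Bd -literal -offset indent
%option outfile="calc_scan.c" prefix="calc" noyywrap
%%
[0-9]+	return 1;	/* a number was seen */
\&.|\en	/* ignore everything else */
.Ed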
3067: .Pp
3068: A number of options are available for
1.32 jmc 3069: lint
1.16 jmc 3070: purists who want to suppress the appearance of unneeded routines
3071: in the generated scanner.
3072: Each of the following, if unset
1.1 deraadt 3073: (e.g.,
1.16 jmc 3074: .Dq %option nounput ) ,
3075: results in the corresponding routine not appearing in the generated scanner:
3076: .Bd -unfilled -offset indent
3077: input, unput
3078: yy_push_state, yy_pop_state, yy_top_state
3079: yy_scan_buffer, yy_scan_bytes, yy_scan_string
3080: .Ed
3081: .Pp
1.1 deraadt 3082: (though
1.16 jmc 3083: .Fn yy_push_state
3084: and friends won't appear anyway unless
3085: .Dq %option stack
3086: is being used).
3087: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3088: The main design goal of
1.16 jmc 3089: .Nm
3090: is that it generate high-performance scanners.
3091: It has been optimized for dealing well with large sets of rules.
3092: Aside from the effects on scanner speed of the table compression
3093: .Fl C
1.1 deraadt 3094: options outlined above,
1.16 jmc 3095: there are a number of options/actions which degrade performance.
3096: These are, from most expensive to least:
3097: .Bd -unfilled -offset indent
3098: REJECT
3099: %option yylineno
3100: arbitrary trailing context
3101:
3102: pattern sets that require backing up
3103: %array
3104: %option interactive
3105: %option always-interactive
3106:
3107: \&'^' beginning-of-line operator
3108: yymore()
3109: .Ed
3110: .Pp
3111: with the first three all being quite expensive
3112: and the last two being quite cheap.
3113: Note also that
3114: .Fn unput
3115: is implemented as a routine call that potentially does quite a bit of work,
3116: while
3117: .Fn yyless
3118: is a very cheap macro; so to just put back some excess text,
3119: use
3120: .Fn yyless .
3121: .Pp
3122: .Em REJECT
1.1 deraadt 3123: should be avoided at all costs when performance is important.
3124: It is a particularly expensive option.
1.16 jmc 3125: .Pp
1.1 deraadt 3126: Getting rid of backing up is messy and can be an enormous
1.16 jmc 3127: amount of work for a complicated scanner.
3128: In principle, one begins by using the
3129: .Fl b
1.1 deraadt 3130: flag to generate a
1.16 jmc 3131: .Pa lex.backup
3132: file.
3133: For example, on the input
3134: .Bd -literal -offset indent
3135: %%
3136: foo return TOK_KEYWORD;
3137: foobar return TOK_KEYWORD;
3138: .Ed
3139: .Pp
1.1 deraadt 3140: the file looks like:
1.16 jmc 3141: .Bd -literal -offset indent
3142: State #6 is non-accepting -
3143: associated rule line numbers:
3144: 2 3
3145: out-transitions: [ o ]
3146: jam-transitions: EOF [ \e001-n p-\e177 ]
3147:
3148: State #8 is non-accepting -
3149: associated rule line numbers:
3150: 3
3151: out-transitions: [ a ]
3152: jam-transitions: EOF [ \e001-` b-\e177 ]
3153:
3154: State #9 is non-accepting -
3155: associated rule line numbers:
3156: 3
3157: out-transitions: [ r ]
3158: jam-transitions: EOF [ \e001-q s-\e177 ]
3159:
3160: Compressed tables always back up.
3161: .Ed
3162: .Pp
1.1 deraadt 3163: The first few lines tell us that there's a scanner state in
1.16 jmc 3164: which it can make a transition on an
3165: .Sq o
3166: but not on any other character,
3167: and that in that state the currently scanned text does not match any rule.
3168: The state occurs when trying to match the rules found
1.1 deraadt 3169: at lines 2 and 3 in the input file.
1.16 jmc 3170: If the scanner is in that state and then reads something other than an
3171: .Sq o ,
3172: it will have to back up to find a rule which is matched.
3173: With a bit of head-scratching one can see that this must be the
3174: state it's in when it has seen
3175: .Sq fo .
3176: When this has happened, if anything other than another
3177: .Sq o
3178: is seen, the scanner will have to back up to simply match the
3179: .Sq f
3180: .Pq by the default rule .
3181: .Pp
3182: The comment regarding State #8 indicates there's a problem when
3183: .Qq foob
3184: has been scanned.
3185: Indeed, on any character other than an
3186: .Sq a ,
3187: the scanner will have to back up to accept
3188: .Qq foo .
3189: Similarly, the comment for State #9 concerns when
3190: .Qq fooba
3191: has been scanned and an
3192: .Sq r
3193: does not follow.
3194: .Pp
1.1 deraadt 3195: The final comment reminds us that there's no point going to
1.16 jmc 3196: all the trouble of removing backing up from the rules unless we're using
3197: .Fl Cf
1.1 deraadt 3198: or
1.16 jmc 3199: .Fl CF ,
1.1 deraadt 3200: since there's no performance gain doing so with compressed scanners.
1.16 jmc 3201: .Pp
3202: The way to remove the backing up is to add
3203: .Qq error
3204: rules:
3205: .Bd -literal -offset indent
3206: %%
3207: foo return TOK_KEYWORD;
3208: foobar return TOK_KEYWORD;
3209:
3210: fooba |
3211: foob |
3212: fo {
3213: /* false alarm, not really a keyword */
3214: return TOK_ID;
3215: }
3216: .Ed
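.Pp
As a rough illustration of why the error rules help,
the following C sketch models longest-match scanning over a table of
literal rules and reports how far the scanner has to back up.
It is a hand-written model under simplifying assumptions
(literal patterns only, plus the default rule);
.Fn backup_distance
and its parameters are hypothetical names and are not part of
.Nm
or of the scanners it generates.

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of longest-match scanning against a table of literal rules, with
 * the default rule (match any single character) as a fallback.  Returns
 * the number of characters that were scanned beyond the token finally
 * accepted, i.e. how far the scanner had to back up.
 * A hand-written illustration, not code generated by flex.
 */
static size_t
backup_distance(const char *input, const char *const *rules, size_t nrules)
{
	size_t inlen = strlen(input);
	size_t accepted = inlen > 0;	/* default rule: one character */
	size_t scanned = 0;
	size_t n, i;

	for (n = 1; n <= inlen; n++) {
		int alive = 0;	/* some rule can still match at length n */

		for (i = 0; i < nrules; i++) {
			size_t rlen = strlen(rules[i]);

			if (rlen < n || strncmp(rules[i], input, n) != 0)
				continue;
			alive = 1;
			if (rlen == n)
				accepted = n;	/* accepting state */
		}
		if (!alive)
			break;	/* jam transition */
		scanned = n;
	}
	return scanned > accepted ? scanned - accepted : 0;
}
```

.Pp
In this model, with only the rules foo and foobar, the input foobaz yields
a distance of 2: the scanner reads as far as fooba before it jams and must
fall back to foo.
With the three error rules added, every prefix along the way is itself a
match, and the distance is 0.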
3217: .Pp
3218: Eliminating backing up among a list of keywords can also be done using a
3219: .Qq catch-all
3220: rule:
3221: .Bd -literal -offset indent
3222: %%
3223: foo return TOK_KEYWORD;
3224: foobar return TOK_KEYWORD;
3225:
3226: [a-z]+ return TOK_ID;
3227: .Ed
3228: .Pp
1.1 deraadt 3229: This is usually the best solution when appropriate.
1.16 jmc 3230: .Pp
1.1 deraadt 3231: Backing-up messages tend to cascade.
1.16 jmc 3232: With a complicated set of rules it's not uncommon to get hundreds of messages.
3233: If one can decipher them, though,
3234: it often only takes a dozen or so rules to eliminate the backing up
3235: (though it's easy to make a mistake and have an error rule accidentally match
3236: a valid token; a possible future
3237: .Nm
1.1 deraadt 3238: feature will be to automatically add rules to eliminate backing up).
1.16 jmc 3239: .Pp
3240: It's important to keep in mind that the benefits of eliminating
3241: backing up are gained only if
3242: .Em every
3243: instance of backing up is eliminated.
3244: Leaving just one gains nothing.
3245: .Pp
3246: .Em Variable
3247: trailing context
3248: (where both the leading and trailing parts do not have a fixed length)
3249: entails almost the same performance loss as
3250: .Em REJECT
3251: .Pq i.e., substantial .
3252: So when possible a rule like:
3253: .Bd -literal -offset indent
3254: %%
3255: mouse|rat/(cat|dog) run();
3256: .Ed
3257: .Pp
1.1 deraadt 3258: is better written:
1.16 jmc 3259: .Bd -literal -offset indent
3260: %%
3261: mouse/cat|dog run();
3262: rat/cat|dog run();
3263: .Ed
3264: .Pp
1.1 deraadt 3265: or as
1.16 jmc 3266: .Bd -literal -offset indent
3267: %%
3268: mouse|rat/cat run();
3269: mouse|rat/dog run();
3270: .Ed
3271: .Pp
3272: Note that here the special
3273: .Sq |\&
3274: action does not provide any savings, and can even make things worse (see
3275: .Sx BUGS
3276: below).
3277: .Pp
1.1 deraadt 3278: Another area where the user can increase a scanner's performance
1.16 jmc 3279: .Pq and one that's easier to implement
3280: arises from the fact that the longer the tokens matched,
3281: the faster the scanner will run.
1.1 deraadt 3282: This is because with long tokens the processing of most input
1.16 jmc 3283: characters takes place in the
3284: .Pq short
3285: inner scanning loop, and does not often have to go through the additional work
3286: of setting up the scanning environment (e.g.,
3287: .Fa yytext )
3288: for the action.
3289: Recall the scanner for C comments:
3290: .Bd -literal -offset indent
3291: %x comment
3292: %%
3293: int line_num = 1;
3294:
3295: "/*" BEGIN(comment);
3296:
3297: <comment>[^*\en]*
3298: <comment>"*"+[^*/\en]*
3299: <comment>\en ++line_num;
3300: <comment>"*"+"/" BEGIN(INITIAL);
3301: .Ed
3302: .Pp
1.1 deraadt 3303: This could be sped up by writing it as:
1.16 jmc 3304: .Bd -literal -offset indent
3305: %x comment
3306: %%
3307: int line_num = 1;
3308:
3309: "/*" BEGIN(comment);
3310:
3311: <comment>[^*\en]*
3312: <comment>[^*\en]*\en ++line_num;
3313: <comment>"*"+[^*/\en]*
3314: <comment>"*"+[^*/\en]*\en ++line_num;
3315: <comment>"*"+"/" BEGIN(INITIAL);
3316: .Ed
3317: .Pp
3318: Now instead of each newline requiring the processing of another action,
3319: recognizing the newlines is
3320: .Qq distributed
3321: over the other rules to keep the matched text as long as possible.
3322: Note that adding rules does
3323: .Em not
3324: slow down the scanner!
3325: The speed of the scanner is independent of the number of rules or
3326: (modulo the considerations given at the beginning of this section)
3327: how complicated the rules are with regard to operators such as
3328: .Sq *
3329: and
3330: .Sq |\& .
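.Pp
The saving can be estimated with a small C model of the two rule sets.
The sketch below is not generated by
.Nm ;
the function
.Fn count_matches
and its
.Fa merge_newlines
parameter are hypothetical, and the model only counts how many tokens are
matched (and hence how often the scanner must set up for an action)
while scanning the body of one comment.

```c
#include <stddef.h>

/*
 * Model of the <comment> rules: count how many tokens are matched while
 * scanning a comment body (the text after the opening delimiter) up to
 * and including the closing delimiter.  With merge_newlines == 0 a newline
 * is always a token of its own, as in the first rule set; otherwise a
 * newline that ends a token is folded into it, as in the second rule set.
 * *line_num is incremented once per newline in both cases.
 * A hand-written illustration, not code generated by flex.
 */
static int
count_matches(const char *s, int merge_newlines, int *line_num)
{
	size_t p = 0;
	int matches = 0;

	while (s[p] != '\0') {
		if (s[p] == '*') {
			while (s[p] == '*')
				p++;
			if (s[p] == '/') {
				matches++;	/* closing delimiter */
				break;
			}
			/* stars, then text up to a star, slash or newline */
			while (s[p] != '\0' && s[p] != '*' &&
			    s[p] != '/' && s[p] != '\n')
				p++;
		} else if (s[p] != '\n') {
			/* text up to a star or newline */
			while (s[p] != '\0' && s[p] != '*' && s[p] != '\n')
				p++;
		} else if (!merge_newlines) {
			p++;	/* newline as a token of its own */
			++*line_num;
			matches++;
			continue;
		}
		if (merge_newlines && s[p] == '\n') {
			p++;	/* newline folded into the token */
			++*line_num;
		}
		matches++;
	}
	return matches;
}
```

.Pp
For a comment body consisting of two one-letter lines followed by the
closing delimiter, the model reports 5 matches for the first rule set and
3 for the second, while the line count advances by 2 in both cases.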
3331: .Pp
3332: A final example in speeding up a scanner:
3333: scan through a file containing identifiers and keywords, one per line
3334: and with no other extraneous characters, and recognize all the keywords.
3335: A natural first approach is:
3336: .Bd -literal -offset indent
3337: %%
3338: asm |
3339: auto |
3340: break |
3341: \&... etc ...
3342: volatile |
3343: while /* it's a keyword */
3344:
3345: \&.|\en /* it's not a keyword */
3346: .Ed
3347: .Pp
1.1 deraadt 3348: To eliminate the backing up, introduce a catch-all rule:
1.16 jmc 3349: .Bd -literal -offset indent
3350: %%
3351: asm |
3352: auto |
3353: break |
3354: \&... etc ...
3355: volatile |
3356: while /* it's a keyword */
3357:
3358: [a-z]+ |
3359: \&.|\en /* it's not a keyword */
3360: .Ed
3361: .Pp
1.1 deraadt 3362: Now, if it's guaranteed that there's exactly one word per line,
3363: then we can reduce the total number of matches by a half by
1.16 jmc 3364: merging in the recognition of newlines with that of the other tokens:
3365: .Bd -literal -offset indent
3366: %%
3367: asm\en |
3368: auto\en |
3369: break\en |
3370: \&... etc ...
3371: volatile\en |
3372: while\en /* it's a keyword */
3373:
3374: [a-z]+\en |
3375: \&.|\en /* it's not a keyword */
3376: .Ed
3377: .Pp
3378: One has to be careful here,
3379: as we have now reintroduced backing up into the scanner.
3380: In particular, while we know that there will never be any characters
3381: in the input stream other than letters or newlines,
3382: .Nm
1.1 deraadt 3383: can't figure this out, and it will plan for possibly needing to back up
1.16 jmc 3384: when it has scanned a token like
3385: .Qq auto
3386: and then the next character is something other than a newline or a letter.
3387: Previously it would then just match the
3388: .Qq auto
3389: rule and be done, but now it has no
3390: .Qq auto
3391: rule, only an
3392: .Qq auto\en
3393: rule.
3394: To eliminate the possibility of backing up,
1.40 jmc 3395: we could either duplicate all rules but without final newlines or,
1.1 deraadt 3396: since we never expect to encounter such an input and therefore don't
1.16 jmc 3397: care how it's classified, we can introduce one more catch-all rule,
3398: this one without a trailing newline:
3399: .Bd -literal -offset indent
3400: %%
3401: asm\en |
3402: auto\en |
3403: break\en |
3404: \&... etc ...
3405: volatile\en |
3406: while\en /* it's a keyword */
3407:
3408: [a-z]+\en |
3409: [a-z]+ |
3410: \&.|\en /* it's not a keyword */
3411: .Ed
3412: .Pp
1.1 deraadt 3413: Compiled with
1.16 jmc 3414: .Fl Cf ,
1.1 deraadt 3415: this is about as fast as one can get a
1.16 jmc 3416: .Nm
1.1 deraadt 3417: scanner to go for this particular problem.
1.16 jmc 3418: .Pp
1.1 deraadt 3419: A final note:
1.16 jmc 3420: .Nm
3421: is slow when matching NUL's,
3422: particularly when a token contains multiple NUL's.
3423: It's best to write rules which match short
1.1 deraadt 3424: amounts of text if it's anticipated that the text will often include NUL's.
1.16 jmc 3425: .Pp
1.1 deraadt 3426: Another final note regarding performance: as mentioned above in the section
1.16 jmc 3427: .Sx HOW THE INPUT IS MATCHED ,
3428: dynamically resizing
3429: .Fa yytext
1.1 deraadt 3430: to accommodate huge tokens is a slow process because it presently requires that
1.16 jmc 3431: the
3432: .Pq huge
3433: token be rescanned from the beginning.
3434: Thus if performance is vital, it is better to attempt to match
3435: .Qq large
3436: quantities of text but not
3437: .Qq huge
3438: quantities, where the cutoff between the two is at about 8K characters per token.
3439: .Sh GENERATING C++ SCANNERS
3440: .Nm
3441: provides two different ways to generate scanners for use with C++.
3442: The first way is to simply compile a scanner generated by
3443: .Nm
3444: using a C++ compiler instead of a C compiler.
3445: This should not generate any compilation errors
3446: (please report any found to the email address given in the
3447: .Sx AUTHORS
3448: section below).
3449: C++ code can then be used in rule actions instead of C code.
3450: Note that the default input source for scanners remains
3451: .Fa yyin ,
1.1 deraadt 3452: and default echoing is still done to
1.16 jmc 3453: .Fa yyout .
1.1 deraadt 3454: Both of these remain
1.16 jmc 3455: .Fa FILE *
3456: variables and not C++ streams.
3457: .Pp
3458: .Nm
3459: can also be used to generate a C++ scanner class, using the
3460: .Fl +
1.1 deraadt 3461: option (or, equivalently,
1.16 jmc 3462: .Dq %option c++ ) ,
3463: which is automatically specified if the name of the flex executable ends in a
3464: .Sq + ,
3465: such as
3466: .Nm flex++ .
3467: When using this option,
3468: .Nm
3469: defaults to generating the scanner to the file
3470: .Pa lex.yy.cc
1.1 deraadt 3471: instead of
1.16 jmc 3472: .Pa lex.yy.c .
1.1 deraadt 3473: The generated scanner includes the header file
1.38 bentley 3474: .In g++/FlexLexer.h ,
1.1 deraadt 3475: which defines the interface to two C++ classes.
1.16 jmc 3476: .Pp
1.1 deraadt 3477: The first class,
1.16 jmc 3478: .Em FlexLexer ,
3479: provides an abstract base class defining the general scanner class interface.
3480: It provides the following member functions:
3481: .Bl -tag -width Ds
3482: .It const char* YYText()
3483: Returns the text of the most recently matched token, the equivalent of
3484: .Fa yytext .
3485: .It int YYLeng()
3486: Returns the length of the most recently matched token, the equivalent of
3487: .Fa yyleng .
3488: .It int lineno() const
3489: Returns the current input line number
1.1 deraadt 3490: (see
1.16 jmc 3491: .Dq %option yylineno ) ,
3492: or 1 if
3493: .Dq %option yylineno
1.1 deraadt 3494: was not used.
1.16 jmc 3495: .It void set_debug(int flag)
3496: Sets the debugging flag for the scanner, equivalent to assigning to
3497: .Fa yy_flex_debug
3498: (see the
3499: .Sx OPTIONS
3500: section above).
3501: Note that the scanner must be built using
3502: .Dq %option debug
1.1 deraadt 3503: to include debugging information in it.
1.16 jmc 3504: .It int debug() const
3505: Returns the current setting of the debugging flag.
3506: .El
3507: .Pp
1.1 deraadt 3508: Also provided are member functions equivalent to
1.16 jmc 3509: .Fn yy_switch_to_buffer ,
3510: .Fn yy_create_buffer
1.1 deraadt 3511: (though the first argument is an
1.18 espie 3512: .Fa std::istream*
1.1 deraadt 3513: object pointer and not a
1.16 jmc 3514: .Fa FILE* ) ,
3515: .Fn yy_flush_buffer ,
3516: .Fn yy_delete_buffer ,
1.1 deraadt 3517: and
1.16 jmc 3518: .Fn yyrestart
1.10 deraadt 3519: (again, the first argument is an
1.18 espie 3520: .Fa std::istream*
1.1 deraadt 3521: object pointer).
1.16 jmc 3522: .Pp
1.1 deraadt 3523: The second class defined in
1.38 bentley 3524: .In g++/FlexLexer.h
1.1 deraadt 3525: is
1.16 jmc 3526: .Fa yyFlexLexer ,
1.1 deraadt 3527: which is derived from
1.16 jmc 3528: .Fa FlexLexer .
1.1 deraadt 3529: It defines the following additional member functions:
1.16 jmc 3530: .Bl -tag -width Ds
1.18 espie 3531: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16 jmc 3532: Constructs a
3533: .Fa yyFlexLexer
3534: object using the given streams for input and output.
3535: If not specified, the streams default to
3536: .Fa cin
1.1 deraadt 3537: and
1.16 jmc 3538: .Fa cout ,
1.1 deraadt 3539: respectively.
1.16 jmc 3540: .It virtual int yylex()
3541: Performs the same role as
3542: .Fn yylex
1.1 deraadt 3543: does for ordinary flex scanners: it scans the input stream, consuming
1.16 jmc 3544: tokens, until a rule's action returns a value.
3545: If subclass
3546: .Sq S
3547: is derived from
3548: .Fa yyFlexLexer ,
3549: in order to access the member functions and variables of
3550: .Sq S
1.1 deraadt 3551: inside
1.16 jmc 3552: .Fn yylex ,
3553: use
3554: .Dq %option yyclass="S"
1.1 deraadt 3555: to inform
1.16 jmc 3556: .Nm
3557: that the
3558: .Sq S
3559: subclass will be used instead of
3560: .Fa yyFlexLexer .
1.1 deraadt 3561: In this case, rather than generating
1.16 jmc 3562: .Dq yyFlexLexer::yylex() ,
3563: .Nm
1.1 deraadt 3564: generates
1.16 jmc 3565: .Dq S::yylex()
1.1 deraadt 3566: (and also generates a dummy
1.16 jmc 3567: .Dq yyFlexLexer::yylex()
1.1 deraadt 3568: that calls
1.16 jmc 3569: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3570: if called).
1.18 espie 3571: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16 jmc 3572: Reassigns
3573: .Fa yyin
1.1 deraadt 3574: to
1.16 jmc 3575: .Fa new_in
3576: .Pq if non-null
1.1 deraadt 3577: and
1.16 jmc 3578: .Fa yyout
1.1 deraadt 3579: to
1.16 jmc 3580: .Fa new_out
3581: .Pq ditto ,
3582: deleting the previous input buffer if
3583: .Fa yyin
1.1 deraadt 3584: is reassigned.
1.18 espie 3585: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16 jmc 3586: First switches the input streams via
3587: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3588: and then returns the value of
1.16 jmc 3589: .Fn yylex .
3590: .El
3591: .Pp
1.1 deraadt 3592: In addition,
1.16 jmc 3593: .Fa yyFlexLexer
3594: defines the following protected virtual functions which can be redefined
1.1 deraadt 3595: in derived classes to tailor the scanner:
1.16 jmc 3596: .Bl -tag -width Ds
3597: .It virtual int LexerInput(char* buf, int max_size)
3598: Reads up to
3599: .Fa max_size
1.1 deraadt 3600: characters into
1.16 jmc 3601: .Fa buf
3602: and returns the number of characters read.
3603: To indicate end-of-input, return 0 characters.
3604: Note that
3605: .Qq interactive
3606: scanners (see the
3607: .Fl B
1.1 deraadt 3608: and
1.16 jmc 3609: .Fl I
1.1 deraadt 3610: flags) define the macro
1.16 jmc 3611: .Dv YY_INTERACTIVE .
3612: If
3613: .Fn LexerInput
3614: has been redefined, and it's necessary to take different actions depending on
3615: whether or not the scanner might be scanning an interactive input source,
3616: it's possible to test for the presence of this name via
3617: .Dq #ifdef .
3618: .It virtual void LexerOutput(const char* buf, int size)
3619: Writes out
3620: .Fa size
1.1 deraadt 3621: characters from the buffer
1.16 jmc 3622: .Fa buf ,
3623: which, while NUL-terminated, may also contain
3624: .Qq internal
3625: NUL's if the scanner's rules can match text with NUL's in them.
3626: .It virtual void LexerError(const char* msg)
3627: Reports a fatal error message.
3628: The default version of this function writes the message to the stream
3629: .Fa cerr
1.1 deraadt 3630: and exits.
1.16 jmc 3631: .El
3632: .Pp
1.1 deraadt 3633: Note that a
1.16 jmc 3634: .Fa yyFlexLexer
3635: object contains its entire scanning state.
3636: Thus such objects can be used to create reentrant scanners.
3637: Multiple instances of the same
3638: .Fa yyFlexLexer
3639: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3640: in the same program using the
1.16 jmc 3641: .Fl P
1.1 deraadt 3642: option discussed above.
1.16 jmc 3643: .Pp
1.1 deraadt 3644: Finally, note that the
1.16 jmc 3645: .Dq %array
3646: feature is not available to C++ scanner classes;
3647: .Dq %pointer
3648: must be used
3649: .Pq the default .
3650: .Pp
1.1 deraadt 3651: Here is an example of a simple C++ scanner:
1.16 jmc 3652: .Bd -literal -offset indent
3653: // An example of using the flex C++ scanner class.
1.1 deraadt 3654:
1.16 jmc 3655: %{
3656: #include <errno.h>
3657: int mylineno = 0;
3658: %}
1.1 deraadt 3659:
1.16 jmc 3660: string \e"[^\en"]+\e"
1.1 deraadt 3661:
1.16 jmc 3662: ws [ \et]+
1.1 deraadt 3663:
1.16 jmc 3664: alpha [A-Za-z]
3665: dig [0-9]
3666: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
3667: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
3668: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
3669: number {num1}|{num2}
1.1 deraadt 3670:
1.16 jmc 3671: %%
1.1 deraadt 3672:
1.16 jmc 3673: {ws} /* skip blanks and tabs */
1.1 deraadt 3674:
1.16 jmc 3675: "/*" {
3676: int c;
1.1 deraadt 3677:
1.16 jmc 3678: while ((c = yyinput()) != 0) {
3679: if(c == '\en')
1.1 deraadt 3680: ++mylineno;
1.16 jmc 3681: else if(c == '*') {
3682: if ((c = yyinput()) == '/')
1.1 deraadt 3683: break;
3684: else
3685: unput(c);
3686: }
1.16 jmc 3687: }
3688: }
1.1 deraadt 3689:
1.16 jmc 3690: {number} std::cout << "number " << YYText() << '\en';
1.1 deraadt 3691:
1.16 jmc 3692: \en mylineno++;
1.1 deraadt 3693:
1.16 jmc 3694: {name} std::cout << "name " << YYText() << '\en';
1.1 deraadt 3695:
1.16 jmc 3696: {string} std::cout << "string " << YYText() << '\en';
3697:
3698: %%
3699:
3700: int main(int /* argc */, char** /* argv */)
3701: {
3702: FlexLexer* lexer = new yyFlexLexer;
3703: while(lexer->yylex() != 0)
3704: ;
3705: return 0;
3706: }
3707: .Ed
3708: .Pp
3709: To create multiple
3710: .Pq different
3711: lexer classes, use the
3712: .Fl P
3713: flag
3714: (or the
3715: .Dq prefix=
3716: option)
3717: to rename each
3718: .Fa yyFlexLexer
1.1 deraadt 3719: to some other
1.16 jmc 3720: .Fa xxFlexLexer .
1.38 bentley 3721: .In g++/FlexLexer.h
1.16 jmc 3722: can then be included in other sources once per lexer class, first renaming
3723: .Fa yyFlexLexer
1.1 deraadt 3724: as follows:
1.16 jmc 3725: .Bd -literal -offset indent
3726: #undef yyFlexLexer
3727: #define yyFlexLexer xxFlexLexer
3728: #include <g++/FlexLexer.h>
3729:
3730: #undef yyFlexLexer
3731: #define yyFlexLexer zzFlexLexer
3732: #include <g++/FlexLexer.h>
3733: .Ed
3734: .Pp
3735: This assumes that
3736: .Dq %option prefix="xx"
3737: is used for one scanner and
3738: .Dq %option prefix="zz"
3739: is used for the other.
3740: .Pp
3741: .Sy IMPORTANT :
3742: the present form of the scanning class is experimental
1.7 aaron 3743: and may change considerably between major releases.
1.16 jmc 3744: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
3745: .Nm
1.25 sobrado 3746: is a rewrite of the
3747: .At
1.16 jmc 3748: .Nm lex
3749: tool
3750: (the two implementations do not share any code, though),
3751: with some extensions and incompatibilities, both of which are of concern
3752: to those who wish to write scanners acceptable to either implementation.
3753: .Nm
3754: is fully compliant with the
3755: .Tn POSIX
3756: .Nm lex
1.1 deraadt 3757: specification, except that when using
1.16 jmc 3758: .Dq %pointer
3759: .Pq the default ,
3760: a call to
3761: .Fn unput
1.1 deraadt 3762: destroys the contents of
1.16 jmc 3763: .Fa yytext ,
3764: which is counter to the
3765: .Tn POSIX
3766: specification.
3767: .Pp
3768: In this section we discuss all of the known areas of incompatibility between
3769: .Nm ,
1.36 schwarze 3770: .At
1.16 jmc 3771: .Nm lex ,
3772: and the
3773: .Tn POSIX
3774: specification.
3775: .Pp
3776: .Nm flex Ns 's
3777: .Fl l
1.36 schwarze 3778: option turns on maximum compatibility with the original
3779: .At
1.16 jmc 3780: .Nm lex
1.1 deraadt 3781: implementation, at the cost of a major loss in the generated scanner's
1.16 jmc 3782: performance.
3783: We note below which incompatibilities can be overcome using the
3784: .Fl l
1.1 deraadt 3785: option.
1.16 jmc 3786: .Pp
3787: .Nm
1.1 deraadt 3788: is fully compatible with
1.16 jmc 3789: .Nm lex
1.1 deraadt 3790: with the following exceptions:
1.16 jmc 3791: .Bl -dash
3792: .It
1.1 deraadt 3793: The undocumented
1.16 jmc 3794: .Nm lex
1.1 deraadt 3795: scanner internal variable
1.16 jmc 3796: .Fa yylineno
1.1 deraadt 3797: is not supported unless
1.16 jmc 3798: .Fl l
1.1 deraadt 3799: or
1.16 jmc 3800: .Dq %option yylineno
1.1 deraadt 3801: is used.
1.16 jmc 3802: .Pp
3803: .Fa yylineno
1.1 deraadt 3804: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 jmc 3805: .Pq single global variable
3806: basis.
3807: .Pp
3808: .Fa yylineno
3809: is not part of the
3810: .Tn POSIX
3811: specification.
3812: .It
1.1 deraadt 3813: The
1.16 jmc 3814: .Fn input
1.1 deraadt 3815: routine is not redefinable, though it may be called to read characters
1.16 jmc 3816: following whatever has been matched by a rule.
3817: If
3818: .Fn input
3819: encounters an end-of-file, the normal
3820: .Fn yywrap
3821: processing is done.
3822: A
3823: .Dq real
3824: end-of-file is returned by
3825: .Fn input
1.1 deraadt 3826: as
1.16 jmc 3827: .Dv EOF .
3828: .Pp
1.1 deraadt 3829: Input is instead controlled by defining the
1.16 jmc 3830: .Dv YY_INPUT
1.1 deraadt 3831: macro.
1.16 jmc 3832: .Pp
1.1 deraadt 3833: The
1.16 jmc 3834: .Nm
1.1 deraadt 3835: restriction that
1.16 jmc 3836: .Fn input
3837: cannot be redefined is in accordance with the
3838: .Tn POSIX
3839: specification, which simply does not specify any way of controlling the
1.1 deraadt 3840: scanner's input other than by making an initial assignment to
1.16 jmc 3841: .Fa yyin .
3842: .It
1.1 deraadt 3843: The
1.16 jmc 3844: .Fn unput
3845: routine is not redefinable.
3846: This restriction is in accordance with
3847: .Tn POSIX .
3848: .It
3849: .Nm
1.1 deraadt 3850: scanners are not as reentrant as
1.16 jmc 3851: .Nm lex
3852: scanners.
3853: In particular, if a scanner is interactive and
3854: an interrupt handler long-jumps out of the scanner,
3855: and the scanner is subsequently called again,
3856: the following error message may be displayed:
3857: .Pp
3858: .D1 fatal flex scanner internal error--end of buffer missed
3859: .Pp
1.1 deraadt 3860: To reenter the scanner, first use
1.16 jmc 3861: .Pp
3862: .Dl yyrestart(yyin);
3863: .Pp
3864: Note that this call will throw away any buffered input;
3865: usually this isn't a problem with an interactive scanner.
3866: .Pp
3867: Also note that flex C++ scanner classes are reentrant,
3868: so if using C++ is an option, they should be used instead.
3869: See
3870: .Sx GENERATING C++ SCANNERS
3871: above for details.
3872: .It
3873: .Fn output
1.1 deraadt 3874: is not supported.
3875: Output from the
1.16 jmc 3876: .Em ECHO
1.1 deraadt 3877: macro is done to the file-pointer
1.16 jmc 3878: .Fa yyout
3879: .Pq default stdout .
3880: .Pp
3881: .Fn output
3882: is not part of the
3883: .Tn POSIX
3884: specification.
3885: .It
3886: .Nm lex
3887: does not support exclusive start conditions
3888: .Pq %x ,
3889: though they are in the
3890: .Tn POSIX
3891: specification.
3892: .It
1.1 deraadt 3893: When definitions are expanded,
1.16 jmc 3894: .Nm
1.1 deraadt 3895: encloses them in parentheses.
1.16 jmc 3896: With
3897: .Nm lex ,
3898: the following:
3899: .Bd -literal -offset indent
3900: NAME [A-Z][A-Z0-9]*
3901: %%
3902: foo{NAME}? printf("Found it\en");
3903: %%
3904: .Ed
3905: .Pp
3906: will not match the string
3907: .Qq foo
3908: because when the macro is expanded the rule is equivalent to
3909: .Qq foo[A-Z][A-Z0-9]*?
3910: and the precedence is such that the
3911: .Sq ?\&
3912: is associated with
3913: .Qq [A-Z0-9]* .
3914: With
3915: .Nm ,
1.1 deraadt 3916: the rule will be expanded to
1.16 jmc 3917: .Qq foo([A-Z][A-Z0-9]*)?
3918: and so the string
3919: .Qq foo
3920: will match.
3921: .Pp
1.1 deraadt 3922: Note that if the definition begins with
1.16 jmc 3923: .Sq ^
1.1 deraadt 3924: or ends with
1.16 jmc 3925: .Sq $
3926: then it is not expanded with parentheses, to allow these operators to appear in
3927: definitions without losing their special meanings.
3928: But the
3929: .Sq Aq s ,
3930: .Sq / ,
1.1 deraadt 3931: and
1.16 jmc 3932: .Aq Aq EOF
1.1 deraadt 3933: operators cannot be used in a
1.16 jmc 3934: .Nm
1.1 deraadt 3935: definition.
1.16 jmc 3936: .Pp
1.1 deraadt 3937: Using
1.16 jmc 3938: .Fl l
1.1 deraadt 3939: results in the
1.16 jmc 3940: .Nm lex
1.1 deraadt 3941: behavior of no parentheses around the definition.
1.16 jmc 3942: .Pp
3943: The
3944: .Tn POSIX
3945: specification is that the definition be enclosed in parentheses.
3946: .It
1.1 deraadt 3947: Some implementations of
1.16 jmc 3948: .Nm lex
3949: allow a rule's action to begin on a separate line,
3950: if the rule's pattern has trailing whitespace:
3951: .Bd -literal -offset indent
3952: %%
3953: foo|bar<space here>
3954: { foobar_action(); }
3955: .Ed
3956: .Pp
3957: .Nm
1.1 deraadt 3958: does not support this feature.
1.16 jmc 3959: .It
1.1 deraadt 3960: The
1.16 jmc 3961: .Nm lex
3962: .Sq %r
3963: .Pq generate a Ratfor scanner
3964: option is not supported.
3965: It is not part of the
3966: .Tn POSIX
3967: specification.
3968: .It
1.1 deraadt 3969: After a call to
1.16 jmc 3970: .Fn unput ,
3971: .Fa yytext
3972: is undefined until the next token is matched,
3973: unless the scanner was built using
3974: .Dq %array .
1.1 deraadt 3975: This is not the case with
1.16 jmc 3976: .Nm lex
3977: or the
3978: .Tn POSIX
3979: specification.
3980: The
3981: .Fl l
1.1 deraadt 3982: option does away with this incompatibility.
1.16 jmc 3983: .It
1.1 deraadt 3984: The precedence of the
1.16 jmc 3985: .Sq {}
3986: .Pq numeric range
3987: operator is different.
3988: .Nm lex
3989: interprets
3990: .Qq abc{1,3}
3991: as match one, two, or three occurrences of
3992: .Sq abc ,
3993: whereas
3994: .Nm
3995: interprets it as match
3996: .Sq ab
3997: followed by one, two, or three occurrences of
3998: .Sq c .
3999: The latter is in agreement with the
4000: .Tn POSIX
4001: specification.
4002: .It
1.1 deraadt 4003: The precedence of the
1.16 jmc 4004: .Sq ^
1.1 deraadt 4005: operator is different.
1.16 jmc 4006: .Nm lex
4007: interprets
4008: .Qq ^foo|bar
4009: as match either
4010: .Sq foo
4011: at the beginning of a line, or
4012: .Sq bar
4013: anywhere, whereas
4014: .Nm
4015: interprets it as match either
4016: .Sq foo
4017: or
4018: .Sq bar
4019: if they come at the beginning of a line.
4020: The latter is in agreement with the
4021: .Tn POSIX
4022: specification.
4023: .It
1.1 deraadt 4024: The special table-size declarations such as
1.16 jmc 4025: .Sq %a
1.1 deraadt 4026: supported by
1.16 jmc 4027: .Nm lex
1.1 deraadt 4028: are not required by
1.16 jmc 4029: .Nm
1.1 deraadt 4030: scanners;
1.16 jmc 4031: .Nm
1.1 deraadt 4032: ignores them.
1.16 jmc 4033: .It
1.1 deraadt 4034: The name
1.16 jmc 4035: .Dv FLEX_SCANNER
1.1 deraadt 4036: is #define'd so scanners may be written for use with either
1.16 jmc 4037: .Nm
1.1 deraadt 4038: or
1.16 jmc 4039: .Nm lex .
1.1 deraadt 4040: Scanners also include
1.16 jmc 4041: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4042: and
1.16 jmc 4043: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4044: indicating which version of
1.16 jmc 4045: .Nm
1.1 deraadt 4046: generated the scanner
1.16 jmc 4047: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4048: respectively).
1.16 jmc 4049: .El
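.Pp
The
.Dv FLEX_SCANNER
and version names described in the last item can be tested with the
C preprocessor.
The following is a minimal sketch, assuming it is placed in the user code
section of a scanner source file;
.Fn describe_generator
is a hypothetical helper name and not something
.Nm
provides.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write a short description of the scanner generator into buf.
 * FLEX_SCANNER, YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION are only
 * defined inside a flex-generated scanner; elsewhere the lex branch is
 * compiled instead.
 */
static const char *
describe_generator(char *buf, size_t len)
{
#ifdef FLEX_SCANNER
	snprintf(buf, len, "flex %d.%d",
	    YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION);
#else
	snprintf(buf, len, "lex");
#endif
	return buf;
}
```

.Pp
Compiled outside a flex-generated scanner, none of the three names is
defined and the helper reports lex.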
4050: .Pp
1.1 deraadt 4051: The following
1.16 jmc 4052: .Nm
1.1 deraadt 4053: features are not included in
1.16 jmc 4054: .Nm lex
4055: or the
4056: .Tn POSIX
4057: specification:
4058: .Bd -unfilled -offset indent
4059: C++ scanners
4060: %option
4061: start condition scopes
4062: start condition stacks
4063: interactive/non-interactive scanners
4064: yy_scan_string() and friends
4065: yyterminate()
4066: yy_set_interactive()
4067: yy_set_bol()
4068: YY_AT_BOL()
4069: <<EOF>>
4070: <*>
4071: YY_DECL
4072: YY_START
4073: YY_USER_ACTION
4074: YY_USER_INIT
4075: #line directives
4076: %{}'s around actions
4077: multiple actions on a line
4078: .Ed
4079: .Pp
4080: plus almost all of the
4081: .Nm
4082: flags.
1.1 deraadt 4083: The last feature in the list refers to the fact that with
1.16 jmc 4084: .Nm
1.37 jmc 4085: multiple actions can be placed on the same line,
1.16 jmc 4086: separated with semicolons, while with
4087: .Nm lex ,
1.1 deraadt 4088: the following
1.16 jmc 4089: .Pp
4090: .Dl foo handle_foo(); ++num_foos_seen;
4091: .Pp
4092: is
4093: .Pq rather surprisingly
4094: truncated to
4095: .Pp
4096: .Dl foo handle_foo();
4097: .Pp
4098: .Nm
4099: does not truncate the action.
4100: Actions that are not enclosed in braces
4101: are simply terminated at the end of the line.
4102: .Sh FILES
4103: .Bl -tag -width "<g++/FlexLexer.h>"
1.41 sobrado 4104: .It Pa flex.skl
1.16 jmc 4105: Skeleton scanner.
4106: This file is only used when building flex, not when
4107: .Nm
4108: executes.
1.41 sobrado 4109: .It Pa lex.backup
1.16 jmc 4110: Backing-up information for the
4111: .Fl b
4112: flag (called
4113: .Pa lex.bck
4114: on some systems).
1.41 sobrado 4115: .It Pa lex.yy.c
1.16 jmc 4116: Generated scanner
4117: (called
4118: .Pa lexyy.c
4119: on some systems).
1.41 sobrado 4120: .It Pa lex.yy.cc
1.16 jmc 4121: Generated C++ scanner class, when using
4122: .Fl + .
1.38 bentley 4123: .It In g++/FlexLexer.h
1.16 jmc 4124: Header file defining the C++ scanner base class,
4125: .Fa FlexLexer ,
4126: and its derived class,
4127: .Fa yyFlexLexer .
1.41 sobrado 4128: .It Pa /usr/lib/libl.*
1.16 jmc 4129: .Nm
4130: libraries.
4131: The
4132: .Pa /usr/lib/libfl.*\&
4133: libraries are links to these.
4134: Scanners must be linked using either
4135: .Fl \&ll
4136: or
4137: .Fl lfl .
4138: .El
1.29 jmc 4139: .Sh EXIT STATUS
4140: .Ex -std flex
1.16 jmc 4141: .Sh DIAGNOSTICS
4142: .Bl -diag
4143: .It warning, rule cannot be matched
4144: Indicates that the given rule cannot be matched because it follows other rules
4145: that will always match the same text.
4146: For example, in the following
4147: .Dq foo
4148: cannot be matched because it comes after an identifier
4149: .Qq catch-all
4150: rule:
4151: .Bd -literal -offset indent
4152: [a-z]+ got_identifier();
4153: foo got_foo();
4154: .Ed
4155: .Pp
1.1 deraadt 4156: Using
1.16 jmc 4157: .Em REJECT
1.1 deraadt 4158: in a scanner suppresses this warning.
1.16 jmc 4159: .It "warning, \-s option given but default rule can be matched"
4160: Means that it is possible
4161: .Pq perhaps only in a particular start condition
4162: that the default rule
4163: .Pq match any single character
4164: is the only one that will match a particular input.
4165: Since
4166: .Fl s
1.1 deraadt 4167: was given, presumably this is not intended.
1.16 jmc 4168: .It reject_used_but_not_detected undefined
4169: .It yymore_used_but_not_detected undefined
4170: These errors can occur at compile time.
4171: They indicate that the scanner uses
4172: .Em REJECT
1.1 deraadt 4173: or
1.16 jmc 4174: .Fn yymore
1.1 deraadt 4175: but that
1.16 jmc 4176: .Nm
1.1 deraadt 4177: failed to notice the fact, meaning that
1.16 jmc 4178: .Nm
1.1 deraadt 4179: scanned the first two sections looking for occurrences of these actions
1.16 jmc 4180: and failed to find any, but somehow they snuck in
4181: .Pq via an #include file, for example .
4182: Use
4183: .Dq %option reject
4184: or
4185: .Dq %option yymore
4186: to indicate to
4187: .Nm
4188: that these features are really needed.
4189: .It flex scanner jammed
4190: A scanner compiled with
4191: .Fl s
4192: has encountered an input string which wasn't matched by any of its rules.
4193: This error can also occur due to internal problems.
4194: .It token too large, exceeds YYLMAX
4195: The scanner uses
4196: .Dq %array
1.1 deraadt 4197: and one of its rules matched a string longer than the
1.16 jmc 4198: .Dv YYLMAX
4199: constant
4200: .Pq 8K bytes by default .
4201: The value can be increased by #define'ing
4202: .Dv YYLMAX
4203: in the definitions section of
4204: .Nm
1.1 deraadt 4205: input.
1.16 jmc 4206: .It "scanner requires \-8 flag to use the character 'x'"
4207: The scanner specification includes recognizing the 8-bit character
4208: .Sq x
4209: and the
4210: .Fl 8
4211: flag was not specified, so the scanner defaulted to 7-bit because the
4212: .Fl Cf
4213: or
4214: .Fl CF
4215: table compression options were used.
4216: See the discussion of the
4217: .Fl 7
1.1 deraadt 4218: flag for details.
1.16 jmc 4219: .It flex scanner push-back overflow
4220: unput() was used to push back so much text that the scanner's buffer
4221: could not hold both the pushed-back text and the current token in
4222: .Fa yytext .
4223: Ideally the scanner should dynamically resize the buffer in this case,
4224: but at present it does not.
4225: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4226: The scanner was working on matching an extremely large token and needed
4227: to expand the input buffer.
4228: This doesn't work with scanners that use
4229: .Em REJECT .
4230: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4231: This can occur in a scanner which is reentered after a long-jump
1.16 jmc 4232: has jumped out
4233: .Pq or over
4234: the scanner's activation frame.
4235: Before reentering the scanner, use:
4236: .Pp
4237: .Dl yyrestart(yyin);
4238: .Pp
1.1 deraadt 4239: or, as noted above, switch to using the C++ scanner class.
1.16 jmc 4240: .It "too many start conditions in <> construct!"
4241: More start conditions than exist were listed in a <> construct
4242: (so at least one of them must have been listed twice).
4243: .El
4244: .Sh SEE ALSO
4245: .Xr awk 1 ,
4246: .Xr sed 1 ,
4247: .Xr yacc 1
4248: .Rs
4249: .%A John Levine
4250: .%A Tony Mason
4251: .%A Doug Brown
4252: .%B Lex & Yacc
4253: .%I O'Reilly and Associates
4254: .%N 2nd edition
4255: .Re
4256: .Rs
4257: .%A Alfred Aho
4258: .%A Ravi Sethi
4259: .%A Jeffrey Ullman
4260: .%B Compilers: Principles, Techniques and Tools
4261: .%I Addison-Wesley
4262: .%D 1986
4263: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4264: .Re
1.23 jmc 4265: .Sh STANDARDS
4266: The
4267: .Nm lex
4268: utility is compliant with the
4269: .St -p1003.1-2008
4270: specification,
4271: though its presence is optional.
4272: .Pp
4273: The flags
1.31 jmc 4274: .Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
1.23 jmc 4275: .Op Fl -help ,
4276: and
4277: .Op Fl -version
4278: are extensions to that specification.
1.37 jmc 4279: .Pp
4280: See also the
4281: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
4282: section, above.
1.16 jmc 4283: .Sh AUTHORS
1.1 deraadt 4284: Vern Paxson, with the help of many ideas and much inspiration from
1.16 jmc 4285: Van Jacobson.
4286: Original version by Jef Poskanzer.
4287: The fast table representation is a partial implementation of a design done by
4288: Van Jacobson.
4289: The implementation was done by Kevin Gong and Vern Paxson.
4290: .Pp
1.1 deraadt 4291: Thanks to the many
1.16 jmc 4292: .Nm
1.1 deraadt 4293: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4294: Casey Leedom,
4295: Robert Abramovitz,
4296: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
1.39 bentley 4297: Neal Becker, Nelson H.F. Beebe,
4298: .Mt benson@odi.com ,
1.1 deraadt 4299: Karl Berry, Peter A. Bigot, Simon Blanchard,
4300: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4301: Brian Clapper, J.T. Conklin,
4302: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4303: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4304: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4305: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4306: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4307: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4308: Jan Hajic, Charles Hemphill, NORO Hideo,
4309: Jarkko Hietaniemi, Scott Hofmann,
4310: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4311: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4312: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
1.39 bentley 4313: Amir Katz,
4314: .Mt ken@ken.hilco.com ,
4315: Kevin B. Kenny,
1.1 deraadt 4316: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4317: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4318: David Loffredo, Mike Long,
4319: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4320: Bengt Martensson, Chris Metcalf,
4321: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4322: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4323: Richard Ohnemus, Karsten Pahnke,
1.16 jmc 4324: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4325: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4326: Frederic Raimbault, Pat Rankin, Rick Richardson,
4327: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4328: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4329: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4330: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4331: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4332: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 jmc 4333: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4334: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4335: and those whose names have slipped my marginal mail-archiving skills
4336: but whose contributions are appreciated all the
1.1 deraadt 4337: same.
1.16 jmc 4338: .Pp
1.1 deraadt 4339: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4340: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4341: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4342: distribution headaches.
1.16 jmc 4343: .Pp
4344: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4345: to Benson Margulies and Fred Burke for C++ support;
4346: to Kent Williams and Tom Epperly for C++ class support;
4347: to Ove Ewerlid for support of NUL's;
4348: and to Eric Hughes for support of multiple buffers.
4349: .Pp
1.1 deraadt 4350: This work was primarily done when I was with the Real Time Systems Group
1.16 jmc 4351: at the Lawrence Berkeley Laboratory in Berkeley, CA.
4352: Many thanks to all there for the support I received.
4353: .Pp
4354: Send comments to
1.34 schwarze 4355: .Aq Mt vern@ee.lbl.gov .
1.16 jmc 4356: .Sh BUGS
4357: Some trailing context patterns cannot be properly matched and generate
4358: warning messages
4359: .Pq "dangerous trailing context" .
4360: These are patterns where the ending of the first part of the rule
4361: matches the beginning of the second part, such as
4362: .Qq zx*/xy* ,
4363: where the
4364: .Sq x*
4365: matches the
4366: .Sq x
4367: at the beginning of the trailing context.
4368: (Note that the POSIX draft states that the text matched by such patterns
4369: is undefined.)
4370: .Pp
4371: For some trailing context rules, parts which are actually fixed-length are
4372: not recognized as such, leading to the above-mentioned performance loss.
4373: In particular, parts using
4374: .Sq |\&
4375: or
4376: .Sq {n}
4377: (such as
4378: .Qq foo{3} )
4379: are always considered variable-length.
4380: .Pp
4381: Combining trailing context with the special
4382: .Sq |\&
4383: action can result in fixed trailing context being turned into
4384: the more expensive variable trailing context.
4385: This happens, for example, in the following:
4386: .Bd -literal -offset indent
4387: %%
4388: abc |
4389: xyz/def
4390: .Ed
4391: .Pp
4392: Use of
4393: .Fn unput
4394: invalidates yytext and yyleng, unless the
4395: .Dq %array
4396: directive
4397: or the
4398: .Fl l
4399: option has been used.
4400: .Pp
4401: Pattern-matching of NUL's is substantially slower than matching other
4402: characters.
4403: .Pp
4404: Dynamic resizing of the input buffer is slow, as it entails rescanning
4405: all the text matched so far by the current
4406: .Pq generally huge
4407: token.
4408: .Pp
4409: Due to both buffering of input and read-ahead,
4410: it is not possible to intermix calls to
1.38 bentley 4411: .In stdio.h
1.16 jmc 4412: routines, such as
4413: .Fn getchar ,
4414: with
4415: .Nm
4416: rules and expect it to work.
4417: Call
4418: .Fn input
4419: instead.
4420: .Pp
4421: The total number of table entries listed by the
4422: .Fl v
4423: flag excludes the number of table entries needed to determine
4424: what rule has been matched.
4425: The number of entries is equal to the number of DFA states
4426: if the scanner does not use
4427: .Em REJECT ,
4428: and somewhat greater than the number of states if it does.
4429: .Pp
4430: .Em REJECT
4431: cannot be used with the
4432: .Fl f
4433: or
4434: .Fl F
4435: options.
4436: .Pp
4437: The
4438: .Nm
4439: internal algorithms need documentation.