Annotation of src/usr.bin/lex/flex.1, Revision 1.20
1.20 ! pvalchev 1: .\" $OpenBSD: flex.1,v 1.19 2004/04/19 18:29:17 jmc Exp $
1.16 jmc 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 jmc 31: .\"
32: .Dd April 1, 1995
33: .Dt FLEX 1
34: .Os
35: .Sh NAME
36: .Nm flex
37: .Nd fast lexical analyzer generator
38: .Sh SYNOPSIS
39: .Nm
40: .Op Fl 78BbcdFfhIiLlnpsTtVvw+?
41: .Op Fl C Ns Op Cm aeFfmr
42: .Op Fl Fl help
43: .Op Fl Fl version
44: .Sm off
45: .Op Fl o Ar output
46: .Op Fl P Ar prefix
47: .Op Fl S Ar skeleton
48: .Op Ar filename ...
49: .Sm on
50: .Sh OVERVIEW
1.1 deraadt 51: This manual describes
1.16 jmc 52: .Nm ,
53: a tool for generating programs that perform pattern-matching on text.
54: The manual includes both tutorial and reference sections:
55: .Bl -ohang
56: .It Sy Description
57: A brief overview of the tool.
58: .It Sy Some Simple Examples
59: .It Sy Format of the Input File
60: .It Sy Patterns
61: The extended regular expressions used by
62: .Nm .
63: .It Sy How the Input is Matched
64: The rules for determining what has been matched.
65: .It Sy Actions
66: How to specify what to do when a pattern is matched.
67: .It Sy The Generated Scanner
68: Details regarding the scanner that
69: .Nm
70: produces;
71: how to control the input source.
72: .It Sy Start Conditions
73: Introducing context into scanners, and managing
74: .Qq mini-scanners .
75: .It Sy Multiple Input Buffers
76: How to manipulate multiple input sources;
77: how to scan from strings instead of files.
78: .It Sy End-of-File Rules
79: Special rules for matching the end of the input.
80: .It Sy Miscellaneous Macros
81: A summary of macros available to the actions.
82: .It Sy Values Available to the User
83: A summary of values available to the actions.
84: .It Sy Interfacing with Yacc
85: Connecting flex scanners together with
86: .Xr yacc 1
87: parsers.
88: .It Sy Options
89: .Nm
90: command-line options, and the
91: .Dq %option
92: directive.
93: .It Sy Performance Considerations
94: How to make scanners go as fast as possible.
95: .It Sy Generating C++ Scanners
96: The
97: .Pq experimental
98: facility for generating C++ scanner classes.
99: .It Sy Incompatibilities with Lex and POSIX
100: How
101: .Nm
102: differs from AT&T lex and the
103: .Tn POSIX
104: lex standard.
105: .It Sy Files
106: Files used by
107: .Nm .
108: .It Sy Diagnostics
109: Those error messages produced by
110: .Nm
111: .Pq or scanners it generates
112: whose meanings might not be apparent.
113: .It Sy See Also
114: Other documentation, related tools.
115: .It Sy Authors
116: Includes contact information.
117: .It Sy Bugs
118: Known problems with
119: .Nm .
120: .El
121: .Sh DESCRIPTION
122: .Nm
1.1 deraadt 123: is a tool for generating
1.16 jmc 124: .Em scanners :
1.9 millert 125: programs which recognize lexical patterns in text.
1.16 jmc 126: .Nm
127: reads the given input files, or its standard input if no file names are given,
128: for a description of a scanner to generate.
129: The description is in the form of pairs of regular expressions and C code,
130: called
131: .Em rules .
132: .Nm
1.1 deraadt 133: generates as output a C source file,
1.16 jmc 134: .Pa lex.yy.c ,
1.1 deraadt 135: which defines a routine
1.16 jmc 136: .Fn yylex .
1.1 deraadt 137: This file is compiled and linked with the
1.16 jmc 138: .Fl lfl
139: library to produce an executable.
140: When the executable is run, it analyzes its input for occurrences
141: of the regular expressions.
142: Whenever it finds one, it executes the corresponding C code.
143: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 144: First some simple examples to get the flavor of how one uses
1.16 jmc 145: .Nm .
1.1 deraadt 146: The following
1.16 jmc 147: .Nm
1.1 deraadt 148: input specifies a scanner which, whenever it encounters the string
1.16 jmc 149: .Qq username ,
150: replaces it with the user's login name:
151: .Bd -literal -offset indent
152: %%
153: username printf("%s", getlogin());
154: .Ed
155: .Pp
1.1 deraadt 156: By default, any text not matched by a
1.16 jmc 157: .Nm
158: scanner is copied to the output, so the net effect of this scanner is
159: to copy its input file to its output with each occurrence of
160: .Qq username
161: expanded.
162: In this input, there is just one rule.
163: .Qq username
164: is the
165: .Em pattern
166: and the
167: .Qq printf
168: is the
169: .Em action .
170: The
171: .Qq %%
172: marks the beginning of the rules.
173: .Pp
1.1 deraadt 174: Here's another simple example:
1.16 jmc 175: .Bd -literal -offset indent
1.20 ! pvalchev 176: %{
1.16 jmc 177: int num_lines = 0, num_chars = 0;
1.20 ! pvalchev 178: %}
1.1 deraadt 179:
1.16 jmc 180: %%
181: \en ++num_lines; ++num_chars;
182: \&. ++num_chars;
183:
184: %%
185: int main(void)
186: {
187: yylex();
188: printf("# of lines = %d, # of chars = %d\en",
189: num_lines, num_chars);
190: }
191: .Ed
192: .Pp
1.1 deraadt 193: This scanner counts the number of characters and the number
1.16 jmc 194: of lines in its input
195: (it produces no output other than the final report on the counts).
196: The definitions section declares two globals,
197: .Qq num_lines
198: and
199: .Qq num_chars ,
200: which are accessible both inside
201: .Fn yylex
1.1 deraadt 202: and in the
1.16 jmc 203: .Fn main
204: routine declared after the second
205: .Qq %% .
206: There are two rules, one which matches a newline
207: .Pq \&"\en\&"
208: and increments both the line count and the character count,
209: and one which matches any character other than a newline
210: (indicated by the
211: .Qq \&.
212: regular expression).
213: .Pp
1.1 deraadt 214: A somewhat more complicated example:
1.16 jmc 215: .Bd -literal -offset indent
216: /* scanner for a toy Pascal-like language */
1.1 deraadt 217:
1.16 jmc 218: %{
219: /* need this for the calls to atoi() and atof() below */
220: #include <stdlib.h>
221: %}
1.1 deraadt 222:
1.16 jmc 223: DIGIT [0-9]
224: ID [a-z][a-z0-9]*
1.1 deraadt 225:
1.16 jmc 226: %%
1.1 deraadt 227:
1.16 jmc 228: {DIGIT}+ {
229: printf("An integer: %s (%d)\en", yytext,
230: atoi(yytext));
231: }
1.1 deraadt 232:
1.16 jmc 233: {DIGIT}+"."{DIGIT}* {
234: printf("A float: %s (%g)\en", yytext,
235: atof(yytext));
236: }
1.1 deraadt 237:
1.16 jmc 238: if|then|begin|end|procedure|function {
239: printf("A keyword: %s\en", yytext);
240: }
1.1 deraadt 241:
1.16 jmc 242: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 243:
1.16 jmc 244: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 245:
1.16 jmc 246: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 247:
1.16 jmc 248: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 249:
1.16 jmc 250: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 251:
1.16 jmc 252: %%
1.1 deraadt 253:
1.16 jmc 254: int main(int argc, char *argv[])
255: {
256: ++argv; --argc; /* skip over program name */
257: if (argc > 0)
258: yyin = fopen(argv[0], "r");
1.1 deraadt 259: else
260: yyin = stdin;
1.7 aaron 261:
1.1 deraadt 262: yylex();
1.16 jmc 263: }
264: .Ed
265: .Pp
266: This is the beginning of a simple scanner for a language like Pascal.
267: It identifies different types of
268: .Em tokens
1.1 deraadt 269: and reports on what it has seen.
1.16 jmc 270: .Pp
271: The details of this example will be explained in the following sections.
272: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 273: The
1.16 jmc 274: .Nm
1.1 deraadt 275: input file consists of three sections, separated by a line with just
1.16 jmc 276: .Qq %%
1.1 deraadt 277: in it:
1.16 jmc 278: .Bd -unfilled -offset indent
279: definitions
280: %%
281: rules
282: %%
283: user code
284: .Ed
285: .Pp
1.1 deraadt 286: The
1.16 jmc 287: .Em definitions
1.1 deraadt 288: section contains declarations of simple
1.16 jmc 289: .Em name
1.1 deraadt 290: definitions to simplify the scanner specification, and declarations of
1.16 jmc 291: .Em start conditions ,
1.1 deraadt 292: which are explained in a later section.
1.16 jmc 293: .Pp
1.1 deraadt 294: Name definitions have the form:
1.16 jmc 295: .Pp
296: .D1 name definition
297: .Pp
298: The
299: .Qq name
300: is a word beginning with a letter or an underscore
301: .Pq Sq _
302: followed by zero or more letters, digits,
303: .Sq _ ,
304: or
305: .Sq -
306: .Pq dash .
1.8 aaron 307: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 308: following the name and to continue to the end of the line.
1.16 jmc 309: The definition can subsequently be referred to using
310: .Qq {name} ,
311: which will expand to
312: .Qq (definition) .
313: For example:
314: .Bd -literal -offset indent
315: DIGIT [0-9]
316: ID [a-z][a-z0-9]*
317: .Ed
318: .Pp
319: This defines
320: .Qq DIGIT
321: to be a regular expression which matches a single digit, and
322: .Qq ID
323: to be a regular expression which matches a letter
1.1 deraadt 324: followed by zero-or-more letters-or-digits.
325: A subsequent reference to
1.16 jmc 326: .Pp
327: .Dl {DIGIT}+"."{DIGIT}*
328: .Pp
1.1 deraadt 329: is identical to
1.16 jmc 330: .Pp
331: .Dl ([0-9])+"."([0-9])*
332: .Pp
333: and matches one-or-more digits followed by a
334: .Sq .\&
335: followed by zero-or-more digits.
336: .Pp
1.1 deraadt 337: The
1.16 jmc 338: .Em rules
1.1 deraadt 339: section of the
1.16 jmc 340: .Nm
1.1 deraadt 341: input contains a series of rules of the form:
1.16 jmc 342: .Pp
343: .D1 pattern action
344: .Pp
345: The pattern must be unindented and the action must begin
1.1 deraadt 346: on the same line.
1.16 jmc 347: .Pp
1.1 deraadt 348: See below for a further description of patterns and actions.
1.16 jmc 349: .Pp
1.1 deraadt 350: Finally, the user code section is simply copied to
1.16 jmc 351: .Pa lex.yy.c
1.1 deraadt 352: verbatim.
1.16 jmc 353: It is used for companion routines which call or are called by the scanner.
354: The presence of this section is optional;
1.1 deraadt 355: if it is missing, the second
1.16 jmc 356: .Qq %%
357: in the input file may be skipped too.
358: .Pp
359: In the definitions and rules sections, any indented text or text enclosed in
360: .Sq %{
1.1 deraadt 361: and
1.16 jmc 362: .Sq %}
363: is copied verbatim to the output
364: .Pq with the %{}'s removed .
1.1 deraadt 365: The %{}'s must appear unindented on lines by themselves.
1.16 jmc 366: .Pp
1.1 deraadt 367: In the rules section,
1.16 jmc 368: any indented or %{} text appearing before the first rule may be used to
369: declare variables which are local to the scanning routine and
370: .Pq after the declarations
1.1 deraadt 371: code which is to be executed whenever the scanning routine is entered.
372: Other indented or %{} text in the rule section is still copied to the output,
373: but its meaning is not well-defined and it may well cause compile-time
374: errors (this feature is present for
1.16 jmc 375: .Tn POSIX
1.1 deraadt 376: compliance; see below for other such features).
1.16 jmc 377: .Pp
378: In the definitions section
379: .Pq but not in the rules section ,
380: an unindented comment
381: (i.e., a line beginning with
382: .Qq /* )
383: is also copied verbatim to the output up to the next
384: .Qq */ .
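.Pp
As a minimal sketch of how the sections fit together
(the name
.Qq WORD
and the variable
.Qq words
are illustrative assumptions, not part of
.Nm ) ,
the following input reports the number of words on each
newline-terminated line:
.Bd -literal -offset indent
%{
#include <stdio.h>
%}

WORD	[A-Za-z]+

%%
	/* indented text before the first rule is local to yylex() */
	int words = 0;

{WORD}	++words;
\en	printf("%d\en", words); words = 0;
\&.	/* ignore any other character */
.Ed
.Pp
The user code section and the second
.Qq %%
are omitted here, so this sketch assumes the scanner is linked with
.Fl lfl ,
which supplies a default
.Fn main .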
385: .Sh PATTERNS
1.1 deraadt 386: The patterns in the input are written using an extended set of regular
1.16 jmc 387: expressions.
388: These are:
389: .Bl -tag -width "XXXXXXXX"
390: .It x
391: Match the character
392: .Sq x .
393: .It .\&
394: Any character
395: .Pq byte
396: except newline.
397: .It [xyz]
398: A
399: .Qq character class ;
400: in this case, the pattern matches either an
401: .Sq x ,
402: a
403: .Sq y ,
404: or a
405: .Sq z .
406: .It [abj-oZ]
407: A
408: .Qq character class
409: with a range in it; matches an
410: .Sq a ,
411: a
412: .Sq b ,
413: any letter from
414: .Sq j
415: through
416: .Sq o ,
417: or a
418: .Sq Z .
419: .It [^A-Z]
420: A
421: .Qq negated character class ,
422: i.e., any character but those in the class.
423: In this case, any character EXCEPT an uppercase letter.
424: .It [^A-Z\en]
425: Any character EXCEPT an uppercase letter or a newline.
426: .It r*
427: Zero or more r's, where
428: .Sq r
429: is any regular expression.
430: .It r+
431: One or more r's.
432: .It r?
433: Zero or one r's (that is,
434: .Qq an optional r ) .
435: .It r{2,5}
436: Anywhere from two to five r's.
437: .It r{2,}
438: Two or more r's.
439: .It r{4}
440: Exactly 4 r's.
441: .It {name}
442: The expansion of the
443: .Qq name
444: definition
445: .Pq see above .
446: .It \&"[xyz]\e\&"foo\&"
447: The literal string: [xyz]"foo.
448: .It \eX
449: If
450: .Sq X
451: is an
452: .Sq a ,
453: .Sq b ,
454: .Sq f ,
455: .Sq n ,
456: .Sq r ,
457: .Sq t ,
458: or
459: .Sq v ,
460: then the ANSI-C interpretation of
461: .Sq \eX .
462: Otherwise, a literal
463: .Sq X
464: (used to escape operators such as
465: .Sq * ) .
466: .It \e0
467: A NUL character
468: .Pq ASCII code 0 .
469: .It \e123
470: The character with octal value 123.
471: .It \ex2a
472: The character with hexadecimal value 2a.
473: .It (r)
474: Match an
475: .Sq r ;
476: parentheses are used to override precedence
477: .Pq see below .
478: .It rs
479: The regular expression
480: .Sq r
481: followed by the regular expression
482: .Sq s ;
483: called
484: .Qq concatenation .
485: .It r|s
486: Either an
487: .Sq r
488: or an
489: .Sq s .
490: .It r/s
491: An
492: .Sq r ,
493: but only if it is followed by an
494: .Sq s .
495: The text matched by
496: .Sq s
497: is included when determining whether this rule is the
498: .Qq longest match ,
499: but is then returned to the input before the action is executed.
500: So the action only sees the text matched by
501: .Sq r .
502: This type of pattern is called
503: .Qq trailing context .
504: (There are some combinations of r/s that
505: .Nm
506: cannot match correctly; see notes in the
507: .Sx BUGS
508: section below regarding
509: .Qq dangerous trailing context . )
510: .It ^r
511: An
512: .Sq r ,
513: but only at the beginning of a line
514: (i.e., just starting to scan, or right after a newline has been scanned).
515: .It r$
516: An
517: .Sq r ,
518: but only at the end of a line
519: .Pq i.e., just before a newline .
520: Equivalent to
521: .Qq r/\en .
522: .Pp
523: Note that
524: .Nm flex Ns 's
525: notion of
526: .Qq newline
527: is exactly whatever the C compiler used to compile
528: .Nm
529: interprets
530: .Sq \en
531: as.
532: .\" In particular, on some DOS systems you must either filter out \er's in the
533: .\" input yourself, or explicitly use r/\er\en for
534: .\" .Qq r$ .
535: .It <s>r
536: An
537: .Sq r ,
538: but only in start condition
539: .Sq s
540: .Pq see below for discussion of start conditions .
541: .It <s1,s2,s3>r
542: The same, but in any of start conditions s1, s2, or s3.
543: .It <*>r
544: An
545: .Sq r
546: in any start condition, even an exclusive one.
547: .It <<EOF>>
548: An end-of-file.
549: .It <s1,s2><<EOF>>
550: An end-of-file when in start condition s1 or s2.
551: .El
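.Pp
A minimal sketch of trailing context, using illustrative rules that are
not taken from an existing scanner:
.Bd -literal -offset indent
%%
[0-9]+/"%"	printf("percentage: %s\en", yytext);
[0-9]+		printf("number: %s\en", yytext);
\&.|\en		/* discard anything else */
.Ed
.Pp
On the input
.Qq 50% ,
the first rule is chosen because the trailing
.Sq %
counts towards the length of the match, but
.Fa yytext
holds only
.Qq 50 ;
the
.Sq %
is returned to the input and is then matched by the last rule.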
552: .Pp
1.1 deraadt 553: Note that inside of a character class, all regular expression operators
1.16 jmc 554: lose their special meaning except escape
555: .Pq Sq \e
556: and the character class operators,
557: .Sq - ,
558: .Sq ]\& ,
559: and, at the beginning of the class,
560: .Sq ^ .
561: .Pp
1.1 deraadt 562: The regular expressions listed above are grouped according to
563: precedence, from highest precedence at the top to lowest at the bottom.
1.16 jmc 564: Those grouped together have equal precedence.
565: For example,
566: .Pp
567: .D1 foo|bar*
568: .Pp
1.1 deraadt 569: is the same as
1.16 jmc 570: .Pp
571: .D1 (foo)|(ba(r*))
572: .Pp
573: since the
574: .Sq *
575: operator has higher precedence than concatenation,
576: and concatenation higher than alternation
577: .Pq Sq |\& .
578: This pattern therefore matches
579: .Em either
580: the string
581: .Qq foo
582: .Em or
583: the string
584: .Qq ba
585: followed by zero-or-more r's.
586: To match
587: .Qq foo
588: or zero-or-more "bar"'s,
589: use:
590: .Pp
591: .D1 foo|(bar)*
592: .Pp
1.1 deraadt 593: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 jmc 594: .Pp
595: .D1 (foo|bar)*
596: .Pp
1.1 deraadt 597: In addition to characters and ranges of characters, character classes
598: can also contain character class
1.16 jmc 599: .Em expressions .
1.1 deraadt 600: These are expressions enclosed inside
1.16 jmc 601: .Sq [:
602: and
603: .Sq :]
604: delimiters (which themselves must appear between the
605: .Sq [
1.1 deraadt 606: and
1.16 jmc 607: .Sq ]\&
608: of the
1.1 deraadt 609: character class; other elements may occur inside the character class, too).
610: The valid expressions are:
1.16 jmc 611: .Bd -unfilled -offset indent
612: [:alnum:] [:alpha:] [:blank:]
613: [:cntrl:] [:digit:] [:graph:]
614: [:lower:] [:print:] [:punct:]
615: [:space:] [:upper:] [:xdigit:]
616: .Ed
617: .Pp
1.1 deraadt 618: These expressions all designate a set of characters equivalent to
619: the corresponding standard C
1.16 jmc 620: .Fn isXXX
621: function.
622: For example, [:alnum:] designates those characters for which
623: .Xr isalnum 3
624: returns true \- i.e., any alphabetic or numeric character.
1.1 deraadt 625: Some systems don't provide
1.16 jmc 626: .Xr isblank 3 ,
627: so
628: .Nm
629: defines [:blank:] as a blank or a tab.
630: .Pp
1.1 deraadt 631: For example, the following character classes are all equivalent:
1.16 jmc 632: .Bd -unfilled -offset indent
633: [[:alnum:]]
634: [[:alpha:][:digit:]]
635: [[:alpha:]0-9]
636: [a-zA-Z0-9]
637: .Ed
638: .Pp
639: If the scanner is case-insensitive (the
640: .Fl i
641: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
642: .Pp
1.1 deraadt 643: Some notes on patterns:
1.16 jmc 644: .Bl -dash
645: .It
646: A negated character class such as the example
647: .Qq [^A-Z]
648: above will match a newline unless "\en"
649: .Pq or an equivalent escape sequence
650: is one of the characters explicitly present in the negated character class
651: (e.g.,
652: .Qq [^A-Z\en] ) .
653: This is unlike how many other regular expression tools treat negated character
654: classes, but unfortunately the inconsistency is historically entrenched.
655: Matching newlines means that a pattern like
656: .Qq [^"]*
657: can match the entire input unless there's another quote in the input.
658: .It
659: A rule can have at most one instance of trailing context
660: (the
661: .Sq /
662: operator or the
663: .Sq $
664: operator).
665: The start condition,
666: .Sq ^ ,
667: and
668: .Qq <<EOF>>
669: patterns can only occur at the beginning of a pattern and, like
670: .Sq /
671: and
672: .Sq $ ,
673: cannot be grouped inside parentheses.
674: A
675: .Sq ^
676: which does not occur at the beginning of a rule or a
677: .Sq $
678: which does not occur at the end of a rule loses its special properties
679: and is treated as a normal character.
680: .It
1.1 deraadt 681: The following are illegal:
1.16 jmc 682: .Bd -unfilled -offset indent
683: foo/bar$
684: <sc1>foo<sc2>bar
685: .Ed
686: .Pp
687: Note that the first of these can be written
688: .Qq foo/bar\en .
689: .It
690: The following will result in
691: .Sq $
692: or
693: .Sq ^
694: being treated as a normal character:
695: .Bd -unfilled -offset indent
696: foo|(bar$)
697: foo|^bar
698: .Ed
699: .Pp
700: If what's wanted is a
701: .Qq foo
702: or a bar-followed-by-a-newline, the following could be used
703: (the special
704: .Sq |\&
705: action is explained below):
706: .Bd -unfilled -offset indent
707: foo |
708: bar$ /* action goes here */
709: .Ed
710: .Pp
1.1 deraadt 711: A similar trick will work for matching a foo or a
712: bar-at-the-beginning-of-a-line.
1.16 jmc 713: .El
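.Pp
As an illustrative sketch of the first note above
(these rules are an assumption made for demonstration),
excluding the newline from the negated class keeps a quoted-string
pattern from running past the end of the line:
.Bd -literal -offset indent
%%
\e"[^"\en]*\e"	printf("string: %s\en", yytext);
\e"[^"\en]*	printf("unterminated string: %s\en", yytext);
\&.|\en		/* discard anything else */
.Ed
.Pp
When a line holds an opening quote with no closing quote,
only the second rule can match, and it stops at the newline
instead of consuming the following lines.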
714: .Sh HOW THE INPUT IS MATCHED
715: When the generated scanner is run,
716: it analyzes its input looking for strings which match any of its patterns.
717: If it finds more than one match,
718: it takes the one matching the most text
719: (for trailing context rules, this includes the length of the trailing part,
720: even though it will then be returned to the input).
721: If it finds two or more matches of the same length,
722: the rule listed first in the
723: .Nm
1.1 deraadt 724: input file is chosen.
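.Pp
A minimal sketch of both selection rules; the patterns are illustrative
only:
.Bd -literal -offset indent
%%
if		printf("keyword\en");
[a-z]+		printf("identifier: %s\en", yytext);
[ \et\en]+	/* skip whitespace */
.Ed
.Pp
On the input
.Qq if ,
both of the first two rules match two characters,
so the rule listed first is chosen.
On the input
.Qq iffy ,
the second rule matches four characters against two
and is chosen instead.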
1.16 jmc 725: .Pp
1.1 deraadt 726: Once the match is determined, the text corresponding to the match
727: (called the
1.16 jmc 728: .Em token )
1.1 deraadt 729: is made available in the global character pointer
1.16 jmc 730: .Fa yytext ,
1.1 deraadt 731: and its length in the global integer
1.16 jmc 732: .Fa yyleng .
1.1 deraadt 733: The
1.16 jmc 734: .Em action
735: corresponding to the matched pattern is then executed
736: .Pq a more detailed description of actions follows ,
737: and then the remaining input is scanned for another match.
738: .Pp
739: If no match is found, then the default rule is executed:
740: the next character in the input is considered matched and
741: copied to the standard output.
742: Thus, the simplest legal
743: .Nm
1.1 deraadt 744: input is:
1.16 jmc 745: .Pp
746: .D1 %%
747: .Pp
748: which generates a scanner that simply copies its input
749: .Pq one character at a time
750: to its output.
751: .Pp
1.1 deraadt 752: Note that
1.16 jmc 753: .Fa yytext
754: can be defined in two different ways:
755: either as a character pointer or as a character array.
756: Which definition
757: .Nm
758: uses can be controlled by including one of the special directives
759: .Dq %pointer
760: or
761: .Dq %array
762: in the first
763: .Pq definitions
764: section of flex input.
765: The default is
766: .Dq %pointer ,
767: unless the
768: .Fl l
769: lex compatibility option is used, in which case
770: .Fa yytext
1.1 deraadt 771: will be an array.
772: The advantage of using
1.16 jmc 773: .Dq %pointer
1.1 deraadt 774: is substantially faster scanning and no buffer overflow when matching
1.16 jmc 775: very large tokens
776: .Pq unless not enough dynamic memory is available .
777: The disadvantage is that actions are restricted in how they can modify
778: .Fa yytext
779: .Pq see the next section ,
780: and calls to the
781: .Fn unput
1.10 deraadt 782: function destroy the present contents of
1.16 jmc 783: .Fa yytext ,
1.1 deraadt 784: which can be a considerable porting headache when moving between different
1.16 jmc 785: .Nm lex
1.1 deraadt 786: versions.
1.16 jmc 787: .Pp
1.1 deraadt 788: The advantage of
1.16 jmc 789: .Dq %array
790: is that
791: .Fa yytext
792: can be modified as much as wanted, and calls to
793: .Fn unput
1.1 deraadt 794: do not destroy
1.16 jmc 795: .Fa yytext
796: .Pq see below .
797: Furthermore, existing
798: .Nm lex
1.1 deraadt 799: programs sometimes access
1.16 jmc 800: .Fa yytext
1.1 deraadt 801: externally using declarations of the form:
1.16 jmc 802: .Pp
803: .D1 extern char yytext[];
804: .Pp
1.1 deraadt 805: This definition is erroneous when used with
1.16 jmc 806: .Dq %pointer ,
1.1 deraadt 807: but correct for
1.16 jmc 808: .Dq %array .
809: .Pp
810: .Dq %array
1.1 deraadt 811: defines
1.16 jmc 812: .Fa yytext
1.1 deraadt 813: to be an array of
1.16 jmc 814: .Dv YYLMAX
815: characters, which defaults to a fairly large value.
816: The size can be changed by simply #define'ing
817: .Dv YYLMAX
818: to a different value in the first section of
819: .Nm
820: input.
821: As mentioned above, with
822: .Dq %pointer
823: yytext grows dynamically to accommodate large tokens.
824: While this means a
825: .Dq %pointer
826: scanner can accommodate very large tokens
827: .Pq such as matching entire blocks of comments ,
828: bear in mind that each time the scanner must resize
829: .Fa yytext
1.1 deraadt 830: it also must rescan the entire token from the beginning, so matching such
831: tokens can prove slow.
1.16 jmc 832: .Fa yytext
833: presently does not dynamically grow if a call to
834: .Fn unput
1.1 deraadt 835: results in too much text being pushed back; instead, a run-time error results.
1.16 jmc 836: .Pp
837: Also note that
838: .Dq %array
839: cannot be used with C++ scanner classes
840: .Pq the c++ option; see below .
841: .Sh ACTIONS
842: Each pattern in a rule has a corresponding action,
843: which can be any arbitrary C statement.
844: The pattern ends at the first non-escaped whitespace character;
845: the remainder of the line is its action.
846: If the action is empty,
847: then when the pattern is matched the input token is simply discarded.
848: For example, here is the specification for a program
849: which deletes all occurrences of
850: .Qq zap me
851: from its input:
852: .Bd -literal -offset indent
853: %%
854: "zap me"
855: .Ed
856: .Pp
1.1 deraadt 857: (It will copy all other characters in the input to the output since
858: they will be matched by the default rule.)
1.16 jmc 859: .Pp
1.1 deraadt 860: Here is a program which compresses multiple blanks and tabs down to
861: a single blank, and throws away whitespace found at the end of a line:
1.16 jmc 862: .Bd -literal -offset indent
863: %%
864: [ \et]+ putchar(' ');
865: [ \et]+$ /* ignore this token */
866: .Ed
867: .Pp
868: If the action contains a
869: .Sq { ,
870: then the action spans till the balancing
871: .Sq }
1.1 deraadt 872: is found, and the action may cross multiple lines.
1.16 jmc 873: .Nm
1.1 deraadt 874: knows about C strings and comments and won't be fooled by braces found
875: within them, but also allows actions to begin with
1.16 jmc 876: .Sq %{
1.1 deraadt 877: and will consider the action to be all the text up to the next
1.16 jmc 878: .Sq %}
879: .Pq regardless of ordinary braces inside the action .
880: .Pp
881: An action consisting solely of a vertical bar
882: .Pq Sq |\&
883: means
884: .Qq same as the action for the next rule .
885: See below for an illustration.
886: .Pp
887: Actions can include arbitrary C code,
888: including return statements to return a value to whatever routine called
889: .Fn yylex .
1.1 deraadt 890: Each time
1.16 jmc 891: .Fn yylex
892: is called, it continues processing tokens from where it last left off
893: until it either reaches the end of the file or executes a return.
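.Pp
A minimal sketch of a scanner whose actions return values to the caller;
the token codes
.Dv NUMBER
and
.Dv WORD
are hypothetical names chosen for this example:
.Bd -literal -offset indent
%{
#include <stdio.h>

enum { NUMBER = 1, WORD };	/* hypothetical token codes */
%}

%%
[0-9]+		return NUMBER;
[a-z]+		return WORD;
\&.|\en		/* discard anything else */
%%

int
main(void)
{
	int tok;

	/* each call resumes scanning where the last one stopped */
	while ((tok = yylex()) != 0)
		printf("%d: %s\en", tok, yytext);
	return 0;
}
.Ed
.Pp
Each call to
.Fn yylex
returns at the next number or word.
The loop relies on
.Fn yylex
returning 0 at end of file, and assumes the scanner is linked with
.Fl lfl .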
894: .Pp
1.1 deraadt 895: Actions are free to modify
1.16 jmc 896: .Fa yytext
897: except for lengthening it
898: (adding characters to its end \- these will overwrite later characters in the
899: input stream).
900: This, however, does not apply when using
901: .Dq %array
902: .Pq see above ;
903: in that case,
904: .Fa yytext
1.1 deraadt 905: may be freely modified in any way.
1.16 jmc 906: .Pp
1.1 deraadt 907: Actions are free to modify
1.16 jmc 908: .Fa yyleng
1.1 deraadt 909: except they should not do so if the action also includes use of
1.16 jmc 910: .Fn yymore
911: .Pq see below .
912: .Pp
1.1 deraadt 913: There are a number of special directives which can be included within
914: an action:
1.16 jmc 915: .Bl -tag -width Ds
916: .It ECHO
917: Copies
918: .Fa yytext
919: to the scanner's output.
920: .It BEGIN
921: Followed by the name of a start condition, places the scanner in the
922: corresponding start condition
923: .Pq see below .
924: .It REJECT
925: Directs the scanner to proceed on to the
926: .Qq second best
927: rule which matched the input
928: .Pq or a prefix of the input .
929: The rule is chosen as described above in
930: .Sx HOW THE INPUT IS MATCHED ,
931: and
932: .Fa yytext
1.1 deraadt 933: and
1.16 jmc 934: .Fa yyleng
1.1 deraadt 935: set up appropriately.
936: It may either be one which matched as much text
937: as the originally chosen rule but came later in the
1.16 jmc 938: .Nm
1.1 deraadt 939: input file, or one which matched less text.
940: For example, the following will both count the
1.16 jmc 941: words in the input and call the routine
942: .Fn special
943: whenever
944: .Qq frob
945: is seen:
946: .Bd -literal -offset indent
947: int word_count = 0;
948: %%
949:
950: frob special(); REJECT;
951: [^ \et\en]+ ++word_count;
952: .Ed
953: .Pp
1.1 deraadt 954: Without the
1.16 jmc 955: .Em REJECT ,
956: any "frob"'s in the input would not be counted as words,
957: since the scanner normally executes only one action per token.
1.1 deraadt 958: Multiple
1.16 jmc 959: .Em REJECT Ns 's
960: are allowed,
961: each one finding the next best choice to the currently active rule.
962: For example, when the following scanner scans the token
963: .Qq abcd ,
964: it will write
965: .Qq abcdabcaba
966: to the output:
967: .Bd -literal -offset indent
968: %%
969: a |
970: ab |
971: abc |
972: abcd ECHO; REJECT;
973: \&.|\en /* eat up any unmatched character */
974: .Ed
975: .Pp
1.1 deraadt 976: (The first three rules share the fourth's action since they use
1.16 jmc 977: the special
978: .Sq |\&
979: action.)
980: .Em REJECT
1.1 deraadt 981: is a particularly expensive feature in terms of scanner performance;
1.16 jmc 982: if it is used in any of the scanner's actions it will slow down
983: all of the scanner's matching.
984: Furthermore,
985: .Em REJECT
1.1 deraadt 986: cannot be used with the
1.16 jmc 987: .Fl Cf
1.1 deraadt 988: or
1.16 jmc 989: .Fl CF
990: options
991: .Pq see below .
992: .Pp
1.1 deraadt 993: Note also that unlike the other special actions,
1.16 jmc 994: .Em REJECT
1.1 deraadt 995: is a
1.16 jmc 996: .Em branch ;
997: code immediately following it in the action will not be executed.
998: .It yymore()
999: Tells the scanner that the next time it matches a rule, the corresponding
1000: token should be appended onto the current value of
1001: .Fa yytext
1002: rather than replacing it.
1003: For example, given the input
1004: .Qq mega-kludge
1005: the following will write
1006: .Qq mega-mega-kludge
1007: to the output:
1008: .Bd -literal -offset indent
1009: %%
1010: mega- ECHO; yymore();
1011: kludge ECHO;
1012: .Ed
1013: .Pp
1014: First
1015: .Qq mega-
1016: is matched and echoed to the output.
1017: Then
1018: .Qq kludge
1019: is matched, but the previous
1020: .Qq mega-
1021: is still hanging around at the beginning of
1022: .Fa yytext
1.1 deraadt 1023: so the
1.16 jmc 1024: .Em ECHO
1025: for the
1026: .Qq kludge
1027: rule will actually write
1028: .Qq mega-kludge .
1029: .Pp
1.1 deraadt 1030: Two notes regarding use of
1.16 jmc 1031: .Fn yymore :
1.1 deraadt 1032: First,
1.16 jmc 1033: .Fn yymore
1.1 deraadt 1034: depends on the value of
1.16 jmc 1035: .Fa yyleng
1036: correctly reflecting the size of the current token, so
1037: .Fa yyleng
1038: must not be modified when using
1039: .Fn yymore .
1.1 deraadt 1040: Second, the presence of
1.16 jmc 1041: .Fn yymore
1.1 deraadt 1042: in the scanner's action entails a minor performance penalty in the
1043: scanner's matching speed.
1.16 jmc 1044: .It yyless(n)
1045: Returns all but the first
1046: .Ar n
1.1 deraadt 1047: characters of the current token back to the input stream, where they
1048: will be rescanned when the scanner looks for the next match.
1.16 jmc 1049: .Fa yytext
1.1 deraadt 1050: and
1.16 jmc 1051: .Fa yyleng
1.1 deraadt 1052: are adjusted appropriately (e.g.,
1.16 jmc 1053: .Fa yyleng
1.1 deraadt 1054: will now be equal to
1.16 jmc 1055: .Ar n ) .
1056: For example, on the input
1057: .Qq foobar
1058: the following will write out
1059: .Qq foobarbar :
1060: .Bd -literal -offset indent
1061: %%
1062: foobar ECHO; yyless(3);
1063: [a-z]+ ECHO;
1064: .Ed
1065: .Pp
1.1 deraadt 1066: An argument of 0 to
1.16 jmc 1067: .Fn yyless
1068: will cause the entire current input string to be scanned again.
1069: Unless the way the scanner subsequently processes its input has been changed
1070: (using
1071: .Em BEGIN ,
1072: for example),
1073: this will result in an endless loop.
1074: .Pp
1.1 deraadt 1075: Note that
1.16 jmc 1076: .Fn yyless
1077: is a macro and can only be used in the
1078: .Nm
1079: input file, not from other source files.
1080: .It unput(c)
1081: Puts the character
1082: .Ar c
1083: back into the input stream.
1084: It will be the next character scanned.
1.1 deraadt 1085: The following action will take the current token and cause it
1086: to be rescanned enclosed in parentheses.
1.16 jmc 1087: .Bd -literal -offset indent
1088: {
1089: int i;
1090: char *yycopy;
1091:
1092: /* Copy yytext because unput() trashes yytext */
1093: if ((yycopy = strdup(yytext)) == NULL)
1094: err(1, NULL);
1095: unput(')');
1096: for (i = yyleng - 1; i >= 0; --i)
1097: unput(yycopy[i]);
1098: unput('(');
1099: free(yycopy);
1100: }
1101: .Ed
1102: .Pp
1.1 deraadt 1103: Note that since each
1.16 jmc 1104: .Fn unput
1105: puts the given character back at the beginning of the input stream,
1106: pushing back strings must be done back-to-front.
1107: .Pp
1.1 deraadt 1108: An important potential problem when using
1.16 jmc 1109: .Fn unput
1110: is that if using
1111: .Dq %pointer
1112: .Pq the default ,
1113: a call to
1114: .Fn unput
1115: destroys the contents of
1116: .Fa yytext ,
1.1 deraadt 1117: starting with its rightmost character and devouring one character to
1.16 jmc 1118: the left with each call.
1119: If the value of
1120: .Fa yytext
1121: should be preserved after a call to
1122: .Fn unput
1123: .Pq as in the above example ,
1124: it must either first be copied elsewhere, or the scanner must be built using
1125: .Dq %array
1126: instead (see
1127: .Sx HOW THE INPUT IS MATCHED ) .
1128: .Pp
1129: Finally, note that EOF cannot be put back
1.1 deraadt 1130: in an attempt to mark the input stream with an end-of-file.
1.16 jmc 1131: .It input()
1132: Reads the next character from the input stream.
1133: For example, the following is one way to eat up C comments:
1134: .Bd -literal -offset indent
1135: %%
1136: "/*" {
1137: int c;
1138:
1139: for (;;) {
1140: while ((c = input()) != '*' && c != EOF)
1141: ; /* eat up text of comment */
1142:
1143: if (c == '*') {
1144: while ((c = input()) == '*')
1145: ;
1146: if (c == '/')
1147: break; /* found the end */
1148: }
1149:
1150: if (c == EOF) {
1151: errx(1, "EOF in comment");
1.1 deraadt 1152: break;
1153: }
1.16 jmc 1154: }
1155: }
1156: .Ed
1157: .Pp
1158: (Note that if the scanner is compiled using C++, then
1159: .Fn input
1.1 deraadt 1160: is instead referred to as
1.16 jmc 1161: .Fn yyinput ,
1162: in order to avoid a name clash with the C++ stream by the name of input.)
1163: .It YY_FLUSH_BUFFER
1164: Flushes the scanner's internal buffer
1165: so that the next time the scanner attempts to match a token,
1166: it will first refill the buffer using
1167: .Dv YY_INPUT
1168: (see
1169: .Sx THE GENERATED SCANNER ,
1170: below).
1171: This action is a special case of the more general
1172: .Fn yy_flush_buffer
1173: function, described below in the section
1174: .Sx MULTIPLE INPUT BUFFERS .
1175: .It yyterminate()
1176: Can be used in lieu of a return statement in an action.
1177: It terminates the scanner and returns a 0 to the scanner's caller, indicating
1178: .Qq all done .
1.1 deraadt 1179: By default,
1.16 jmc 1180: .Fn yyterminate
1181: is also called when an end-of-file is encountered.
1182: It is a macro and may be redefined.
1183: .El
1184: .Sh THE GENERATED SCANNER
1.1 deraadt 1185: The output of
1.16 jmc 1186: .Nm
1.1 deraadt 1187: is the file
1.16 jmc 1188: .Pa lex.yy.c ,
1.1 deraadt 1189: which contains the scanning routine
1.16 jmc 1190: .Fn yylex ,
1191: a number of tables used by it for matching tokens,
1192: and a number of auxiliary routines and macros.
1193: By default,
1194: .Fn yylex
1.1 deraadt 1195: is declared as follows:
1.16 jmc 1196: .Bd -unfilled -offset indent
1197: int yylex()
1198: {
1199: ... various definitions and the actions in here ...
1200: }
1201: .Ed
1202: .Pp
1203: (If the environment supports function prototypes, then it will
1204: be "int yylex(void)".)
1205: This definition may be changed by defining the
1206: .Dv YY_DECL
1207: macro.
1208: For example:
1209: .Bd -literal -offset indent
1210: #define YY_DECL float lexscan(a, b) float a, b;
1211: .Ed
1212: .Pp
1213: would give the scanning routine the name
1214: .Em lexscan ,
1215: returning a float, and taking two floats as arguments.
1216: Note that if arguments are given to the scanning routine using a
1217: K&R-style/non-prototyped function declaration,
1218: the definition must be terminated with a semi-colon
1219: .Pq Sq ;\& .
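.Pp
With an ANSI-style prototype, no trailing semi-colon is needed.
The following is a minimal sketch of a comparable declaration in
prototyped form.
It assumes an ANSI C compiler, and any caller must declare
.Em lexscan
itself:
.Bd -literal -offset indent
%{
/* prototyped counterpart of the K&R-style example above */
#define YY_DECL float lexscan(float a, float b)
%}
.Ed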
1220: .Pp
1.1 deraadt 1221: Whenever
1.16 jmc 1222: .Fn yylex
1.1 deraadt 1223: is called, it scans tokens from the global input file
1.16 jmc 1224: .Pa yyin
1225: .Pq which defaults to stdin .
1226: It continues until it either reaches an end-of-file
1227: .Pq at which point it returns the value 0
1228: or one of its actions executes a
1229: .Em return
1.1 deraadt 1230: statement.
1.16 jmc 1231: .Pp
1.1 deraadt 1232: If the scanner reaches an end-of-file, subsequent calls are undefined
1233: unless either
1.16 jmc 1234: .Em yyin
1235: is pointed at a new input file
1236: .Pq in which case scanning continues from that file ,
1237: or
1238: .Fn yyrestart
1.1 deraadt 1239: is called.
1.16 jmc 1240: .Fn yyrestart
1.1 deraadt 1241: takes one argument, a
1.16 jmc 1242: .Fa FILE *
1243: pointer (which can be nil, if
1244: .Dv YY_INPUT
1245: has been set up to scan from a source other than
1246: .Em yyin ) ,
1.1 deraadt 1247: and initializes
1.16 jmc 1248: .Em yyin
1249: for scanning from that file.
1250: Essentially there is no difference between just assigning
1251: .Em yyin
1.1 deraadt 1252: to a new input file or using
1.16 jmc 1253: .Fn yyrestart
1254: to do so; the latter is available for compatibility with previous versions of
1255: .Nm ,
1.1 deraadt 1256: and because it can be used to switch input files in the middle of scanning.
1.16 jmc 1257: It can also be used to throw away the current input buffer,
1258: by calling it with an argument of
1259: .Em yyin ;
1.1 deraadt 1260: but it is better to use
1.16 jmc 1261: .Dv YY_FLUSH_BUFFER
1262: .Pq see above .
1.1 deraadt 1263: Note that
1.16 jmc 1264: .Fn yyrestart
1265: does not reset the start condition to
1266: .Em INITIAL
1267: (see
1268: .Sx START CONDITIONS ,
1269: below).
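.Pp
As an illustration, the following sketch scans a list of files in turn,
calling
.Fn yyrestart
before each one.
The function
.Fn scan_files
and its argument are hypothetical.
The sketch assumes that the code is placed in the user code section of the
.Nm
input, that
.In err.h
has been included, and that no action returns 0 before end-of-file:
.Bd -literal -offset indent
/* hypothetical helper: scan each named file in turn */
void
scan_files(char **files)
{
	FILE *fp;

	for (; *files != NULL; files++) {
		if ((fp = fopen(*files, "r")) == NULL) {
			warn("%s", *files);
			continue;
		}
		yyrestart(fp);	/* scan from fp, dropping old input */
		while (yylex() != 0)
			;	/* run actions until end-of-file */
		(void)fclose(fp);
	}
}
.Ed
.Pp
Since the start condition is not reset, a scanner using start conditions
may also want to execute
.Dq BEGIN(INITIAL)
before each file.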
1270: .Pp
1.1 deraadt 1271: If
1.16 jmc 1272: .Fn yylex
1.1 deraadt 1273: stops scanning due to executing a
1.16 jmc 1274: .Em return
1.1 deraadt 1275: statement in one of the actions, the scanner may then be called again and it
1276: will resume scanning where it left off.
1.16 jmc 1277: .Pp
1278: By default
1279: .Pq and for purposes of efficiency ,
1280: the scanner uses block-reads rather than simple
1281: .Xr getc 3
1.1 deraadt 1282: calls to read characters from
1.16 jmc 1283: .Em yyin .
1.1 deraadt 1284: The nature of how it gets its input can be controlled by defining the
1.16 jmc 1285: .Dv YY_INPUT
1.1 deraadt 1286: macro.
1.16 jmc 1287: .Dv YY_INPUT Ns 's
1288: calling sequence is
1289: .Qq YY_INPUT(buf,result,max_size) .
1290: Its action is to place up to
1291: .Dv max_size
1.1 deraadt 1292: characters in the character array
1.16 jmc 1293: .Em buf
1.1 deraadt 1294: and return in the integer variable
1.16 jmc 1295: .Em result
1296: either the number of characters read or the constant
1297: .Dv YY_NULL
1298: (0 on
1299: .Ux
1300: systems)
1301: to indicate
1302: .Dv EOF .
1303: The default
1304: .Dv YY_INPUT
1305: reads from the global file-pointer
1306: .Qq yyin .
1307: .Pp
1308: A sample definition of
1309: .Dv YY_INPUT
1310: .Pq in the definitions section of the input file :
1311: .Bd -unfilled -offset indent
1312: %{
1313: #define YY_INPUT(buf,result,max_size) \e
1314: { \e
1315: int c = getchar(); \e
1316: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1317: }
1318: %}
1319: .Ed
1320: .Pp
1.1 deraadt 1321: This definition will change the input processing to occur
1322: one character at a time.
1.16 jmc 1323: .Pp
1324: When the scanner receives an end-of-file indication from
1325: .Dv YY_INPUT ,
1.1 deraadt 1326: it then checks the
1.16 jmc 1327: .Fn yywrap
1328: function.
1329: If
1330: .Fn yywrap
1331: returns false
1332: .Pq zero ,
1333: then it is assumed that the function has gone ahead and set up
1334: .Em yyin
1335: to point to another input file, and scanning continues.
1336: If it returns true
1337: .Pq non-zero ,
1338: then the scanner terminates, returning 0 to its caller.
1339: Note that in either case, the start condition remains unchanged;
1340: it does not revert to
1341: .Em INITIAL .
1342: .Pp
1.1 deraadt 1343: If you do not supply your own version of
1.16 jmc 1344: .Fn yywrap ,
1.1 deraadt 1345: then you must either use
1.16 jmc 1346: .Dq %option noyywrap
1.1 deraadt 1347: (in which case the scanner behaves as though
1.16 jmc 1348: .Fn yywrap
1.1 deraadt 1349: returned 1), or you must link with
1.16 jmc 1350: .Fl lfl
1.1 deraadt 1351: to obtain the default version of the routine, which always returns 1.
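.Pp
A user-supplied
.Fn yywrap
might look like the following sketch, which advances through a
NULL-terminated array of file names.
The array
.Va filelist
is hypothetical and is assumed to be set up by the caller before
scanning starts:
.Bd -literal -offset indent
%{
#include <stdio.h>

char **filelist;	/* hypothetical: remaining input file names */
%}
%%
\&...rules...
%%
int
yywrap(void)
{
	FILE *fp;

	while (filelist != NULL && *filelist != NULL) {
		fp = fopen(*filelist++, "r");
		if (fp == NULL)
			continue;	/* skip unreadable files */
		if (yyin != NULL && yyin != stdin)
			(void)fclose(yyin);
		yyin = fp;
		return (0);	/* more input: keep scanning */
	}
	return (1);		/* no more files: terminate */
}
.Ed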
1.16 jmc 1352: .Pp
1.1 deraadt 1353: Three routines are available for scanning from in-memory buffers rather
1354: than files:
1.16 jmc 1355: .Fn yy_scan_string ,
1356: .Fn yy_scan_bytes ,
1.1 deraadt 1357: and
1.16 jmc 1358: .Fn yy_scan_buffer .
1359: See the discussion of them below in the section
1360: .Sx MULTIPLE INPUT BUFFERS .
1361: .Pp
1.1 deraadt 1362: The scanner writes its
1.16 jmc 1363: .Em ECHO
1.1 deraadt 1364: output to the
1.16 jmc 1365: .Em yyout
1366: global
1367: .Pq default, stdout ,
1368: which may be redefined by the user simply by assigning it to some other
1369: .Va FILE
1.1 deraadt 1370: pointer.
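.Pp
For instance, the following sketch of a
.Fn main
routine sends
.Em ECHO
output to a file named on the command line.
The argument handling is only illustrative, and the code is assumed to
live in the user code section with
.In err.h
included:
.Bd -literal -offset indent
int
main(int argc, char *argv[])
{
	/* hypothetical usage: optional first argument names the output */
	if (argc > 1 && (yyout = fopen(argv[1], "w")) == NULL)
		err(1, "%s", argv[1]);
	while (yylex() != 0)
		;
	return (0);
}
.Ed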
1.16 jmc 1371: .Sh START CONDITIONS
1372: .Nm
1373: provides a mechanism for conditionally activating rules.
1374: Any rule whose pattern is prefixed with
1375: .Qq Aq sc
1376: will only be active when the scanner is in the start condition named
1377: .Qq sc .
1378: For example,
1379: .Bd -literal -offset indent
1380: <STRING>[^"]* { /* eat up the string body ... */
1381: ...
1382: }
1383: .Ed
1384: .Pp
1385: will be active only when the scanner is in the
1386: .Qq STRING
1387: start condition, and
1388: .Bd -literal -offset indent
1389: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1390: ...
1391: }
1392: .Ed
1393: .Pp
1394: will be active only when the current start condition is either
1395: .Qq INITIAL ,
1396: .Qq STRING ,
1397: or
1398: .Qq QUOTE .
1399: .Pp
1400: Start conditions are declared in the definitions
1401: .Pq first
1402: section of the input using unindented lines beginning with either
1403: .Sq %s
1.1 deraadt 1404: or
1.16 jmc 1405: .Sq %x
1.1 deraadt 1406: followed by a list of names.
1407: The former declares
1.16 jmc 1408: .Em inclusive
1.1 deraadt 1409: start conditions, the latter
1.16 jmc 1410: .Em exclusive
1411: start conditions.
1412: A start condition is activated using the
1413: .Em BEGIN
1414: action.
1415: Until the next
1416: .Em BEGIN
1417: action is executed, rules with the given start condition will be active and
1.1 deraadt 1418: rules with other start conditions will be inactive.
1.16 jmc 1419: If the start condition is inclusive,
1.1 deraadt 1420: then rules with no start conditions at all will also be active.
1.16 jmc 1421: If it is exclusive,
1422: then only rules qualified with the start condition will be active.
1.1 deraadt 1423: A set of rules contingent on the same exclusive start condition
1424: describe a scanner which is independent of any of the other rules in the
1.16 jmc 1425: .Nm
1426: input.
1427: Because of this, exclusive start conditions make it easy to specify
1428: .Qq mini-scanners
1.1 deraadt 1429: which scan portions of the input that are syntactically different
1.16 jmc 1430: from the rest
1431: .Pq e.g., comments .
1432: .Pp
1.1 deraadt 1433: If the distinction between inclusive and exclusive start conditions
1434: is still a little vague, here's a simple example illustrating the
1.16 jmc 1435: connection between the two.
1436: The set of rules:
1437: .Bd -literal -offset indent
1438: %s example
1439: %%
1440:
1441: <example>foo do_something();
1442:
1443: bar something_else();
1444: .Ed
1445: .Pp
1.1 deraadt 1446: is equivalent to
1.16 jmc 1447: .Bd -literal -offset indent
1448: %x example
1449: %%
1450:
1451: <example>foo do_something();
1452:
1453: <INITIAL,example>bar something_else();
1454: .Ed
1455: .Pp
1.1 deraadt 1456: Without the
1.16 jmc 1457: .Aq INITIAL,example
1.1 deraadt 1458: qualifier, the
1.16 jmc 1459: .Dq bar
1460: pattern in the second example wouldn't be active
1461: .Pq i.e., couldn't match
1.1 deraadt 1462: when in start condition
1.16 jmc 1463: .Dq example .
1.1 deraadt 1464: If we just used
1.16 jmc 1465: .Aq example
1.1 deraadt 1466: to qualify
1.16 jmc 1467: .Dq bar ,
1.1 deraadt 1468: though, then it would only be active in
1.16 jmc 1469: .Dq example
1.1 deraadt 1470: and not in
1.16 jmc 1471: .Em INITIAL ,
1472: while in the first example it's active in both,
1473: because in the first example the
1474: .Dq example
1475: start condition is an inclusive
1476: .Pq Sq %s
1.1 deraadt 1477: start condition.
1.16 jmc 1478: .Pp
1.1 deraadt 1479: Also note that the special start-condition specifier
1.16 jmc 1480: .Sq Aq *
1481: matches every start condition.
1482: Thus, the above example could also have been written:
1483: .Bd -literal -offset indent
1484: %x example
1485: %%
1486:
1487: <example>foo do_something();
1488:
1489: <*>bar something_else();
1490: .Ed
1491: .Pp
1.1 deraadt 1492: The default rule (to
1.16 jmc 1493: .Em ECHO
1494: any unmatched character) remains active in start conditions.
1495: It is equivalent to:
1496: .Bd -literal -offset indent
1497: <*>.|\en ECHO;
1498: .Ed
1499: .Pp
1500: .Dq BEGIN(0)
1.1 deraadt 1501: returns to the original state where only the rules with
1.16 jmc 1502: no start conditions are active.
1503: This state can also be referred to as the start-condition
1504: .Em INITIAL ,
1505: so
1506: .Dq BEGIN(INITIAL)
1.1 deraadt 1507: is equivalent to
1.16 jmc 1508: .Dq BEGIN(0) .
1.1 deraadt 1509: (The parentheses around the start condition name are not required but
1510: are considered good style.)
1.16 jmc 1511: .Pp
1512: .Em BEGIN
1.1 deraadt 1513: actions can also be given as indented code at the beginning
1.16 jmc 1514: of the rules section.
1515: For example, the following will cause the scanner to enter the
1516: .Qq SPECIAL
1517: start condition whenever
1518: .Fn yylex
1.1 deraadt 1519: is called and the global variable
1.16 jmc 1520: .Fa enter_special
1.1 deraadt 1521: is true:
1.16 jmc 1522: .Bd -literal -offset indent
1523: int enter_special;
1.1 deraadt 1524:
1.16 jmc 1525: %x SPECIAL
1526: %%
1527: if (enter_special)
1.1 deraadt 1528: BEGIN(SPECIAL);
1529:
1.16 jmc 1530: <SPECIAL>blahblahblah
1531: \&...more rules follow...
1532: .Ed
1533: .Pp
1.1 deraadt 1534: To illustrate the uses of start conditions,
1535: here is a scanner which provides two different interpretations
1.16 jmc 1536: of a string like
1537: .Qq 123.456 .
1538: By default it will treat it as three tokens: the integer
1539: .Qq 123 ,
1540: a dot
1541: .Pq Sq .\& ,
1542: and the integer
1543: .Qq 456 .
1.1 deraadt 1544: But if the string is preceded earlier in the line by the string
1.16 jmc 1545: .Qq expect-floats
1546: it will treat it as a single token, the floating-point number 123.456:
1547: .Bd -literal -offset indent
1548: %{
1549: #include <stdlib.h>
1550: %}
1551: %s expect
1552:
1553: %%
1554: expect-floats BEGIN(expect);
1555:
1556: <expect>[0-9]+"."[0-9]+ {
1557: printf("found a float, = %f\en",
1558: atof(yytext));
1559: }
1560: <expect>\en {
1561: /*
1562: * That's the end of the line, so
1563: * we need another "expect-number"
1564: * before we'll recognize any more
1565: * numbers.
1566: */
1567: BEGIN(INITIAL);
1568: }
1569:
1570: [0-9]+ {
1571: printf("found an integer, = %d\en",
1572: atoi(yytext));
1573: }
1574:
1575: "." printf("found a dot\en");
1576: .Ed
1577: .Pp
1578: Here is a scanner which recognizes
1579: .Pq and discards
1580: C comments while maintaining a count of the current input line:
1581: .Bd -literal -offset indent
1582: %x comment
1583: %%
1584: int line_num = 1;
1585:
1586: "/*" BEGIN(comment);
1587:
1588: <comment>[^*\en]* /* eat anything that's not a '*' */
1589: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1590: <comment>\en ++line_num;
1591: <comment>"*"+"/" BEGIN(INITIAL);
1592: .Ed
1593: .Pp
1.1 deraadt 1594: This scanner goes to a bit of trouble to match as much
1.16 jmc 1595: text as possible with each rule.
1596: In general, when attempting to write a high-speed scanner
1597: try to match as much as possible in each rule, as it's a big win.
1598: .Pp
1.10 deraadt 1599: Note that start-condition names are really integer values and
1.16 jmc 1600: can be stored as such.
1601: Thus, the above could be extended in the following fashion:
1602: .Bd -literal -offset indent
1603: %x comment foo
1604: %%
1605: int line_num = 1;
1606: int comment_caller;
1607:
1608: "/*" {
1609: comment_caller = INITIAL;
1610: BEGIN(comment);
1611: }
1612:
1613: \&...
1614:
1615: <foo>"/*" {
1616: comment_caller = foo;
1617: BEGIN(comment);
1618: }
1619:
1620: <comment>[^*\en]* /* eat anything that's not a '*' */
1621: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1622: <comment>\en ++line_num;
1623: <comment>"*"+"/" BEGIN(comment_caller);
1624: .Ed
1625: .Pp
1626: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1627: the integer-valued
1.16 jmc 1628: .Dv YY_START
1629: macro.
1630: For example, the above assignments to
1631: .Em comment_caller
1.1 deraadt 1632: could instead be written
1.16 jmc 1633: .Pp
1634: .Dl comment_caller = YY_START;
1635: .Pp
1.1 deraadt 1636: Flex provides
1.16 jmc 1637: .Dv YYSTATE
1.1 deraadt 1638: as an alias for
1.16 jmc 1639: .Dv YY_START
1.1 deraadt 1640: (since that is what's used by AT&T
1.16 jmc 1641: .Nm lex ) .
1642: .Pp
1643: Note that start conditions do not have their own name-space;
1644: %s's and %x's declare names in the same fashion as #define's.
1645: .Pp
1.1 deraadt 1646: Finally, here's an example of how to match C-style quoted strings using
1.16 jmc 1647: exclusive start conditions, including expanded escape sequences
1648: (but not including checking for a string that's too long):
1649: .Bd -literal -offset indent
1650: %x str
1651:
1652: %%
1653: #define MAX_STR_CONST 1024
1654: char string_buf[MAX_STR_CONST];
1655: char *string_buf_ptr;
1656:
1657: \e" string_buf_ptr = string_buf; BEGIN(str);
1658:
1659: <str>\e" { /* saw closing quote - all done */
1660: BEGIN(INITIAL);
1661: *string_buf_ptr = '\e0';
1662: /*
1663: * return string constant token type and
1664: * value to parser
1665: */
1666: }
1667:
1668: <str>\en {
1669: /* error - unterminated string constant */
1670: /* generate error message */
1671: }
1672:
1673: <str>\e\e[0-7]{1,3} {
1674: /* octal escape sequence */
1675: unsigned int result;
1676:
1677: (void) sscanf(yytext + 1, "%o", &result);
1678:
1679: if (result > 0xff) {
1680: /* error, constant is out-of-bounds */
1681: } else
1682: *string_buf_ptr++ = result;
1683: }
1684:
1685: <str>\e\e[0-9]+ {
1686: /*
1687: * generate error - bad escape sequence; something
1688: * like '\e48' or '\e0777777'
1689: */
1690: }
1691:
1692: <str>\e\en *string_buf_ptr++ = '\en';
1693: <str>\e\et *string_buf_ptr++ = '\et';
1694: <str>\e\er *string_buf_ptr++ = '\er';
1695: <str>\e\eb *string_buf_ptr++ = '\eb';
1696: <str>\e\ef *string_buf_ptr++ = '\ef';
1697:
1698: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1699:
1700: <str>[^\e\e\en\e"]+ {
1701: char *yptr = yytext;
1702:
1703: while (*yptr)
1704: *string_buf_ptr++ = *yptr++;
1705: }
1706: .Ed
1707: .Pp
1708: Often, such as in some of the examples above,
1709: a whole bunch of rules are all preceded by the same start condition(s).
1710: .Nm
1.1 deraadt 1711: makes this a little easier and cleaner by introducing a notion of
1712: start condition
1.16 jmc 1713: .Em scope .
1.1 deraadt 1714: A start condition scope is begun with:
1.16 jmc 1715: .Pp
1716: .Dl <SCs>{
1717: .Pp
1.1 deraadt 1718: where
1.16 jmc 1719: .Dq SCs
1720: is a list of one or more start conditions.
1721: Inside the start condition scope, every rule automatically has the prefix
1722: .Aq SCs
1.1 deraadt 1723: applied to it, until a
1.16 jmc 1724: .Sq }
1.1 deraadt 1725: which matches the initial
1.16 jmc 1726: .Sq { .
1.1 deraadt 1727: So, for example,
1.16 jmc 1728: .Bd -literal -offset indent
1729: <ESC>{
1730: "\e\en" return '\en';
1731: "\e\er" return '\er';
1732: "\e\ef" return '\ef';
1733: "\e\e0" return '\e0';
1734: }
1735: .Ed
1736: .Pp
1.1 deraadt 1737: is equivalent to:
1.16 jmc 1738: .Bd -literal -offset indent
1739: <ESC>"\e\en" return '\en';
1740: <ESC>"\e\er" return '\er';
1741: <ESC>"\e\ef" return '\ef';
1742: <ESC>"\e\e0" return '\e0';
1743: .Ed
1744: .Pp
1.1 deraadt 1745: Start condition scopes may be nested.
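.Pp
As a sketch of how a scope can tidy up an earlier example,
the comment-discarding scanner shown above could be written as follows.
This is only one possible arrangement and is meant to behave the same
as the original:
.Bd -literal -offset indent
%x comment
%%
	int line_num = 1;

"/*"	BEGIN(comment);

<comment>{
[^*\en]*	/* eat anything that's not a '*' */
"*"+[^*/\en]*	/* eat up '*'s not followed by '/'s */
\en	++line_num;
"*"+"/"	BEGIN(INITIAL);
}
.Ed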
1.16 jmc 1746: .Pp
1.1 deraadt 1747: Three routines are available for manipulating stacks of start conditions:
1.16 jmc 1748: .Bl -tag -width Ds
1749: .It void yy_push_state(int new_state)
1750: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1751: stack and switches to
1.16 jmc 1752: .Fa new_state
1753: as though
1754: .Dq BEGIN new_state
1755: had been used
1756: .Pq recall that start condition names are also integers .
1757: .It void yy_pop_state()
1758: Pops the top of the stack and switches to it via
1759: .Em BEGIN .
1760: .It int yy_top_state()
1761: Returns the top of the stack without altering the stack's contents.
1762: .El
1763: .Pp
1.1 deraadt 1764: The start condition stack grows dynamically and so has no built-in
1.16 jmc 1765: size limitation.
1766: If memory is exhausted, program execution aborts.
1767: .Pp
1768: To use start condition stacks, scanners must include a
1769: .Dq %option stack
1770: directive (see
1771: .Sx OPTIONS
1772: below).
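.Pp
The following sketch shows how the stack might replace the
.Em comment_caller
variable used in the earlier comment example.
The start condition
.Dq foo
is, as before, only a placeholder:
.Bd -literal -offset indent
%option stack
%x comment foo
%%
<INITIAL,foo>"/*"	yy_push_state(comment);

<comment>[^*\en]*	/* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*	/* eat up '*'s not followed by '/'s */
<comment>\en		/* eat newlines */
<comment>"*"+"/"	yy_pop_state();	/* back to the caller */
.Ed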
1773: .Sh MULTIPLE INPUT BUFFERS
1774: Some scanners
1775: (such as those which support
1776: .Qq include
1777: files)
1778: require reading from several input streams.
1779: As
1780: .Nm
1.1 deraadt 1781: scanners do a large amount of buffering, one cannot control
1782: where the next input will be read from by simply writing a
1.16 jmc 1783: .Dv YY_INPUT
1.1 deraadt 1784: which is sensitive to the scanning context.
1.16 jmc 1785: .Dv YY_INPUT
1.1 deraadt 1786: is only called when the scanner reaches the end of its buffer, which
1.16 jmc 1787: may be a long time after scanning a statement such as an
1788: .Qq include
1.1 deraadt 1789: which requires switching the input source.
1.16 jmc 1790: .Pp
1.1 deraadt 1791: To negotiate these sorts of problems,
1.16 jmc 1792: .Nm
1.1 deraadt 1793: provides a mechanism for creating and switching between multiple
1.16 jmc 1794: input buffers.
1795: An input buffer is created by using:
1796: .Pp
1797: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1798: .Pp
1.1 deraadt 1799: which takes a
1.16 jmc 1800: .Fa FILE
1801: pointer and a
1802: .Fa size
1803: and creates a buffer associated with the given file and large enough to hold
1804: .Fa size
1.1 deraadt 1805: characters (when in doubt, use
1.16 jmc 1806: .Dv YY_BUF_SIZE
1807: for the size).
1808: It returns a
1809: .Dv YY_BUFFER_STATE
1810: handle, which may then be passed to other routines
1811: .Pq see below .
1812: The
1813: .Dv YY_BUFFER_STATE
1.1 deraadt 1814: type is a pointer to an opaque
1.16 jmc 1815: .Dq struct yy_buffer_state
1816: structure, so
1817: .Dv YY_BUFFER_STATE
1818: variables may be safely initialized to
1819: .Dq ((YY_BUFFER_STATE) 0)
1820: if desired, and the opaque structure can also be referred to in order to
1821: correctly declare input buffers in source files other than that of scanners.
1822: Note that the
1823: .Fa FILE
1.1 deraadt 1824: pointer in the call to
1.16 jmc 1825: .Fn yy_create_buffer
1.1 deraadt 1826: is only used as the value of
1.16 jmc 1827: .Fa yyin
1.1 deraadt 1828: seen by
1.16 jmc 1829: .Dv YY_INPUT ;
1830: if
1831: .Dv YY_INPUT
1832: is redefined so that it no longer uses
1833: .Fa yyin ,
1834: then a nil
1835: .Fa FILE
1836: pointer can safely be passed to
1837: .Fn yy_create_buffer .
1838: To select a particular buffer to scan:
1839: .Pp
1840: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1841: .Pp
1842: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1843: come from
1.16 jmc 1844: .Fa new_buffer .
1.1 deraadt 1845: Note that
1.16 jmc 1846: .Fn yy_switch_to_buffer
1847: may be used by
1848: .Fn yywrap
1849: to set things up for continued scanning,
1850: instead of opening a new file and pointing
1851: .Fa yyin
1852: at it.
1853: Note also that switching input sources via either
1854: .Fn yy_switch_to_buffer
1855: or
1856: .Fn yywrap
1857: does not change the start condition.
1858: .Pp
1859: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1860: .Pp
1861: is used to reclaim the storage associated with a buffer.
1862: .Pf ( Fa buffer
1.1 deraadt 1863: can be nil, in which case the routine does nothing.)
1.16 jmc 1864: To clear the current contents of a buffer:
1865: .Pp
1866: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1867: .Pp
1.1 deraadt 1868: This function discards the buffer's contents,
1.16 jmc 1869: so the next time the scanner attempts to match a token from the buffer,
1870: it will first fill the buffer anew using
1871: .Dv YY_INPUT .
1872: .Pp
1873: .Fn yy_new_buffer
1.1 deraadt 1874: is an alias for
1.16 jmc 1875: .Fn yy_create_buffer ,
1.1 deraadt 1876: provided for compatibility with the C++ use of
1.16 jmc 1877: .Em new
1.1 deraadt 1878: and
1.16 jmc 1879: .Em delete
1.1 deraadt 1880: for creating and destroying dynamic objects.
1.16 jmc 1881: .Pp
1.1 deraadt 1882: Finally, the
1.16 jmc 1883: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1884: macro returns a
1.16 jmc 1885: .Dv YY_BUFFER_STATE
1.1 deraadt 1886: handle to the current buffer.
1.16 jmc 1887: .Pp
1.1 deraadt 1888: Here is an example of using these features for writing a scanner
1889: which expands include files (the
1.16 jmc 1890: .Aq Aq EOF
1.1 deraadt 1891: feature is discussed below):
1.16 jmc 1892: .Bd -literal -offset indent
1893: /*
1894: * the "incl" state is used for picking up the name
1895: * of an include file
1896: */
1897: %x incl
1898:
1899: %{
1900: #define MAX_INCLUDE_DEPTH 10
1901: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1902: int include_stack_ptr = 0;
1903: %}
1904:
1905: %%
1906: include BEGIN(incl);
1907:
1908: [a-z]+ ECHO;
1909: [^a-z\en]*\en? ECHO;
1910:
1911: <incl>[ \et]* /* eat the whitespace */
1912: <incl>[^ \et\en]+ { /* got the include file name */
1913: if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1914: errx(1, "Includes nested too deeply");
1915:
1916: include_stack[include_stack_ptr++] =
1917: YY_CURRENT_BUFFER;
1918:
1919: yyin = fopen(yytext, "r");
1920:
1921: if (yyin == NULL)
1922: err(1, NULL);
1.1 deraadt 1923:
1.16 jmc 1924: yy_switch_to_buffer(
1925: yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1926:
1.16 jmc 1927: BEGIN(INITIAL);
1928: }
1.1 deraadt 1929:
1.16 jmc 1930: <<EOF>> {
1931: if (--include_stack_ptr < 0)
1.1 deraadt 1932: yyterminate();
1.16 jmc 1933: else {
1934: yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1935: yy_switch_to_buffer(
1.16 jmc 1936: include_stack[include_stack_ptr]);
1937: }
1938: }
1939: .Ed
1940: .Pp
1.1 deraadt 1941: Three routines are available for setting up input buffers for
1.16 jmc 1942: scanning in-memory strings instead of files.
1943: All of them create a new input buffer for scanning the string,
1944: and return a corresponding
1945: .Dv YY_BUFFER_STATE
1946: handle (which should be deleted afterwards using
1947: .Fn yy_delete_buffer ) .
1948: They also switch to the new buffer using
1949: .Fn yy_switch_to_buffer ,
1.1 deraadt 1950: so the next call to
1.16 jmc 1951: .Fn yylex
1.1 deraadt 1952: will start scanning the string.
1.16 jmc 1953: .Bl -tag -width Ds
1954: .It yy_scan_string(const char *str)
1955: Scans a NUL-terminated string.
1956: .It yy_scan_bytes(const char *bytes, int len)
1957: Scans
1958: .Fa len
1959: bytes
1960: .Pq including possibly NUL's
1.1 deraadt 1961: starting at location
1.16 jmc 1962: .Fa bytes .
1963: .El
1964: .Pp
1965: Note that both of these functions create and scan a copy
1966: of the string or bytes.
1967: (This may be desirable, since
1968: .Fn yylex
1969: modifies the contents of the buffer it is scanning.)
1970: The copy can be avoided by using:
1971: .Bl -tag -width Ds
1972: .It yy_scan_buffer(char *base, yy_size_t size)
1973: Which scans the buffer starting at
1974: .Fa base ,
1.1 deraadt 1975: consisting of
1.16 jmc 1976: .Fa size
1977: bytes, the last two bytes of which must be
1978: .Dv YY_END_OF_BUFFER_CHAR
1979: .Pq ASCII NUL .
1980: These last two bytes are not scanned; thus, scanning consists of
1981: base[0] through base[size-3], inclusive.
1982: .Pp
1983: If
1984: .Fa base
1985: is not set up in this manner
1986: (e.g., by omitting the final two
1987: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 1988: bytes), then
1.16 jmc 1989: .Fn yy_scan_buffer
1.1 deraadt 1990: returns a nil pointer instead of creating a new input buffer.
1.16 jmc 1991: .Pp
1.1 deraadt 1992: The type
1.16 jmc 1993: .Fa yy_size_t
1994: is an integral type to which an integer expression
1.1 deraadt 1995: reflecting the size of the buffer can be cast.
1.16 jmc 1996: .El
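.Pp
The following sketch shows both approaches.
The helper names and the sample text are hypothetical.
The code is assumed to be placed in the user code section, with
.In err.h
included, in a scanner whose actions do not return 0 before end-of-file:
.Bd -literal -offset indent
/* hypothetical helper: scan a copy of a NUL-terminated string */
void
scan_string(const char *s)
{
	YY_BUFFER_STATE b;

	b = yy_scan_string(s);	/* copies s, switches to the copy */
	while (yylex() != 0)
		;
	yy_delete_buffer(b);
}

/* hypothetical helper: scan a buffer in place, without copying */
void
scan_in_place(void)
{
	/* explicit \e0 plus the implicit terminator: two NULs */
	static char buf[] = "some input\e0";
	YY_BUFFER_STATE b;

	b = yy_scan_buffer(buf, sizeof(buf));
	if (b == NULL)
		errx(1, "buffer lacks two trailing NULs");
	while (yylex() != 0)
		;
	yy_delete_buffer(b);
}
.Ed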
1997: .Sh END-OF-FILE RULES
1998: The special rule
1999: .Qq Aq Aq EOF
2000: indicates actions which are to be taken when an end-of-file is encountered and
2001: .Fn yywrap
2002: returns non-zero
2003: .Pq i.e., indicates no further files to process .
2004: The action must finish by doing one of four things:
2005: .Bl -dash
2006: .It
2007: Assigning
2008: .Em yyin
2009: to a new input file
2010: (in previous versions of
2011: .Nm ,
2012: after doing the assignment, it was necessary to call the special action
2013: .Dv YY_NEW_FILE ;
2014: this is no longer necessary).
2015: .It
2016: Executing a
2017: .Em return
2018: statement.
2019: .It
2020: Executing the special
2021: .Fn yyterminate
2022: action.
2023: .It
2024: Switching to a new buffer using
2025: .Fn yy_switch_to_buffer
1.1 deraadt 2026: as shown in the example above.
1.16 jmc 2027: .El
2028: .Pp
2029: .Aq Aq EOF
2030: rules may not be used with other patterns;
2031: they may only be qualified with a list of start conditions.
2032: If an unqualified
2033: .Aq Aq EOF
2034: rule is given, it applies to all start conditions which do not already have
2035: .Aq Aq EOF
2036: actions.
2037: To specify an
2038: .Aq Aq EOF
2039: rule for only the initial start condition, use
2040: .Pp
2041: .Dl <INITIAL><<EOF>>
2042: .Pp
1.1 deraadt 2043: These rules are useful for catching things like unclosed comments.
2044: An example:
1.16 jmc 2045: .Bd -literal -offset indent
2046: %x quote
2047: %%
2048:
2049: \&...other rules for dealing with quotes...
2050:
2051: <quote><<EOF>> {
2052: error("unterminated quote");
2053: yyterminate();
2054: }
2055: <<EOF>> {
2056: if (*++filelist)
2057: yyin = fopen(*filelist, "r");
2058: else
2059: yyterminate();
2060: }
2061: .Ed
2062: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2063: The macro
1.16 jmc 2064: .Dv YY_USER_ACTION
1.1 deraadt 2065: can be defined to provide an action
1.16 jmc 2066: which is always executed prior to the matched rule's action.
2067: For example,
1.1 deraadt 2068: it could be #define'd to call a routine to convert yytext to lower-case.
2069: When
1.16 jmc 2070: .Dv YY_USER_ACTION
1.1 deraadt 2071: is invoked, the variable
1.16 jmc 2072: .Fa yy_act
2073: gives the number of the matched rule
2074: .Pq rules are numbered starting with 1 .
2075: For example, to profile how often each rule is matched,
2076: the following would do the trick:
2077: .Pp
2078: .Dl #define YY_USER_ACTION ++ctr[yy_act]
2079: .Pp
1.1 deraadt 2080: where
1.16 jmc 2081: .Fa ctr
2082: is an array to hold the counts for the different rules.
2083: Note that the macro
2084: .Dv YY_NUM_RULES
2085: gives the total number of rules
2086: (including the default rule, even if
2087: .Fl s
2088: is used),
1.1 deraadt 2089: so a correct declaration for
1.16 jmc 2090: .Fa ctr
1.1 deraadt 2091: is:
1.16 jmc 2092: .Pp
2093: .Dl int ctr[YY_NUM_RULES];
2094: .Pp
1.1 deraadt 2095: The macro
1.16 jmc 2096: .Dv YY_USER_INIT
1.1 deraadt 2097: may be defined to provide an action which is always executed before
1.16 jmc 2098: the first scan
2099: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2100: For example, it could be used to call a routine to read
2101: in a data table or open a logging file.
1.16 jmc 2102: .Pp
1.1 deraadt 2103: The macro
1.16 jmc 2104: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2105: can be used to control whether the current buffer is considered
1.16 jmc 2106: .Em interactive .
1.1 deraadt 2107: An interactive buffer is processed more slowly,
2108: but must be used when the scanner's input source is indeed
2109: interactive to avoid problems due to waiting to fill buffers
2110: (see the discussion of the
1.16 jmc 2111: .Fl I
2112: flag below).
2113: A non-zero value in the macro invocation marks the buffer as interactive,
2114: a zero value as non-interactive.
2115: Note that use of this macro overrides
2116: .Dq %option always-interactive
2117: or
2118: .Dq %option never-interactive
2119: (see
2120: .Sx OPTIONS
2121: below).
2122: .Fn yy_set_interactive
1.1 deraadt 2123: must be invoked prior to beginning to scan the buffer that is
1.16 jmc 2124: .Pq or is not
2125: to be considered interactive.
2126: .Pp
1.1 deraadt 2127: The macro
1.16 jmc 2128: .Dv yy_set_bol(at_bol)
1.1 deraadt 2129: can be used to control whether the next token match in the
2130: current buffer is performed as though the scanner were at the
1.16 jmc 2131: beginning of a line.
2132: A non-zero macro argument makes rules anchored with
2133: .Sq ^
2134: active, while a zero argument makes
2135: .Sq ^
2136: rules inactive.
2137: .Pp
1.1 deraadt 2138: The macro
1.16 jmc 2139: .Dv YY_AT_BOL
2140: returns true if the next token scanned from the current buffer will have
2141: .Sq ^
2142: rules active, false otherwise.
2143: .Pp
1.1 deraadt 2144: In the generated scanner, the actions are all gathered in one large
2145: switch statement and separated using
1.16 jmc 2146: .Dv YY_BREAK ,
2147: which may be redefined.
2148: By default, it is simply a
2149: .Qq break ,
2150: to separate each rule's action from the following rules.
1.1 deraadt 2151: Redefining
1.16 jmc 2152: .Dv YY_BREAK
1.1 deraadt 2153: allows, for example, C++ users to
1.16 jmc 2154: .Dq #define YY_BREAK
2155: to do nothing
2156: (while being very careful that every rule ends with a
2157: .Qq break
2158: or a
2159: .Qq return ! )
2160: to avoid suffering from unreachable statement warnings where, because a rule's
2161: action ends with
2162: .Dq return ,
2163: the
2164: .Dv YY_BREAK
1.1 deraadt 2165: is inaccessible.
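.Pp
The arrangement described above can be pictured with a small model in plain C.
This is only a sketch, not code emitted by
.Nm :
the function
.Fn run_action ,
the rule numbers and the token values are hypothetical,
and only the name
.Dv YY_BREAK
and its default definition are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Default definition, as described above: a plain "break". */
#ifndef YY_BREAK
#define YY_BREAK break;
#endif

/* Hypothetical token values used only by this sketch. */
enum { TOK_NONE = 0, TOK_NUMBER = 1 };

/*
 * Simplified model of the scanner's action switch: one case per rule,
 * each user action followed by YY_BREAK.  words_seen stands in for the
 * side effect of an action that does not return.
 */
static int
run_action(int rule, int *words_seen)
{
	assert(words_seen != NULL);

	switch (rule) {
	case 1:			/* action that leaves the switch normally */
		++*words_seen;
		YY_BREAK
	case 2:			/* action ending in "return" */
		return TOK_NUMBER;
		YY_BREAK	/* never reached: the source of the warnings */
	default:
		YY_BREAK
	}
	return TOK_NONE;
}
```

If
.Dv YY_BREAK
were defined as empty before this switch, rule 1 would fall through into
rule 2 and return
.Dv TOK_NUMBER ,
which is why every action then needs its own
.Qq break
or
.Qq return .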
1.16 jmc 2166: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2167: This section summarizes the various values available to the user
2168: in the rule actions.
1.16 jmc 2169: .Bl -tag -width Ds
2170: .It char *yytext
2171: Holds the text of the current token.
2172: It may be modified but not lengthened
2173: .Pq characters cannot be appended to the end .
2174: .Pp
1.1 deraadt 2175: If the special directive
1.16 jmc 2176: .Dq %array
1.1 deraadt 2177: appears in the first section of the scanner description, then
1.16 jmc 2178: .Fa yytext
1.1 deraadt 2179: is instead declared
1.16 jmc 2180: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2181: where
1.16 jmc 2182: .Dv YYLMAX
2183: is a macro definition that can be redefined in the first section
2184: to change the default value
2185: .Pq generally 8KB .
2186: Using
2187: .Dq %array
1.1 deraadt 2188: results in somewhat slower scanners, but the value of
1.16 jmc 2189: .Fa yytext
1.1 deraadt 2190: becomes immune to calls to
1.16 jmc 2191: .Fn input
1.1 deraadt 2192: and
1.16 jmc 2193: .Fn unput ,
1.1 deraadt 2194: which potentially destroy its value when
1.16 jmc 2195: .Fa yytext
2196: is a character pointer.
2197: The opposite of
2198: .Dq %array
1.1 deraadt 2199: is
1.16 jmc 2200: .Dq %pointer ,
1.1 deraadt 2201: which is the default.
1.16 jmc 2202: .Pp
2203: .Dq %array
2204: cannot be used when generating C++ scanner classes
1.1 deraadt 2205: (the
1.16 jmc 2206: .Fl +
1.1 deraadt 2207: flag).
1.16 jmc 2208: .It int yyleng
2209: Holds the length of the current token.
2210: .It FILE *yyin
2211: Is the file which by default
2212: .Nm
2213: reads from.
2214: It may be redefined, but doing so only makes sense before
2215: scanning begins or after an
2216: .Dv EOF
2217: has been encountered.
2218: Changing it in the midst of scanning will have unexpected results since
2219: .Nm
1.1 deraadt 2220: buffers its input; use
1.16 jmc 2221: .Fn yyrestart
1.1 deraadt 2222: instead.
2223: Once scanning terminates because an end-of-file
1.16 jmc 2224: has been seen,
2225: .Fa yyin
2226: can be assigned as the new input file
2227: and the scanner can be called again to continue scanning.
2228: .It void yyrestart(FILE *new_file)
2229: May be called to point
2230: .Fa yyin
2231: at the new input file.
2232: The switch-over to the new file is immediate
2233: .Pq any previously buffered-up input is lost .
2234: Note that calling
2235: .Fn yyrestart
1.1 deraadt 2236: with
1.16 jmc 2237: .Fa yyin
1.1 deraadt 2238: as an argument thus throws away the current input buffer and continues
2239: scanning the same input file.
1.16 jmc 2240: .It FILE *yyout
2241: Is the file to which
2242: .Em ECHO
2243: actions are done.
2244: It can be reassigned by the user.
2245: .It YY_CURRENT_BUFFER
2246: Returns a
2247: .Dv YY_BUFFER_STATE
1.1 deraadt 2248: handle to the current buffer.
1.16 jmc 2249: .It YY_START
2250: Returns an integer value corresponding to the current start condition.
2251: This value can subsequently be used with
2252: .Em BEGIN
1.1 deraadt 2253: to return to that start condition.
1.16 jmc 2254: .El
2255: .Sh INTERFACING WITH YACC
1.1 deraadt 2256: One of the main uses of
1.16 jmc 2257: .Nm
1.1 deraadt 2258: is as a companion to the
1.16 jmc 2259: .Xr yacc 1
1.1 deraadt 2260: parser-generator.
1.16 jmc 2261: yacc parsers expect to call a routine named
2262: .Fn yylex
2263: to find the next input token.
2264: The routine is supposed to return the type of the next token
2265: as well as putting any associated value in the global
1.17 jmc 2266: .Fa yylval ,
2267: which is defined externally,
2268: and can be a union or any other complex data structure.
1.1 deraadt 2269: To use
1.16 jmc 2270: .Nm
2271: with yacc, one specifies the
2272: .Fl d
2273: option to yacc to instruct it to generate the file
2274: .Pa y.tab.h
1.1 deraadt 2275: containing definitions of all the
1.16 jmc 2276: .Dq %tokens
2277: appearing in the yacc input.
2278: This file is then included in the
2279: .Nm
2280: scanner.
2281: For example, if one of the tokens is
2282: .Qq TOK_NUMBER ,
1.1 deraadt 2283: part of the scanner might look like:
1.16 jmc 2284: .Bd -literal -offset indent
2285: %{
2286: #include "y.tab.h"
2287: %}
2288:
2289: %%
2290:
2291: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
2292: .Ed
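.Pp
The contract between scanner and parser can be pictured with a hand-written
stand-in for
.Fn yylex .
The sketch below assumes that
.Fa yylval
is a plain int and that
.Dv TOK_NUMBER
has the value 258; in practice both come from
.Pa y.tab.h .
The pointer
.Fa src
is hypothetical and takes the place of the buffered input a real scanner
reads from
.Fa yyin .

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Normally supplied by y.tab.h; the values here are assumptions. */
#define TOK_NUMBER 258
int yylval;

/* Hypothetical input source standing in for the scanner's buffered yyin. */
static const char *src = "";

/*
 * Hand-written stand-in for the generated yylex(): return the token
 * type, leave the token's value in yylval, and return 0 at end of input.
 */
int
yylex(void)
{
	char *end;

	/* This sketch silently skips anything that is not a digit. */
	while (*src != '\0' && !isdigit((unsigned char)*src))
		src++;
	if (*src == '\0')
		return 0;

	yylval = (int)strtol(src, &end, 10);
	assert(end != src);	/* src pointed at a digit, so progress was made */
	src = end;
	return TOK_NUMBER;
}
```

A parser calling this routine repeatedly receives
.Dv TOK_NUMBER
for each run of digits, with the numeric value waiting in
.Fa yylval ,
and 0 once the input is exhausted.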
2293: .Sh OPTIONS
2294: .Nm
1.1 deraadt 2295: has the following options:
1.16 jmc 2296: .Bl -tag -width Ds
2297: .It Fl 7
2298: Instructs
2299: .Nm
2300: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2301: characters in its input.
2302: The advantage of using
2303: .Fl 7
1.1 deraadt 2304: is that the scanner's tables can be up to half the size of those generated
2305: using the
1.16 jmc 2306: .Fl 8
2307: option
2308: .Pq see below .
2309: The disadvantage is that such scanners often hang
1.1 deraadt 2310: or crash if their input contains an 8-bit character.
1.16 jmc 2311: .Pp
2312: Note, however, that unless generating a scanner using the
2313: .Fl Cf
1.1 deraadt 2314: or
1.16 jmc 2315: .Fl CF
1.1 deraadt 2316: table compression options, use of
1.16 jmc 2317: .Fl 7
2318: will save only a small amount of table space,
2319: and make the scanner considerably less portable.
2320: .Nm flex Ns 's
2321: default behavior is to generate an 8-bit scanner unless
2322: .Fl Cf
2323: or
2324: .Fl CF
2325: is specified, in which case
2326: .Nm
2327: defaults to generating 7-bit scanners unless it was
2328: configured to generate 8-bit scanners
2329: (as will often be the case with non-USA sites).
2330: It is possible to tell whether
2331: .Nm
2332: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2333: .Fl v
2334: output as described below.
2335: .Pp
2336: Note that if
2337: .Fl Cfe
2338: or
2339: .Fl CFe
2340: is used
2341: (the table compression options, but also using equivalence classes as
2342: discussed below),
2343: .Nm
2344: still defaults to generating an 8-bit scanner,
2345: since usually with these compression options full 8-bit tables
1.1 deraadt 2346: are not much more expensive than 7-bit tables.
1.16 jmc 2347: .It Fl 8
2348: Instructs
2349: .Nm
1.1 deraadt 2350: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 jmc 2351: characters.
2352: This flag is only needed for scanners generated using
2353: .Fl Cf
1.1 deraadt 2354: or
1.16 jmc 2355: .Fl CF ,
2356: as otherwise
2357: .Nm
2358: defaults to generating an 8-bit scanner anyway.
2359: .Pp
1.1 deraadt 2360: See the discussion of
1.16 jmc 2361: .Fl 7
2362: above for
2363: .Nm flex Ns 's
2364: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2365: .It Fl B
2366: Instructs
2367: .Nm
2368: to generate a
2369: .Em batch
2370: scanner, the opposite of
2371: .Em interactive
2372: scanners generated by
2373: .Fl I
2374: .Pq see below .
2375: In general,
2376: .Fl B
2377: is used when the scanner will never be used interactively,
2378: and you want to squeeze a little more performance out of it.
2379: If the aim is instead to squeeze out a lot more performance,
2380: use the
2381: .Fl Cf
2382: or
2383: .Fl CF
2384: options
2385: .Pq discussed below ,
2386: which turn on
2387: .Fl B
2388: automatically anyway.
2389: .It Fl b
2390: Generate backing-up information to
2391: .Pa lex.backup .
2392: This is a list of scanner states which require backing up
2393: and the input characters on which they do so.
2394: By adding rules one can remove backing-up states.
2395: If all backing-up states are eliminated and
2396: .Fl Cf
2397: or
2398: .Fl CF
2399: is used, the generated scanner will run faster (see the
2400: .Fl p
2401: flag).
2402: Only users who wish to squeeze every last cycle out of their
2403: scanners need worry about this option.
2404: (See the section on
2405: .Sx PERFORMANCE CONSIDERATIONS
2406: below.)
2407: .It Fl C Ns Op Cm aeFfmr
2408: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2409: between small scanners and fast scanners.
1.16 jmc 2410: .Bl -tag -width Ds
2411: .It Fl Ca
2412: Instructs
2413: .Nm
2414: to trade off larger tables in the generated scanner for faster performance
2415: because the elements of the tables are better aligned for memory access
2416: and computation.
2417: On some
2418: .Tn RISC
2419: architectures, fetching and manipulating longwords is more efficient
2420: than with smaller-sized units such as shortwords.
2421: This option can double the size of the tables used by the scanner.
2422: .It Fl Ce
2423: Directs
2424: .Nm
1.1 deraadt 2425: to construct
1.16 jmc 2426: .Em equivalence classes ,
2427: i.e., sets of characters which have identical lexical properties
2428: (for example, if the only appearance of digits in the
2429: .Nm
1.1 deraadt 2430: input is in the character class
1.16 jmc 2431: .Qq [0-9]
2432: then the digits
2433: .Sq 0 ,
2434: .Sq 1 ,
2435: .Sq ... ,
2436: .Sq 9
2437: will all be put in the same equivalence class).
2438: Equivalence classes usually give dramatic reductions in the final
2439: table/object file sizes
2440: .Pq typically a factor of 2\-5
2441: and are pretty cheap performance-wise
2442: .Pq one array look-up per character scanned .
2443: .It Fl CF
2444: Specifies that the alternate fast scanner representation
2445: (described below under the
2446: .Fl F
2447: option)
2448: should be used.
2449: This option cannot be used with
2450: .Fl + .
2451: .It Fl Cf
2452: Specifies that the
2453: .Em full
2454: scanner tables should be generated \-
2455: .Nm
2456: should not compress the tables by taking advantage of
2457: similar transition functions for different states.
2458: .It Fl \&Cm
2459: Directs
2460: .Nm
1.1 deraadt 2461: to construct
1.16 jmc 2462: .Em meta-equivalence classes ,
2463: which are sets of equivalence classes
2464: (or characters, if equivalence classes are not being used)
2465: that are commonly used together.
2466: Meta-equivalence classes are often a big win when using compressed tables,
2467: but they have a moderate performance impact
2468: (one or two
2469: .Qq if
2470: tests and one array look-up per character scanned).
2471: .It Fl Cr
2472: Causes the generated scanner to
2473: .Em bypass
2474: use of the standard I/O library
2475: .Pq stdio
2476: for input.
2477: Instead of calling
2478: .Xr fread 3
1.1 deraadt 2479: or
1.16 jmc 2480: .Xr getc 3 ,
1.1 deraadt 2481: the scanner will use the
1.16 jmc 2482: .Xr read 2
2483: system call,
2484: resulting in a performance gain which varies from system to system,
2485: but in general is probably negligible unless
2486: .Fl Cf
1.1 deraadt 2487: or
1.16 jmc 2488: .Fl CF
2489: is being used.
1.1 deraadt 2490: Using
1.16 jmc 2491: .Fl Cr
2492: can cause strange behavior if, for example, the program reads from
2493: .Fa yyin
2494: using stdio prior to calling the scanner
2495: (because the scanner will miss whatever text previous reads left
2496: in the stdio input buffer).
2497: .Pp
2498: .Fl Cr
2499: has no effect if
2500: .Dv YY_INPUT
2501: is defined
2502: (see
2503: .Sx THE GENERATED SCANNER
2504: above).
2505: .El
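.Pp
The hazard noted for
.Fl Cr
above, text stranded in the stdio input buffer, can be reproduced without
.Nm .
The sketch below is an assumption-laden demonstration rather than scanner code:
the function
.Fn bytes_left_for_read
is hypothetical, and the result depends on how the C library buffers
regular files.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write "hello world\n" to a temporary file, consume one character
 * through stdio, then ask read(2) for the rest.  Returns the number of
 * bytes read(2) still sees, or -1 on error.
 */
static long
bytes_left_for_read(void)
{
	static const char text[] = "hello world\n";
	char buf[64];
	FILE *fp;
	long n;

	if ((fp = tmpfile()) == NULL)
		return -1;
	if (fputs(text, fp) == EOF || fflush(fp) == EOF) {
		fclose(fp);
		return -1;
	}
	rewind(fp);

	/* stdio usually fills its buffer with far more than one byte. */
	if (fgetc(fp) != 'h') {
		fclose(fp);
		return -1;
	}

	/* Bypass stdio, as a scanner built with -Cr would. */
	n = (long)read(fileno(fp), buf, sizeof(buf));
	assert(n <= (long)sizeof(buf));
	fclose(fp);
	return n;
}
```

With the usual buffering the function reports fewer than the 11 bytes that
logically remain after the first character, typically none at all,
because the single
.Xr fgetc 3
call already pulled the rest of the file into the stdio buffer.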
2506: .Pp
1.1 deraadt 2507: A lone
1.16 jmc 2508: .Fl C
1.1 deraadt 2509: specifies that the scanner tables should be compressed but neither
2510: equivalence classes nor meta-equivalence classes should be used.
1.16 jmc 2511: .Pp
1.1 deraadt 2512: The options
1.16 jmc 2513: .Fl Cf
1.1 deraadt 2514: or
1.16 jmc 2515: .Fl CF
1.1 deraadt 2516: and
1.16 jmc 2517: .Fl \&Cm
2518: do not make sense together \- there is no opportunity for meta-equivalence
2519: classes if the table is not being compressed.
2520: Otherwise the options may be freely mixed, and are cumulative.
2521: .Pp
1.1 deraadt 2522: The default setting is
1.16 jmc 2523: .Fl Cem
1.1 deraadt 2524: which specifies that
1.16 jmc 2525: .Nm
2526: should generate equivalence classes and meta-equivalence classes.
2527: This setting provides the highest degree of table compression.
2528: It is possible to obtain faster-executing scanners at the cost of
2529: larger tables, with the following generally being true:
2530: .Bd -unfilled -offset indent
2531: slowest & smallest
2532: -Cem
2533: -Cm
2534: -Ce
2535: -C
2536: -C{f,F}e
2537: -C{f,F}
2538: -C{f,F}a
2539: fastest & largest
2540: .Ed
2541: .Pp
1.1 deraadt 2542: Note that scanners with the smallest tables are usually generated and
1.16 jmc 2543: compiled the quickest,
2544: so during development the default,
2545: maximal compression, is usually best.
2546: .Pp
2547: .Fl Cfe
2548: is often a good compromise between speed and size for production scanners.
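.Pp
Why the equivalence classes built by
.Fl Ce
cost only one extra array look-up per character can be seen in a toy model:
a recognizer for
.Qq [0-9]+
whose transition table is indexed by class instead of by character.
The table layout and the names
.Fa ec ,
.Fa next_state
and
.Fn matches_digits
are hypothetical; the tables
.Nm
really generates are laid out differently.

```c
#include <assert.h>
#include <stddef.h>

enum { EC_OTHER = 0, EC_DIGIT = 1, NUM_CLASSES = 2 };
enum { S_START = 0, S_DIGITS = 1, S_DEAD = 2, NUM_STATES = 3 };

/* Maps each of the 256 byte values to its equivalence class. */
static unsigned char ec[256];

/* Transition table indexed by class: 3 x 2 entries instead of 3 x 256. */
static const unsigned char next_state[NUM_STATES][NUM_CLASSES] = {
	/* EC_OTHER  EC_DIGIT */
	{ S_DEAD,   S_DIGITS },	/* S_START  */
	{ S_DEAD,   S_DIGITS },	/* S_DIGITS */
	{ S_DEAD,   S_DEAD   },	/* S_DEAD   */
};

/* Put '0'..'9' in one class; every other byte stays in EC_OTHER. */
static void
init_classes(void)
{
	int c;

	for (c = '0'; c <= '9'; c++)
		ec[c] = EC_DIGIT;
}

/* Return 1 if s consists of one or more digits, 0 otherwise. */
static int
matches_digits(const char *s)
{
	int state = S_START;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		/* The extra look-up: character -> class -> next state. */
		state = next_state[state][ec[(unsigned char)*s]];
	}
	return state == S_DIGITS;
}
```

All ten digits share one column of the table, so adding further patterns
that treat digits alike would not widen it.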
2549: .It Fl c
2550: A do-nothing, deprecated option included for
2551: .Tn POSIX
2552: compliance.
2553: .It Fl d
2554: Makes the generated scanner run in debug mode.
2555: Whenever a pattern is recognized and the global
2556: .Fa yy_flex_debug
2557: is non-zero
2558: .Pq which is the default ,
2559: the scanner will write to stderr a line of the form:
2560: .Pp
2561: .D1 --accepting rule at line 53 ("the matched text")
2562: .Pp
2563: The line number refers to the location of the rule in the file
2564: defining the scanner
2565: (i.e., the file that was fed to
2566: .Nm ) .
2567: Messages are also generated when the scanner backs up,
2568: accepts the default rule,
2569: reaches the end of its input buffer
2570: (or encounters a NUL;
2571: at this point, the two look the same as far as the scanner's concerned),
2572: or reaches an end-of-file.
2573: .It Fl F
2574: Specifies that the fast scanner table representation should be used
2575: .Pq and stdio bypassed .
2576: This representation is about as fast as the full table representation
2577: .Pq Fl f ,
2578: and for some sets of patterns will be considerably smaller
2579: .Pq and for others, larger .
2580: In general, if the pattern set contains both
2581: .Qq keywords
2582: and a catch-all,
2583: .Qq identifier
2584: rule, such as in the set:
2585: .Bd -unfilled -offset indent
2586: "case" return TOK_CASE;
2587: "switch" return TOK_SWITCH;
2588: \&...
2589: "default" return TOK_DEFAULT;
2590: [a-z]+ return TOK_ID;
2591: .Ed
2592: .Pp
2593: then it's better to use the full table representation.
2594: If only the
2595: .Qq identifier
2596: rule is present and a hash table or some such is used to detect the keywords,
2597: it's better to use
2598: .Fl F .
2599: .Pp
2600: This option is equivalent to
2601: .Fl CFr
2602: .Pq see above .
2603: It cannot be used with
2604: .Fl + .
2605: .It Fl f
2606: Specifies
2607: .Em fast scanner .
2608: No table compression is done and stdio is bypassed.
2609: The result is large but fast.
2610: This option is equivalent to
2611: .Fl Cfr
2612: .Pq see above .
2613: .It Fl h
2614: Generates a help summary of
2615: .Nm flex Ns 's
2616: options to stdout and then exits.
2617: .Fl ?\&
2618: and
2619: .Fl Fl help
2620: are synonyms for
2621: .Fl h .
2622: .It Fl I
2623: Instructs
2624: .Nm
2625: to generate an
2626: .Em interactive
2627: scanner.
2628: An interactive scanner is one that only looks ahead to decide
2629: what token has been matched if it absolutely must.
2630: It turns out that always looking one extra character ahead,
2631: even if the scanner has already seen enough text
2632: to disambiguate the current token, is a bit faster than
2633: only looking ahead when necessary.
2634: But scanners that always look ahead give dreadful interactive performance;
2635: for example, when a user types a newline,
2636: it is not recognized as a newline token until they enter
2637: .Em another
2638: token, which often means typing in another whole line.
2639: .Pp
2640: .Nm
2641: scanners default to
2642: .Em interactive
2643: unless
2644: .Fl Cf
2645: or
2646: .Fl CF
2647: table-compression options are specified
2648: .Pq see above .
2649: That's because if high performance is most important,
2650: one of these options should be used,
2651: so if neither was specified,
2652: .Nm
2653: assumes it is preferable to trade off a bit of run-time performance for
2654: intuitive interactive behavior.
2655: Note also that
2656: .Fl I
2657: cannot be used in conjunction with
2658: .Fl Cf
2659: or
2660: .Fl CF .
2661: Thus, this option is not really needed; it is on by default for all those
2662: cases in which it is allowed.
2663: .Pp
2664: A scanner can be forced to not be interactive by using
2665: .Fl B
2666: .Pq see above .
2667: .It Fl i
2668: Instructs
2669: .Nm
2670: to generate a case-insensitive scanner.
2671: The case of letters given in the
2672: .Nm
2673: input patterns will be ignored,
2674: and tokens in the input will be matched regardless of case.
2675: The matched text given in
2676: .Fa yytext
2677: will have its case preserved
2678: .Pq i.e., it will not be folded .
2679: .It Fl L
2680: Instructs
2681: .Nm
2682: not to generate
2683: .Dq #line
2684: directives.
2685: Without this option,
2686: .Nm
2687: peppers the generated scanner with #line directives so error messages
2688: in the actions will be correctly located with respect to either the original
2689: .Nm
2690: input file
2691: (if the errors are due to code in the input file),
2692: or
2693: .Pa lex.yy.c
2694: (if the errors are
2695: .Nm flex Ns 's
2696: fault \- these sorts of errors should be reported to the email address
2697: given below).
2698: .It Fl l
2699: Turns on maximum compatibility with the original AT&T
2700: .Nm lex
2701: implementation.
2702: Note that this does not mean full compatibility.
2703: Use of this option costs a considerable amount of performance,
2704: and it cannot be used with the
2705: .Fl + , f , F , Cf ,
2706: or
2707: .Fl CF
2708: options.
2709: For details on the compatibilities it provides, see the section
2710: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
2711: below.
2712: This option also results in the name
2713: .Dv YY_FLEX_LEX_COMPAT
2714: being #define'd in the generated scanner.
2715: .It Fl n
2716: Another do-nothing, deprecated option included only for
2717: .Tn POSIX
2718: compliance.
2719: .It Fl o Ns Ar output
2720: Directs
2721: .Nm
2722: to write the scanner to the file
2723: .Ar output
1.1 deraadt 2724: instead of
1.16 jmc 2725: .Pa lex.yy.c .
2726: If
2727: .Fl o
2728: is combined with the
2729: .Fl t
2730: option, then the scanner is written to stdout but its
2731: .Dq #line
2732: directives
2733: (see the
2734: .Fl L
2735: option above)
2736: refer to the file
2737: .Ar output .
2738: .It Fl P Ns Ar prefix
2739: Changes the default
2740: .Qq yy
1.1 deraadt 2741: prefix used by
1.16 jmc 2742: .Nm
1.6 aaron 2743: for all globally visible variable and function names to instead be
1.16 jmc 2744: .Ar prefix .
1.1 deraadt 2745: For example,
1.16 jmc 2746: .Fl P Ns Ar foo
1.1 deraadt 2747: changes the name of
1.16 jmc 2748: .Fa yytext
1.1 deraadt 2749: to
1.16 jmc 2750: .Fa footext .
1.1 deraadt 2751: It also changes the name of the default output file from
1.16 jmc 2752: .Pa lex.yy.c
1.1 deraadt 2753: to
1.16 jmc 2754: .Pa lex.foo.c .
1.1 deraadt 2755: Here are all of the names affected:
1.16 jmc 2756: .Bd -unfilled -offset indent
2757: yy_create_buffer
2758: yy_delete_buffer
2759: yy_flex_debug
2760: yy_init_buffer
2761: yy_flush_buffer
2762: yy_load_buffer_state
2763: yy_switch_to_buffer
2764: yyin
2765: yyleng
2766: yylex
2767: yylineno
2768: yyout
2769: yyrestart
2770: yytext
2771: yywrap
2772: .Ed
2773: .Pp
2774: (If using a C++ scanner, then only
2775: .Fa yywrap
1.1 deraadt 2776: and
1.16 jmc 2777: .Fa yyFlexLexer
1.1 deraadt 2778: are affected.)
1.16 jmc 2779: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2780: and functions using either version of their name; but externally, they
2781: have the modified name.
1.16 jmc 2782: .Pp
2783: This option allows multiple
2784: .Nm
2785: programs to be easily linked together into the same executable.
2786: Note, though, that using this option also renames
2787: .Fn yywrap ,
2788: so now either an
2789: .Pq appropriately named
2790: version of the routine for the scanner must be supplied, or
2791: .Dq %option noyywrap
2792: must be used, as linking with
2793: .Fl lfl
2794: no longer provides one by default.
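.Pp
How a name can be usable under both spellings inside the scanner,
yet exist only under the prefixed spelling for the rest of the program,
can be shown with a preprocessor sketch.
It assumes the renaming is done with
.Dq #define
lines at the top of the generated file, which is how scanners built with
.Fl P
are understood to work; the prefix
.Qq foo
and the functions
.Fn set_text
and
.Fn current_token
are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What a scanner built with -Pfoo is assumed to start with. */
#define yytext footext
#define yyleng fooleng

/* "Scanner" code keeps using the yy names ... */
char *yytext = "";
int yyleng = 0;

static void
set_text(char *s)
{
	assert(s != NULL);
	yytext = s;
	yyleng = (int)strlen(s);
}

#undef yytext
#undef yyleng

/* ... but the symbols that actually exist are footext and fooleng. */
static const char *
current_token(void)
{
	return footext;
}
```

Because only
.Fa footext
and
.Fa fooleng
reach the linker, a second scanner built with a different prefix can be
linked into the same executable without a clash.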
2795: .It Fl p
2796: Generates a performance report to stderr.
2797: The report consists of comments regarding features of the
2798: .Nm
2799: input file which will cause a serious loss of performance in the resulting
2800: scanner.
2801: If the flag is specified twice,
2802: comments regarding features that lead to minor performance losses
2803: will also be reported.
2804: .Pp
2805: Note that the use of
2806: .Em REJECT ,
2807: .Dq %option yylineno ,
2808: and variable trailing context
2809: (see the
2810: .Sx BUGS
2811: section below)
2812: entails a substantial performance penalty; use of
2813: .Fn yymore ,
2814: the
2815: .Sq ^
2816: operator, and the
2817: .Fl I
2818: flag entail minor performance penalties.
2819: .It Fl S Ns Ar skeleton
2820: Overrides the default skeleton file from which
2821: .Nm
2822: constructs its scanners.
2823: This option is needed only for
2824: .Nm
1.1 deraadt 2825: maintenance or development.
1.16 jmc 2826: .It Fl s
2827: Causes the default rule
2828: .Pq that unmatched scanner input is echoed to stdout
2829: to be suppressed.
2830: If the scanner encounters input that does not
2831: match any of its rules, it aborts with an error.
2832: This option is useful for finding holes in a scanner's rule set.
2833: .It Fl T
2834: Makes
2835: .Nm
2836: run in
2837: .Em trace
2838: mode.
2839: It will generate a lot of messages to stderr concerning
2840: the form of the input and the resultant non-deterministic and deterministic
2841: finite automata.
2842: This option is mostly for use in maintaining
2843: .Nm .
2844: .It Fl t
2845: Instructs
2846: .Nm
2847: to write the scanner it generates to standard output instead of
2848: .Pa lex.yy.c .
2849: .It Fl V
2850: Prints the version number to stdout and exits.
2851: .Fl Fl version
2852: is a synonym for
2853: .Fl V .
2854: .It Fl v
2855: Specifies that
2856: .Nm
2857: should write to stderr
2858: a summary of statistics regarding the scanner it generates.
2859: Most of the statistics are meaningless to the casual
2860: .Nm
2861: user, but the first line identifies the version of
2862: .Nm
2863: (same as reported by
2864: .Fl V ) ,
2865: and the next line the flags used when generating the scanner,
2866: including those that are on by default.
2867: .It Fl w
2868: Suppresses warning messages.
2869: .It Fl +
2870: Specifies that
2871: .Nm
2872: should generate a C++ scanner class.
2873: See the section on
2874: .Sx GENERATING C++ SCANNERS
2875: below for details.
2876: .El
2877: .Pp
2878: .Nm
1.1 deraadt 2879: also provides a mechanism for controlling options within the
1.16 jmc 2880: scanner specification itself, rather than from the
2881: .Nm
2882: command-line.
1.1 deraadt 2883: This is done by including
1.16 jmc 2884: .Dq %option
1.1 deraadt 2885: directives in the first section of the scanner specification.
1.16 jmc 2886: Multiple options can be specified with a single
2887: .Dq %option
2888: directive, and multiple directives may appear in the first section of the
2889: .Nm
2890: input file.
2891: .Pp
2892: Most options are given simply as names, optionally preceded by the word
2893: .Qq no
2894: .Pq with no intervening whitespace
2895: to negate their meaning.
2896: A number are equivalent to
2897: .Nm
2898: flags or their negation:
2899: .Bd -unfilled -offset indent
2900: 7bit -7 option
2901: 8bit -8 option
2902: align -Ca option
2903: backup -b option
2904: batch -B option
2905: c++ -+ option
2906:
2907: caseful or
2908: case-sensitive opposite of -i (default)
2909:
2910: case-insensitive or
2911: caseless -i option
2912:
2913: debug -d option
2914: default opposite of -s option
2915: ecs -Ce option
2916: fast -F option
2917: full -f option
2918: interactive -I option
2919: lex-compat -l option
2920: meta-ecs -Cm option
2921: perf-report -p option
2922: read -Cr option
2923: stdout -t option
2924: verbose -v option
2925: warn opposite of -w option
2926: (use "%option nowarn" for -w)
2927:
2928: array equivalent to "%array"
2929: pointer equivalent to "%pointer" (default)
2930: .Ed
2931: .Pp
2932: Some %options provide features otherwise not available:
2933: .Bl -tag -width Ds
2934: .It always-interactive
2935: Instructs
2936: .Nm
2937: to generate a scanner which always considers its input
2938: .Qq interactive .
2939: Normally, on each new input file the scanner calls
2940: .Fn isatty
2941: in an attempt to determine whether the scanner's input source is interactive
2942: and thus should be read a character at a time.
2943: When this option is used, however, no such call is made.
2944: .It main
2945: Directs
2946: .Nm
2947: to provide a default
2948: .Fn main
1.1 deraadt 2949: program for the scanner, which simply calls
1.16 jmc 2950: .Fn yylex .
1.1 deraadt 2951: This option implies
1.16 jmc 2952: .Dq noyywrap
2953: .Pq see below .
2954: .It never-interactive
2955: Instructs
2956: .Nm
2957: to generate a scanner which never considers its input
2958: .Qq interactive
2959: (again, no call made to
2960: .Fn isatty ) .
1.1 deraadt 2961: This is the opposite of
1.16 jmc 2962: .Dq always-interactive .
2963: .It stack
2964: Enables the use of start condition stacks
2965: (see
2966: .Sx START CONDITIONS
2967: above).
2968: .It stdinit
2969: If set (i.e.,
2970: .Dq %option stdinit ) ,
1.1 deraadt 2971: initializes
1.16 jmc 2972: .Fa yyin
1.1 deraadt 2973: and
1.16 jmc 2974: .Fa yyout
2975: to stdin and stdout, instead of the default of
2976: .Dq nil .
1.1 deraadt 2977: Some existing
1.16 jmc 2978: .Nm lex
2979: programs depend on this behavior, even though it is not compliant with ANSI C,
2980: which does not require stdin and stdout to be compile-time constant.
2981: .It yylineno
2982: Directs
2983: .Nm
1.1 deraadt 2984: to generate a scanner that maintains the number of the current line
2985: read from its input in the global variable
1.16 jmc 2986: .Fa yylineno .
1.1 deraadt 2987: This option is implied by
1.16 jmc 2988: .Dq %option lex-compat .
2989: .It yywrap
2990: If unset (i.e.,
2991: .Dq %option noyywrap ) ,
1.1 deraadt 2992: makes the scanner not call
1.16 jmc 2993: .Fn yywrap
2994: upon an end-of-file, but simply assume that there are no more files to scan
2995: (until the user points
2996: .Fa yyin
1.1 deraadt 2997: at a new file and calls
1.16 jmc 2998: .Fn yylex
1.1 deraadt 2999: again).
1.16 jmc 3000: .El
3001: .Pp
3002: .Nm
3003: scans rule actions to determine whether the
3004: .Em REJECT
3005: or
3006: .Fn yymore
3007: features are being used.
3008: The
3009: .Dq reject
1.1 deraadt 3010: and
1.16 jmc 3011: .Dq yymore
3012: options are available to override its decision as to whether to use the
1.1 deraadt 3013: options, either by setting them (e.g.,
1.16 jmc 3014: .Dq %option reject )
3015: to indicate the feature is indeed used,
3016: or unsetting them to indicate it actually is not used
1.1 deraadt 3017: (e.g.,
1.16 jmc 3018: .Dq %option noyymore ) .
3019: .Pp
3020: Three options take string-delimited values, offset with
3021: .Sq = :
3022: .Pp
3023: .D1 %option outfile="ABC"
3024: .Pp
1.1 deraadt 3025: is equivalent to
1.16 jmc 3026: .Fl o Ns Ar ABC ,
1.1 deraadt 3027: and
1.16 jmc 3028: .Pp
3029: .D1 %option prefix="XYZ"
3030: .Pp
1.1 deraadt 3031: is equivalent to
1.16 jmc 3032: .Fl P Ns Ar XYZ .
1.1 deraadt 3033: Finally,
1.16 jmc 3034: .Pp
3035: .D1 %option yyclass="foo"
3036: .Pp
3037: only applies when generating a C++ scanner
3038: .Pf ( Fl +
3039: option).
3040: It informs
3041: .Nm
3042: that
3043: .Dq foo
3044: has been derived as a subclass of yyFlexLexer, so
3045: .Nm
3046: will place actions in the member function
3047: .Dq foo::yylex()
1.1 deraadt 3048: instead of
1.16 jmc 3049: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3050: It also generates a
1.16 jmc 3051: .Dq yyFlexLexer::yylex()
1.1 deraadt 3052: member function that emits a run-time error (by invoking
1.16 jmc 3053: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3054: if called.
1.16 jmc 3055: See
3056: .Sx GENERATING C++ SCANNERS ,
3057: below, for additional information.
3058: .Pp
3059: A number of options are available for
3060: .Xr lint 1
3061: purists who want to suppress the appearance of unneeded routines
3062: in the generated scanner.
3063: Each of the following, if unset
1.1 deraadt 3064: (e.g.,
1.16 jmc 3065: .Dq %option nounput ) ,
3066: results in the corresponding routine not appearing in the generated scanner:
3067: .Bd -unfilled -offset indent
3068: input, unput
3069: yy_push_state, yy_pop_state, yy_top_state
3070: yy_scan_buffer, yy_scan_bytes, yy_scan_string
3071: .Ed
3072: .Pp
1.1 deraadt 3073: (though
1.16 jmc 3074: .Fn yy_push_state
3075: and friends won't appear anyway unless
3076: .Dq %option stack
3077: is being used).
3078: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3079: The main design goal of
1.16 jmc 3080: .Nm
3081: is that it generate high-performance scanners.
3082: It has been optimized for dealing well with large sets of rules.
3083: Aside from the effects on scanner speed of the table compression
3084: .Fl C
1.1 deraadt 3085: options outlined above,
1.16 jmc 3086: there are a number of options/actions which degrade performance.
3087: These are, from most expensive to least:
3088: .Bd -unfilled -offset indent
3089: REJECT
3090: %option yylineno
3091: arbitrary trailing context
3092:
3093: pattern sets that require backing up
3094: %array
3095: %option interactive
3096: %option always-interactive
3097:
3098: \&'^' beginning-of-line operator
3099: yymore()
3100: .Ed
3101: .Pp
3102: with the first three all being quite expensive
3103: and the last two being quite cheap.
3104: Note also that
3105: .Fn unput
3106: is implemented as a routine call that potentially does quite a bit of work,
3107: while
3108: .Fn yyless
3109: is a quite-cheap macro; so if just putting back some excess text,
3110: use
3111: .Fn yyless .
3112: .Pp
3113: .Em REJECT
1.1 deraadt 3114: should be avoided at all costs when performance is important.
3115: It is a particularly expensive option.
1.16 jmc 3116: .Pp
1.1 deraadt 3117: Getting rid of backing up is messy and often may be an enormous
1.16 jmc 3118: amount of work for a complicated scanner.
3119: In principle, one begins by using the
3120: .Fl b
1.1 deraadt 3121: flag to generate a
1.16 jmc 3122: .Pa lex.backup
3123: file.
3124: For example, on the input
3125: .Bd -literal -offset indent
3126: %%
3127: foo return TOK_KEYWORD;
3128: foobar return TOK_KEYWORD;
3129: .Ed
3130: .Pp
1.1 deraadt 3131: the file looks like:
1.16 jmc 3132: .Bd -literal -offset indent
3133: State #6 is non-accepting -
3134: associated rule line numbers:
3135: 2 3
3136: out-transitions: [ o ]
3137: jam-transitions: EOF [ \e001-n p-\e177 ]
3138:
3139: State #8 is non-accepting -
3140: associated rule line numbers:
3141: 3
3142: out-transitions: [ a ]
3143: jam-transitions: EOF [ \e001-` b-\e177 ]
3144:
3145: State #9 is non-accepting -
3146: associated rule line numbers:
3147: 3
3148: out-transitions: [ r ]
3149: jam-transitions: EOF [ \e001-q s-\e177 ]
3150:
3151: Compressed tables always back up.
3152: .Ed
3153: .Pp
1.1 deraadt 3154: The first few lines tell us that there's a scanner state in
1.16 jmc 3155: which it can make a transition on an
3156: .Sq o
3157: but not on any other character,
3158: and that in that state the currently scanned text does not match any rule.
3159: The state occurs when trying to match the rules found
1.1 deraadt 3160: at lines 2 and 3 in the input file.
1.16 jmc 3161: If the scanner is in that state and then reads something other than an
3162: .Sq o ,
3163: it will have to back up to find a rule which is matched.
3164: With a bit of headscratching one can see that this must be the
3165: state it's in when it has seen
3166: .Sq fo .
3167: When this has happened, if anything other than another
3168: .Sq o
3169: is seen, the scanner will have to back up to simply match the
3170: .Sq f
3171: .Pq by the default rule .
3172: .Pp
3173: The comment regarding State #8 indicates there's a problem when
3174: .Qq foob
3175: has been scanned.
3176: Indeed, on any character other than an
3177: .Sq a ,
3178: the scanner will have to back up to accept
3179: .Qq foo .
3180: Similarly, the comment for State #9 concerns when
3181: .Qq fooba
3182: has been scanned and an
3183: .Sq r
3184: does not follow.
3185: .Pp
1.1 deraadt 3186: The final comment reminds us that there's no point going to
1.16 jmc 3187: all the trouble of removing backing up from the rules unless we're using
3188: .Fl Cf
1.1 deraadt 3189: or
1.16 jmc 3190: .Fl CF ,
1.1 deraadt 3191: since there's no performance gain doing so with compressed scanners.
1.16 jmc 3192: .Pp
3193: The way to remove the backing up is to add
3194: .Qq error
3195: rules:
3196: .Bd -literal -offset indent
3197: %%
3198: foo return TOK_KEYWORD;
3199: foobar return TOK_KEYWORD;
3200:
3201: fooba |
3202: foob |
3203: fo {
3204: /* false alarm, not really a keyword */
3205: return TOK_ID;
3206: }
3207: .Ed
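.Pp
The effect of the error rules can be pictured with a small model of
longest-match scanning.
The following C++ sketch is not code that
.Nm
generates; the names
.Fn scan ,
.Fn is_prefix ,
.Fn is_rule
and
.Fn backed_up
are hypothetical, and rules are limited to literal strings.
It records how many characters were read before the scanner jammed
and how many were finally accepted;
the difference is the text that had to be backed up over.
.Bd -literal -offset indent
```cpp
#include <cstddef>
#include <string>
#include <vector>

// Outcome of one longest-match attempt at the start of the input.
struct Match {
    std::size_t scanned;   // characters read along valid transitions
    std::size_t accepted;  // length of the token finally matched
};

// True if some rule begins with the text seen so far.
static bool is_prefix(const std::vector<std::string>& rules,
                      const std::string& seen)
{
    for (const std::string& r : rules)
        if (r.compare(0, seen.size(), seen) == 0)
            return true;
    return false;
}

// True if the text seen so far is exactly one of the rules.
static bool is_rule(const std::vector<std::string>& rules,
                    const std::string& seen)
{
    for (const std::string& r : rules)
        if (r == seen)
            return true;
    return false;
}

// Scan one token; literal-string rules only (an assumption of this sketch).
Match scan(const std::vector<std::string>& rules, const std::string& input)
{
    Match m = {0, 0};
    std::string seen;
    for (char c : input) {
        seen += c;
        // The default rule matches any single character, so a
        // one-character prefix is always a live, accepting state.
        const bool by_default = (seen.size() == 1);
        if (!by_default && !is_prefix(rules, seen))
            break;                      // jam: no transition on c
        m.scanned = seen.size();
        if (by_default || is_rule(rules, seen))
            m.accepted = seen.size();
    }
    return m;
}

// Characters the scanner must back up over after jamming.
std::size_t backed_up(const Match& m)
{
    return m.scanned - m.accepted;
}
```
.Ed
.Pp
Under these assumptions, scanning the input foobaz with only the rules
foo and foobar reads five characters but accepts three,
so two are backed up over.
With the error rules fooba, foob and fo added,
every state reached is accepting and nothing is backed up over.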
3208: .Pp
3209: Eliminating backing up among a list of keywords can also be done using a
3210: .Qq catch-all
3211: rule:
3212: .Bd -literal -offset indent
3213: %%
3214: foo return TOK_KEYWORD;
3215: foobar return TOK_KEYWORD;
3216:
3217: [a-z]+ return TOK_ID;
3218: .Ed
3219: .Pp
1.1 deraadt 3220: This is usually the best solution when appropriate.
1.16 jmc 3221: .Pp
1.1 deraadt 3222: Backing up messages tend to cascade.
1.16 jmc 3223: With a complicated set of rules it's not uncommon to get hundreds of messages.
3224: If one can decipher them, though,
3225: it often only takes a dozen or so rules to eliminate the backing up
3226: (though it's easy to make a mistake and have an error rule accidentally match
3227: a valid token; a possible future
3228: .Nm
1.1 deraadt 3229: feature will be to automatically add rules to eliminate backing up).
1.16 jmc 3230: .Pp
3231: It's important to keep in mind that the benefits of eliminating
3232: backing up are gained only if
3233: .Em every
3234: instance of backing up is eliminated.
3235: Leaving just one gains nothing.
3236: .Pp
3237: .Em Variable
3238: trailing context
3239: (where both the leading and trailing parts do not have a fixed length)
3240: entails almost the same performance loss as
3241: .Em REJECT
3242: .Pq i.e., substantial .
3243: So when possible a rule like:
3244: .Bd -literal -offset indent
3245: %%
3246: mouse|rat/(cat|dog) run();
3247: .Ed
3248: .Pp
1.1 deraadt 3249: is better written:
1.16 jmc 3250: .Bd -literal -offset indent
3251: %%
3252: mouse/cat|dog run();
3253: rat/cat|dog run();
3254: .Ed
3255: .Pp
1.1 deraadt 3256: or as
1.16 jmc 3257: .Bd -literal -offset indent
3258: %%
3259: mouse|rat/cat run();
3260: mouse|rat/dog run();
3261: .Ed
3262: .Pp
3263: Note that here the special
3264: .Sq |\&
3265: action does not provide any savings, and can even make things worse (see
3266: .Sx BUGS
3267: below).
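.Pp
That the rewritten rules match the same text can be checked outside of
.Nm .
The C++ sketch below uses the std::regex lookahead construct, (?=...),
as a stand-in for trailing context; that substitution, and the helper names
.Fn token_len ,
.Fn combined_rule
and
.Fn split_rules ,
are assumptions of the sketch.
It shows only that the split rules accept the same tokens, not the speed
difference, which is specific to the scanners
.Nm
generates.
.Bd -literal -offset indent
```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>

// Length of the token matched at the very start of text, or 0 if none.
// The lookahead group stands in for the r/s trailing context operator.
static std::size_t token_len(const std::string& pattern,
                             const std::string& text)
{
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_search(text, m, re,
                          std::regex_constants::match_continuous))
        return static_cast<std::size_t>(m.length(0));
    return 0;
}

// One rule whose leading part has no fixed length (5 or 3 characters).
std::size_t combined_rule(const std::string& text)
{
    return token_len("(mouse|rat)(?=cat|dog)", text);
}

// Two rules, each with a fixed-length leading part; longest match wins.
std::size_t split_rules(const std::string& text)
{
    const char* const rules[] = { "mouse(?=cat|dog)", "rat(?=cat|dog)" };
    std::size_t best = 0;
    for (const char* r : rules)
        best = std::max(best, token_len(r, text));
    return best;
}
```
.Ed
.Pp
For any input, the two functions are expected to return the same token
length, for example 5 for mousecat, 3 for ratdog and 0 for mousetrap.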
3268: .Pp
1.1 deraadt 3269: Another area where the user can increase a scanner's performance
1.16 jmc 3270: .Pq and one that's easier to implement
3271: arises from the fact that the longer the tokens matched,
3272: the faster the scanner will run.
1.1 deraadt 3273: This is because with long tokens the processing of most input
1.16 jmc 3274: characters takes place in the
3275: .Pq short
3276: inner scanning loop, and does not often have to go through the additional work
3277: of setting up the scanning environment (e.g.,
3278: .Fa yytext )
3279: for the action.
3280: Recall the scanner for C comments:
3281: .Bd -literal -offset indent
3282: %x comment
3283: %%
3284: int line_num = 1;
3285:
3286: "/*" BEGIN(comment);
3287:
3288: <comment>[^*\en]*
3289: <comment>"*"+[^*/\en]*
3290: <comment>\en ++line_num;
3291: <comment>"*"+"/" BEGIN(INITIAL);
3292: .Ed
3293: .Pp
1.1 deraadt 3294: This could be sped up by writing it as:
1.16 jmc 3295: .Bd -literal -offset indent
3296: %x comment
3297: %%
3298: int line_num = 1;
3299:
3300: "/*" BEGIN(comment);
3301:
3302: <comment>[^*\en]*
3303: <comment>[^*\en]*\en ++line_num;
3304: <comment>"*"+[^*/\en]*
3305: <comment>"*"+[^*/\en]*\en ++line_num;
3306: <comment>"*"+"/" BEGIN(INITIAL);
3307: .Ed
3308: .Pp
3309: Now instead of each newline requiring the processing of another action,
3310: recognizing the newlines is
3311: .Qq distributed
3312: over the other rules to keep the matched text as long as possible.
3313: Note that adding rules does
3314: .Em not
3315: slow down the scanner!
3316: The speed of the scanner is independent of the number of rules or
3317: (modulo the considerations given at the beginning of this section)
3318: how complicated the rules are with regard to operators such as
3319: .Sq *
3320: and
3321: .Sq |\& .
3322: .Pp
3323: A final example in speeding up a scanner:
3324: scan through a file containing identifiers and keywords, one per line
3325: and with no other extraneous characters, and recognize all the keywords.
3326: A natural first approach is:
3327: .Bd -literal -offset indent
3328: %%
3329: asm |
3330: auto |
3331: break |
3332: \&... etc ...
3333: volatile |
3334: while /* it's a keyword */
3335:
3336: \&.|\en /* it's not a keyword */
3337: .Ed
3338: .Pp
1.1 deraadt 3339: To eliminate the back-tracking, introduce a catch-all rule:
1.16 jmc 3340: .Bd -literal -offset indent
3341: %%
3342: asm |
3343: auto |
3344: break |
3345: \&... etc ...
3346: volatile |
3347: while /* it's a keyword */
3348:
3349: [a-z]+ |
3350: \&.|\en /* it's not a keyword */
3351: .Ed
3352: .Pp
1.1 deraadt 3353: Now, if it's guaranteed that there's exactly one word per line,
3354: then we can reduce the total number of matches by a half by
1.16 jmc 3355: merging in the recognition of newlines with that of the other tokens:
3356: .Bd -literal -offset indent
3357: %%
3358: asm\en |
3359: auto\en |
3360: break\en |
3361: \&... etc ...
3362: volatile\en |
3363: while\en /* it's a keyword */
3364:
3365: [a-z]+\en |
3366: \&.|\en /* it's not a keyword */
3367: .Ed
3368: .Pp
3369: One has to be careful here,
3370: as we have now reintroduced backing up into the scanner.
3371: In particular, while we know that there will never be any characters
3372: in the input stream other than letters or newlines,
3373: .Nm
1.1 deraadt 3374: can't figure this out, and it will plan for possibly needing to back up
1.16 jmc 3375: when it has scanned a token like
3376: .Qq auto
3377: and then the next character is something other than a newline or a letter.
3378: Previously it would then just match the
3379: .Qq auto
3380: rule and be done, but now it has no
3381: .Qq auto
3382: rule, only an
3383: .Qq auto\en
3384: rule.
3385: To eliminate the possibility of backing up,
1.1 deraadt 3386: we could either duplicate all rules but without final newlines, or,
3387: since we never expect to encounter such an input and therefore don't
1.16 jmc 3388: care how it's classified, we can introduce one more catch-all rule,
3389: this one without a trailing newline:
3390: .Bd -literal -offset indent
3391: %%
3392: asm\en |
3393: auto\en |
3394: break\en |
3395: \&... etc ...
3396: volatile\en |
3397: while\en /* it's a keyword */
3398:
3399: [a-z]+\en |
3400: [a-z]+ |
3401: \&.|\en /* it's not a keyword */
3402: .Ed
3403: .Pp
1.1 deraadt 3404: Compiled with
1.16 jmc 3405: .Fl Cf ,
1.1 deraadt 3406: this is about as fast as one can get a
1.16 jmc 3407: .Nm
1.1 deraadt 3408: scanner to go for this particular problem.
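.Pp
The claim that merging the newline into each word rule halves the number
of matches can be checked with a small simulation.
The C++ sketch below is not a
.Nm
scanner; it only counts how many rule matches would be needed to consume
an input, and the names
.Fn count_matches
and
.Fn one_per_line
are hypothetical.
Keywords and identifiers are not distinguished, since both cost one match.
.Bd -literal -offset indent
```cpp
#include <cstddef>
#include <string>

// Newline written as a character code so the listing needs no escapes.
static const char NL = 10;

static bool is_lower(char c) { return c >= 'a' && c <= 'z'; }

// Count the rule matches needed to consume the input when words are
// matched by [a-z]+ and, if merge_newline is set, a newline directly
// after a word is folded into the same match.
std::size_t count_matches(const std::string& input, bool merge_newline)
{
    std::size_t matches = 0;
    std::size_t i = 0;
    while (i < input.size()) {
        if (is_lower(input[i])) {
            while (i < input.size() && is_lower(input[i]))
                ++i;
            if (merge_newline && i < input.size() && input[i] == NL)
                ++i;
        } else {
            ++i;    // single-character rule: one character per match
        }
        ++matches;
    }
    return matches;
}

// Build an input with exactly one word per line.
std::string one_per_line(const std::string& word, std::size_t lines)
{
    std::string out;
    for (std::size_t n = 0; n < lines; ++n) {
        out += word;
        out += NL;
    }
    return out;
}
```
.Ed
.Pp
For an input of four lines, the count drops from eight matches to four
once the newline is merged into the word rule.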
1.16 jmc 3409: .Pp
1.1 deraadt 3410: A final note:
1.16 jmc 3411: .Nm
3412: is slow when matching NUL's,
3413: particularly when a token contains multiple NUL's.
3414: It's best to write rules which match short
1.1 deraadt 3415: amounts of text if it's anticipated that the text will often include NUL's.
1.16 jmc 3416: .Pp
1.1 deraadt 3417: Another final note regarding performance: as mentioned above in the section
1.16 jmc 3418: .Sx HOW THE INPUT IS MATCHED ,
3419: dynamically resizing
3420: .Fa yytext
1.1 deraadt 3421: to accommodate huge tokens is a slow process because it presently requires that
1.16 jmc 3422: the
3423: .Pq huge
3424: token be rescanned from the beginning.
3425: Thus if performance is vital, it is better to attempt to match
3426: .Qq large
3427: quantities of text but not
3428: .Qq huge
3429: quantities, where the cutoff between the two is at about 8K characters/token.
3430: .Sh GENERATING C++ SCANNERS
3431: .Nm
3432: provides two different ways to generate scanners for use with C++.
3433: The first way is to simply compile a scanner generated by
3434: .Nm
3435: using a C++ compiler instead of a C compiler.
3436: This should not generate any compilation errors
3437: (please report any found to the email address given in the
3438: .Sx AUTHORS
3439: section below).
3440: C++ code can then be used in rule actions instead of C code.
3441: Note that the default input source for scanners remains
3442: .Fa yyin ,
1.1 deraadt 3443: and default echoing is still done to
1.16 jmc 3444: .Fa yyout .
1.1 deraadt 3445: Both of these remain
1.16 jmc 3446: .Fa FILE *
3447: variables and not C++ streams.
3448: .Pp
3449: .Nm
3450: can also be used to generate a C++ scanner class, using the
3451: .Fl +
1.1 deraadt 3452: option (or, equivalently,
1.16 jmc 3453: .Dq %option c++ ) ,
3454: which is automatically specified if the name of the flex executable ends in a
3455: .Sq + ,
3456: such as
3457: .Nm flex++ .
3458: When using this option,
3459: .Nm
3460: defaults to generating the scanner to the file
3461: .Pa lex.yy.cc
1.1 deraadt 3462: instead of
1.16 jmc 3463: .Pa lex.yy.c .
1.1 deraadt 3464: The generated scanner includes the header file
1.16 jmc 3465: .Aq Pa g++/FlexLexer.h ,
1.1 deraadt 3466: which defines the interface to two C++ classes.
1.16 jmc 3467: .Pp
1.1 deraadt 3468: The first class,
1.16 jmc 3469: .Em FlexLexer ,
3470: provides an abstract base class defining the general scanner class interface.
3471: It provides the following member functions:
3472: .Bl -tag -width Ds
3473: .It const char* YYText()
3474: Returns the text of the most recently matched token, the equivalent of
3475: .Fa yytext .
3476: .It int YYLeng()
3477: Returns the length of the most recently matched token, the equivalent of
3478: .Fa yyleng .
3479: .It int lineno() const
3480: Returns the current input line number
1.1 deraadt 3481: (see
1.16 jmc 3482: .Dq %option yylineno ) ,
3483: or 1 if
3484: .Dq %option yylineno
1.1 deraadt 3485: was not used.
1.16 jmc 3486: .It void set_debug(int flag)
3487: Sets the debugging flag for the scanner, equivalent to assigning to
3488: .Fa yy_flex_debug
3489: (see the
3490: .Sx OPTIONS
3491: section above).
3492: Note that the scanner must be built using
3493: .Dq %option debug
1.1 deraadt 3494: to include debugging information in it.
1.16 jmc 3495: .It int debug() const
3496: Returns the current setting of the debugging flag.
3497: .El
3498: .Pp
1.1 deraadt 3499: Also provided are member functions equivalent to
1.16 jmc 3500: .Fn yy_switch_to_buffer ,
3501: .Fn yy_create_buffer
1.1 deraadt 3502: (though the first argument is an
1.18 espie 3503: .Fa std::istream*
1.1 deraadt 3504: object pointer and not a
1.16 jmc 3505: .Fa FILE* ) ,
3506: .Fn yy_flush_buffer ,
3507: .Fn yy_delete_buffer ,
1.1 deraadt 3508: and
1.16 jmc 3509: .Fn yyrestart
1.10 deraadt 3510: (again, the first argument is an
1.18 espie 3511: .Fa std::istream*
1.1 deraadt 3512: object pointer).
1.16 jmc 3513: .Pp
1.1 deraadt 3514: The second class defined in
1.16 jmc 3515: .Aq Pa g++/FlexLexer.h
1.1 deraadt 3516: is
1.16 jmc 3517: .Fa yyFlexLexer ,
1.1 deraadt 3518: which is derived from
1.16 jmc 3519: .Fa FlexLexer .
1.1 deraadt 3520: It defines the following additional member functions:
1.16 jmc 3521: .Bl -tag -width Ds
1.18 espie 3522: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16 jmc 3523: Constructs a
3524: .Fa yyFlexLexer
3525: object using the given streams for input and output.
3526: If not specified, the streams default to
3527: .Fa cin
1.1 deraadt 3528: and
1.16 jmc 3529: .Fa cout ,
1.1 deraadt 3530: respectively.
1.16 jmc 3531: .It virtual int yylex()
3532: Performs the same role as
3533: .Fn yylex
1.1 deraadt 3534: does for ordinary flex scanners: it scans the input stream, consuming
1.16 jmc 3535: tokens, until a rule's action returns a value.
3536: If subclass
3537: .Sq S
3538: is derived from
3539: .Fa yyFlexLexer ,
3540: in order to access the member functions and variables of
3541: .Sq S
1.1 deraadt 3542: inside
1.16 jmc 3543: .Fn yylex ,
3544: use
3545: .Dq %option yyclass="S"
1.1 deraadt 3546: to inform
1.16 jmc 3547: .Nm
3548: that the
3549: .Sq S
3550: subclass will be used instead of
3551: .Fa yyFlexLexer .
1.1 deraadt 3552: In this case, rather than generating
1.16 jmc 3553: .Dq yyFlexLexer::yylex() ,
3554: .Nm
1.1 deraadt 3555: generates
1.16 jmc 3556: .Dq S::yylex()
1.1 deraadt 3557: (and also generates a dummy
1.16 jmc 3558: .Dq yyFlexLexer::yylex()
1.1 deraadt 3559: that calls
1.16 jmc 3560: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3561: if called).
1.18 espie 3562: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16 jmc 3563: Reassigns
3564: .Fa yyin
1.1 deraadt 3565: to
1.16 jmc 3566: .Fa new_in
3567: .Pq if non-nil
1.1 deraadt 3568: and
1.16 jmc 3569: .Fa yyout
1.1 deraadt 3570: to
1.16 jmc 3571: .Fa new_out
3572: .Pq ditto ,
3573: deleting the previous input buffer if
3574: .Fa yyin
1.1 deraadt 3575: is reassigned.
1.18 espie 3576: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16 jmc 3577: First switches the input streams via
3578: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3579: and then returns the value of
1.16 jmc 3580: .Fn yylex .
3581: .El
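.Pp
The arrangement produced by the yyclass option can be pictured with
stand-in classes.
The C++ sketch below does not include the real header;
.Fa SketchBase
is a hypothetical substitute for
.Fa yyFlexLexer ,
and its
.Fn LexerError
records the message instead of reporting it and exiting,
purely so that the behaviour can be observed.
.Bd -literal -offset indent
```cpp
#include <string>

// Stand-in for yyFlexLexer; NOT the class defined in FlexLexer.h.
class SketchBase {
public:
    virtual ~SketchBase() {}

    // Mirrors the dummy base-class yylex(): it only reports an error.
    virtual int yylex()
    {
        LexerError("yylex() called on base class");
        return 0;
    }

    const std::string& last_error() const { return error_; }

protected:
    // The real function is fatal; this stand-in just stores the text.
    virtual void LexerError(const char* msg) { error_ = msg; }

private:
    std::string error_;
};

// Plays the part of the user's subclass S named by the yyclass option.
class S : public SketchBase {
public:
    S() : tokens_seen_(0) {}

    // Where the rule actions would be placed; they can use S's members.
    int yylex() override { return ++tokens_seen_; }

    int tokens_seen() const { return tokens_seen_; }

private:
    int tokens_seen_;
};
```
.Ed
.Pp
Calling
.Fn yylex
through a pointer or reference to the base class still reaches the
version in
.Sq S ,
because the function is virtual;
only an object of the base class itself runs the error-reporting version.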
3582: .Pp
1.1 deraadt 3583: In addition,
1.16 jmc 3584: .Fa yyFlexLexer
3585: defines the following protected virtual functions which can be redefined
1.1 deraadt 3586: in derived classes to tailor the scanner:
1.16 jmc 3587: .Bl -tag -width Ds
3588: .It virtual int LexerInput(char* buf, int max_size)
3589: Reads up to
3590: .Fa max_size
1.1 deraadt 3591: characters into
1.16 jmc 3592: .Fa buf
3593: and returns the number of characters read.
3594: To indicate end-of-input, return 0 characters.
3595: Note that
3596: .Qq interactive
3597: scanners (see the
3598: .Fl B
1.1 deraadt 3599: and
1.16 jmc 3600: .Fl I
1.1 deraadt 3601: flags) define the macro
1.16 jmc 3602: .Dv YY_INTERACTIVE .
3603: If
3604: .Fn LexerInput
3605: has been redefined, and it's necessary to take different actions depending on
3606: whether or not the scanner might be scanning an interactive input source,
3607: it's possible to test for the presence of this name via
3608: .Dq #ifdef .
3609: .It virtual void LexerOutput(const char* buf, int size)
3610: Writes out
3611: .Fa size
1.1 deraadt 3612: characters from the buffer
1.16 jmc 3613: .Fa buf ,
3614: which, while NUL-terminated, may also contain
3615: .Qq internal
3616: NUL's if the scanner's rules can match text with NUL's in them.
3617: .It virtual void LexerError(const char* msg)
3618: Reports a fatal error message.
3619: The default version of this function writes the message to the stream
3620: .Fa cerr
1.1 deraadt 3621: and exits.
1.16 jmc 3622: .El
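.Pp
As an illustration of redefining
.Fn LexerInput ,
the C++ sketch below feeds a scanner from an in-memory string.
It is written against a stand-in base class rather than the real header:
.Fa SketchLexer
and its
.Fn refill
member are hypothetical, with
.Fn refill
playing the part of the scanner's internal request for more input.
Only the signature and the return convention of
.Fn LexerInput
are taken from the description above.
.Bd -literal -offset indent
```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for yyFlexLexer; NOT the class defined in FlexLexer.h.
class SketchLexer {
public:
    virtual ~SketchLexer() {}

    // Hypothetical hook standing in for the scanner's buffer refill.
    int refill(char* buf, int max_size)
    {
        return LexerInput(buf, max_size);
    }

protected:
    virtual int LexerInput(char* buf, int max_size) = 0;
};

// Supplies input from a string instead of a stream.
class StringLexer : public SketchLexer {
public:
    explicit StringLexer(const std::string& text) : text_(text), pos_(0) {}

protected:
    // Copies up to max_size characters and returns how many were copied;
    // returning 0 indicates end-of-input.
    int LexerInput(char* buf, int max_size) override
    {
        if (max_size <= 0)
            return 0;
        const std::size_t left = text_.size() - pos_;
        const std::size_t n =
            std::min(left, static_cast<std::size_t>(max_size));
        std::memcpy(buf, text_.data() + pos_, n);
        pos_ += n;
        return static_cast<int>(n);
    }

private:
    std::string text_;
    std::size_t pos_;
};
```
.Ed
.Pp
With a real scanner class the same override would be placed in a class
derived from
.Fa yyFlexLexer ;
the sketch makes no claim about how the generated scanner sizes its requests.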
3623: .Pp
1.1 deraadt 3624: Note that a
1.16 jmc 3625: .Fa yyFlexLexer
3626: object contains its entire scanning state.
3627: Thus such objects can be used to create reentrant scanners.
3628: Multiple instances of the same
3629: .Fa yyFlexLexer
3630: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3631: in the same program using the
1.16 jmc 3632: .Fl P
1.1 deraadt 3633: option discussed above.
1.16 jmc 3634: .Pp
1.1 deraadt 3635: Finally, note that the
1.16 jmc 3636: .Dq %array
3637: feature is not available to C++ scanner classes;
3638: .Dq %pointer
3639: must be used
3640: .Pq the default .
3641: .Pp
1.1 deraadt 3642: Here is an example of a simple C++ scanner:
1.16 jmc 3643: .Bd -literal -offset indent
3644: // An example of using the flex C++ scanner class.
1.1 deraadt 3645:
1.16 jmc 3646: %{
3647: #include <errno.h>
3648: int mylineno = 0;
3649: %}
1.1 deraadt 3650:
1.16 jmc 3651: string \e"[^\en"]+\e"
1.1 deraadt 3652:
1.16 jmc 3653: ws [ \et]+
1.1 deraadt 3654:
1.16 jmc 3655: alpha [A-Za-z]
3656: dig [0-9]
3657: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
3658: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
3659: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
3660: number {num1}|{num2}
1.1 deraadt 3661:
1.16 jmc 3662: %%
1.1 deraadt 3663:
1.16 jmc 3664: {ws} /* skip blanks and tabs */
1.1 deraadt 3665:
1.16 jmc 3666: "/*" {
3667: int c;
1.1 deraadt 3668:
1.16 jmc 3669: while ((c = yyinput()) != 0) {
3670: if(c == '\en')
1.1 deraadt 3671: ++mylineno;
1.16 jmc 3672: else if(c == '*') {
3673: if ((c = yyinput()) == '/')
1.1 deraadt 3674: break;
3675: else
3676: unput(c);
3677: }
1.16 jmc 3678: }
3679: }
1.1 deraadt 3680:
1.16 jmc 3681: {number} std::cout << "number " << YYText() << '\en';
1.1 deraadt 3682:
1.16 jmc 3683: \en mylineno++;
1.1 deraadt 3684:
1.16 jmc 3685: {name} std::cout << "name " << YYText() << '\en';
1.1 deraadt 3686:
1.16 jmc 3687: {string} std::cout << "string " << YYText() << '\en';
3688:
3689: %%
3690:
3691: int main(int /* argc */, char** /* argv */)
3692: {
3693: FlexLexer* lexer = new yyFlexLexer;
3694: while(lexer->yylex() != 0)
3695: ;
3696: return 0;
3697: }
3698: .Ed
3699: .Pp
3700: To create multiple
3701: .Pq different
3702: lexer classes, use the
3703: .Fl P
3704: flag
3705: (or the
3706: .Dq prefix=
3707: option)
3708: to rename each
3709: .Fa yyFlexLexer
1.1 deraadt 3710: to some other
1.16 jmc 3711: .Fa xxFlexLexer .
3712: .Aq Pa g++/FlexLexer.h
3713: can then be included in other sources once per lexer class, first renaming
3714: .Fa yyFlexLexer
1.1 deraadt 3715: as follows:
1.16 jmc 3716: .Bd -literal -offset indent
3717: #undef yyFlexLexer
3718: #define yyFlexLexer xxFlexLexer
3719: #include <g++/FlexLexer.h>
3720:
3721: #undef yyFlexLexer
3722: #define yyFlexLexer zzFlexLexer
3723: #include <g++/FlexLexer.h>
3724: .Ed
3725: .Pp
3726: This assumes, for example, that
3727: .Dq %option prefix="xx"
3728: is used for one scanner and
3729: .Dq %option prefix="zz"
3730: is used for the other.
3731: .Pp
3732: .Sy IMPORTANT :
3733: the present form of the scanning class is experimental
1.7 aaron 3734: and may change considerably between major releases.
1.16 jmc 3735: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
3736: .Nm
1.1 deraadt 3737: is a rewrite of the AT&T Unix
1.16 jmc 3738: .Nm lex
3739: tool
3740: (the two implementations do not share any code, though),
3741: with some extensions and incompatibilities, both of which are of concern
3742: to those who wish to write scanners acceptable to either implementation.
3743: .Nm
3744: is fully compliant with the
3745: .Tn POSIX
3746: .Nm lex
1.1 deraadt 3747: specification, except that when using
1.16 jmc 3748: .Dq %pointer
3749: .Pq the default ,
3750: a call to
3751: .Fn unput
1.1 deraadt 3752: destroys the contents of
1.16 jmc 3753: .Fa yytext ,
3754: which is counter to the
3755: .Tn POSIX
3756: specification.
3757: .Pp
3758: In this section we discuss all of the known areas of incompatibility between
3759: .Nm ,
3760: AT&T
3761: .Nm lex ,
3762: and the
3763: .Tn POSIX
3764: specification.
3765: .Pp
3766: .Nm flex Ns 's
3767: .Fl l
1.1 deraadt 3768: option turns on maximum compatibility with the original AT&T
1.16 jmc 3769: .Nm lex
1.1 deraadt 3770: implementation, at the cost of a major loss in the generated scanner's
1.16 jmc 3771: performance.
3772: We note below which incompatibilities can be overcome using the
3773: .Fl l
1.1 deraadt 3774: option.
1.16 jmc 3775: .Pp
3776: .Nm
1.1 deraadt 3777: is fully compatible with
1.16 jmc 3778: .Nm lex
1.1 deraadt 3779: with the following exceptions:
1.16 jmc 3780: .Bl -dash
3781: .It
1.1 deraadt 3782: The undocumented
1.16 jmc 3783: .Nm lex
1.1 deraadt 3784: scanner internal variable
1.16 jmc 3785: .Fa yylineno
1.1 deraadt 3786: is not supported unless
1.16 jmc 3787: .Fl l
1.1 deraadt 3788: or
1.16 jmc 3789: .Dq %option yylineno
1.1 deraadt 3790: is used.
1.16 jmc 3791: .Pp
3792: .Fa yylineno
1.1 deraadt 3793: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 jmc 3794: .Pq single global variable
3795: basis.
3796: .Pp
3797: .Fa yylineno
3798: is not part of the
3799: .Tn POSIX
3800: specification.
3801: .It
1.1 deraadt 3802: The
1.16 jmc 3803: .Fn input
1.1 deraadt 3804: routine is not redefinable, though it may be called to read characters
1.16 jmc 3805: following whatever has been matched by a rule.
3806: If
3807: .Fn input
3808: encounters an end-of-file, the normal
3809: .Fn yywrap
3810: processing is done.
3811: A
3812: .Dq real
3813: end-of-file is returned by
3814: .Fn input
1.1 deraadt 3815: as
1.16 jmc 3816: .Dv EOF .
3817: .Pp
1.1 deraadt 3818: Input is instead controlled by defining the
1.16 jmc 3819: .Dv YY_INPUT
1.1 deraadt 3820: macro.
1.16 jmc 3821: .Pp
1.1 deraadt 3822: The
1.16 jmc 3823: .Nm
1.1 deraadt 3824: restriction that
1.16 jmc 3825: .Fn input
3826: cannot be redefined is in accordance with the
3827: .Tn POSIX
3828: specification, which simply does not specify any way of controlling the
1.1 deraadt 3829: scanner's input other than by making an initial assignment to
1.16 jmc 3830: .Fa yyin .
3831: .It
1.1 deraadt 3832: The
1.16 jmc 3833: .Fn unput
3834: routine is not redefinable.
3835: This restriction is in accordance with
3836: .Tn POSIX .
3837: .It
3838: .Nm
1.1 deraadt 3839: scanners are not as reentrant as
1.16 jmc 3840: .Nm lex
3841: scanners.
3842: In particular, if a scanner is interactive and
3843: an interrupt handler long-jumps out of the scanner,
3844: and the scanner is subsequently called again,
3845: the following error message may be displayed:
3846: .Pp
3847: .D1 fatal flex scanner internal error--end of buffer missed
3848: .Pp
1.1 deraadt 3849: To reenter the scanner, first use
1.16 jmc 3850: .Pp
3851: .Dl yyrestart(yyin);
3852: .Pp
3853: Note that this call will throw away any buffered input;
3854: usually this isn't a problem with an interactive scanner.
3855: .Pp
3856: Also note that flex C++ scanner classes are reentrant,
3857: so if using C++ is an option, they should be used instead.
3858: See
3859: .Sx GENERATING C++ SCANNERS
3860: above for details.
3861: .It
3862: .Fn output
1.1 deraadt 3863: is not supported.
3864: Output from the
1.16 jmc 3865: .Em ECHO
1.1 deraadt 3866: macro is done to the file-pointer
1.16 jmc 3867: .Fa yyout
3868: .Pq default stdout .
3869: .Pp
3870: .Fn output
3871: is not part of the
3872: .Tn POSIX
3873: specification.
3874: .It
3875: .Nm lex
3876: does not support exclusive start conditions
3877: .Pq %x ,
3878: though they are in the
3879: .Tn POSIX
3880: specification.
3881: .It
1.1 deraadt 3882: When definitions are expanded,
1.16 jmc 3883: .Nm
1.1 deraadt 3884: encloses them in parentheses.
1.16 jmc 3885: With
3886: .Nm lex ,
3887: the following:
3888: .Bd -literal -offset indent
3889: NAME [A-Z][A-Z0-9]*
3890: %%
3891: foo{NAME}? printf("Found it\en");
3892: %%
3893: .Ed
3894: .Pp
3895: will not match the string
3896: .Qq foo
3897: because when the macro is expanded the rule is equivalent to
3898: .Qq foo[A-Z][A-Z0-9]*?
3899: and the precedence is such that the
3900: .Sq ?\&
3901: is associated with
3902: .Qq [A-Z0-9]* .
3903: With
3904: .Nm ,
1.1 deraadt 3905: the rule will be expanded to
1.16 jmc 3906: .Qq foo([A-Z][A-Z0-9]*)?
3907: and so the string
3908: .Qq foo
3909: will match.
3910: .Pp
1.1 deraadt 3911: Note that if the definition begins with
1.16 jmc 3912: .Sq ^
1.1 deraadt 3913: or ends with
1.16 jmc 3914: .Sq $
3915: then it is not expanded with parentheses, to allow these operators to appear in
3916: definitions without losing their special meanings.
3917: But the
3918: .Sq Aq s ,
3919: .Sq / ,
1.1 deraadt 3920: and
1.16 jmc 3921: .Aq Aq EOF
1.1 deraadt 3922: operators cannot be used in a
1.16 jmc 3923: .Nm
1.1 deraadt 3924: definition.
1.16 jmc 3925: .Pp
1.1 deraadt 3926: Using
1.16 jmc 3927: .Fl l
1.1 deraadt 3928: results in the
1.16 jmc 3929: .Nm lex
1.1 deraadt 3930: behavior of no parentheses around the definition.
1.16 jmc 3931: .Pp
3932: The
3933: .Tn POSIX
3934: specification is that the definition be enclosed in parentheses.
3935: .It
1.1 deraadt 3936: Some implementations of
1.16 jmc 3937: .Nm lex
3938: allow a rule's action to begin on a separate line,
3939: if the rule's pattern has trailing whitespace:
3940: .Bd -literal -offset indent
3941: %%
3942: foo|bar<space here>
3943: { foobar_action(); }
3944: .Ed
3945: .Pp
3946: .Nm
1.1 deraadt 3947: does not support this feature.
1.16 jmc 3948: .It
1.1 deraadt 3949: The
1.16 jmc 3950: .Nm lex
3951: .Sq %r
3952: .Pq generate a Ratfor scanner
3953: option is not supported.
3954: It is not part of the
3955: .Tn POSIX
3956: specification.
3957: .It
1.1 deraadt 3958: After a call to
1.16 jmc 3959: .Fn unput ,
3960: .Fa yytext
3961: is undefined until the next token is matched,
3962: unless the scanner was built using
3963: .Dq %array .
1.1 deraadt 3964: This is not the case with
1.16 jmc 3965: .Nm lex
3966: or the
3967: .Tn POSIX
3968: specification.
3969: The
3970: .Fl l
1.1 deraadt 3971: option does away with this incompatibility.
1.16 jmc 3972: .It
1.1 deraadt 3973: The precedence of the
1.16 jmc 3974: .Sq {}
3975: .Pq numeric range
3976: operator is different.
3977: .Nm lex
3978: interprets
3979: .Qq abc{1,3}
3980: as match one, two, or three occurrences of
3981: .Sq abc ,
3982: whereas
3983: .Nm
3984: interprets it as match
3985: .Sq ab
3986: followed by one, two, or three occurrences of
3987: .Sq c .
3988: The latter is in agreement with the
3989: .Tn POSIX
3990: specification.
3991: .It
1.1 deraadt 3992: The precedence of the
1.16 jmc 3993: .Sq ^
1.1 deraadt 3994: operator is different.
1.16 jmc 3995: .Nm lex
3996: interprets
3997: .Qq ^foo|bar
3998: as match either
3999: .Sq foo
4000: at the beginning of a line, or
4001: .Sq bar
4002: anywhere, whereas
4003: .Nm
4004: interprets it as match either
4005: .Sq foo
4006: or
4007: .Sq bar
4008: if they come at the beginning of a line.
4009: The latter is in agreement with the
4010: .Tn POSIX
4011: specification.
4012: .It
1.1 deraadt 4013: The special table-size declarations such as
1.16 jmc 4014: .Sq %a
1.1 deraadt 4015: supported by
1.16 jmc 4016: .Nm lex
1.1 deraadt 4017: are not required by
1.16 jmc 4018: .Nm
1.1 deraadt 4019: scanners;
1.16 jmc 4020: .Nm
1.1 deraadt 4021: ignores them.
1.16 jmc 4022: .It
1.1 deraadt 4023: The name
1.16 jmc 4024: .Dv FLEX_SCANNER
1.1 deraadt 4025: is #define'd so scanners may be written for use with either
1.16 jmc 4026: .Nm
1.1 deraadt 4027: or
1.16 jmc 4028: .Nm lex .
1.1 deraadt 4029: Scanners also include
1.16 jmc 4030: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4031: and
1.16 jmc 4032: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4033: indicating which version of
1.16 jmc 4034: .Nm
1.1 deraadt 4035: generated the scanner
1.16 jmc 4036: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4037: respectively).
1.16 jmc 4038: .El
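.Pp
The difference in how definitions are expanded, described in the list above,
can be reproduced with plain text substitution.
The C++ sketch below is an approximation: it uses std::regex in place of the
.Nm
and
.Nm lex
pattern engines, and the names
.Fn expand
and
.Fn matches
are hypothetical.
In std::regex a
.Sq ?\&
following
.Sq *
makes the repetition non-greedy rather than optional, but the set of
strings accepted as a whole is the same, which is all that is tested here.
.Bd -literal -offset indent
```cpp
#include <cstddef>
#include <regex>
#include <string>

// Replace every {name} in rule with the definition, optionally
// wrapping the definition in parentheses first.
std::string expand(const std::string& rule, const std::string& name,
                   const std::string& def, bool parenthesize)
{
    const std::string ref = "{" + name + "}";
    const std::string body = parenthesize ? "(" + def + ")" : def;
    std::string out = rule;
    std::size_t at = 0;
    while ((at = out.find(ref, at)) != std::string::npos) {
        out.replace(at, ref.size(), body);
        at += body.size();      // continue after the inserted text
    }
    return out;
}

// True if the whole of text is matched by pattern.
bool matches(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```
.Ed
.Pp
Expanding the rule foo{NAME}? with parentheses gives
foo([A-Z][A-Z0-9]*)?, which accepts the string foo;
expanding it without them gives foo[A-Z][A-Z0-9]*?, which does not,
because the leading [A-Z] remains mandatory.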
4039: .Pp
1.1 deraadt 4040: The following
1.16 jmc 4041: .Nm
1.1 deraadt 4042: features are not included in
1.16 jmc 4043: .Nm lex
4044: or the
4045: .Tn POSIX
4046: specification:
4047: .Bd -unfilled -offset indent
4048: C++ scanners
4049: %option
4050: start condition scopes
4051: start condition stacks
4052: interactive/non-interactive scanners
4053: yy_scan_string() and friends
4054: yyterminate()
4055: yy_set_interactive()
4056: yy_set_bol()
4057: YY_AT_BOL()
4058: <<EOF>>
4059: <*>
4060: YY_DECL
4061: YY_START
4062: YY_USER_ACTION
4063: YY_USER_INIT
4064: #line directives
4065: %{}'s around actions
4066: multiple actions on a line
4067: .Ed
4068: .Pp
4069: plus almost all of the
4070: .Nm
4071: flags.
1.1 deraadt 4072: The last feature in the list refers to the fact that with
1.16 jmc 4073: .Nm
4074: multiple actions can be placed on the same line,
4075: separated with semi-colons, while with
4076: .Nm lex ,
1.1 deraadt 4077: the following
1.16 jmc 4078: .Pp
4079: .Dl foo handle_foo(); ++num_foos_seen;
4080: .Pp
4081: is
4082: .Pq rather surprisingly
4083: truncated to
4084: .Pp
4085: .Dl foo handle_foo();
4086: .Pp
4087: .Nm
4088: does not truncate the action.
4089: Actions that are not enclosed in braces
4090: are simply terminated at the end of the line.
4091: .Sh FILES
4092: .Bl -tag -width "<g++/FlexLexer.h>"
4093: .It flex.skl
4094: Skeleton scanner.
4095: This file is only used when building flex, not when
4096: .Nm
4097: executes.
4098: .It lex.backup
4099: Backing-up information for the
4100: .Fl b
4101: flag (called
4102: .Pa lex.bck
4103: on some systems).
4104: .It lex.yy.c
4105: Generated scanner
4106: (called
4107: .Pa lexyy.c
4108: on some systems).
4109: .It lex.yy.cc
4110: Generated C++ scanner class, when using
4111: .Fl + .
4112: .It Aq g++/FlexLexer.h
4113: Header file defining the C++ scanner base class,
4114: .Fa FlexLexer ,
4115: and its derived class,
4116: .Fa yyFlexLexer .
4117: .It /usr/lib/libl.*
4118: .Nm
4119: libraries.
4120: The
4121: .Pa /usr/lib/libfl.*\&
4122: libraries are links to these.
4123: Scanners must be linked using either
4124: .Fl \&ll
4125: or
4126: .Fl lfl .
4127: .El
4128: .Sh DIAGNOSTICS
4129: .Bl -diag
4130: .It warning, rule cannot be matched
4131: Indicates that the given rule cannot be matched because it follows other rules
4132: that will always match the same text as it.
4133: For example, in the following
4134: .Dq foo
4135: cannot be matched because it comes after an identifier
4136: .Qq catch-all
4137: rule:
4138: .Bd -literal -offset indent
4139: [a-z]+ got_identifier();
4140: foo got_foo();
4141: .Ed
4142: .Pp
1.1 deraadt 4143: Using
1.16 jmc 4144: .Em REJECT
1.1 deraadt 4145: in a scanner suppresses this warning.
1.16 jmc 4146: .It "warning, \-s option given but default rule can be matched"
4147: Means that it is possible
4148: .Pq perhaps only in a particular start condition
4149: that the default rule
4150: .Pq match any single character
4151: is the only one that will match a particular input.
4152: Since
4153: .Fl s
1.1 deraadt 4154: was given, presumably this is not intended.
1.16 jmc 4155: .It reject_used_but_not_detected undefined
4156: .It yymore_used_but_not_detected undefined
4157: These errors can occur at compile time.
4158: They indicate that the scanner uses
4159: .Em REJECT
1.1 deraadt 4160: or
1.16 jmc 4161: .Fn yymore
1.1 deraadt 4162: but that
1.16 jmc 4163: .Nm
1.1 deraadt 4164: failed to notice the fact, meaning that
1.16 jmc 4165: .Nm
1.1 deraadt 4166: scanned the first two sections looking for occurrences of these actions
1.16 jmc 4167: and failed to find any, but somehow they snuck in
4168: .Pq via an #include file, for example .
4169: Use
4170: .Dq %option reject
4171: or
4172: .Dq %option yymore
4173: to indicate to
4174: .Nm
4175: that these features are really needed.
4176: .It flex scanner jammed
4177: A scanner compiled with
4178: .Fl s
4179: has encountered an input string which wasn't matched by any of its rules.
4180: This error can also occur due to internal problems.
4181: .It token too large, exceeds YYLMAX
4182: The scanner uses
4183: .Dq %array
1.1 deraadt 4184: and one of its rules matched a string longer than the
1.16 jmc 4185: .Dv YYLMAX
4186: constant
4187: .Pq 8K bytes by default .
4188: The value can be increased by #define'ing
4189: .Dv YYLMAX
4190: in the definitions section of
4191: .Nm
1.1 deraadt 4192: input.
1.16 jmc 4193: .It "scanner requires \-8 flag to use the character 'x'"
4194: The scanner specification includes recognizing the 8-bit character
4195: .Sq x
4196: and the
4197: .Fl 8
4198: flag was not specified, so the scanner defaulted to 7-bit because the
4199: .Fl Cf
4200: or
4201: .Fl CF
4202: table compression options were used.
4203: See the discussion of the
4204: .Fl 7
1.1 deraadt 4205: flag for details.
1.16 jmc 4206: .It flex scanner push-back overflow
4207: unput() was used to push back so much text that the scanner's buffer
4208: could not hold both the pushed-back text and the current token in
4209: .Fa yytext .
4210: Ideally the scanner should dynamically resize the buffer in this case,
4211: but at present it does not.
4212: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4213: The scanner was working on matching an extremely large token and needed
4214: to expand the input buffer.
4215: This doesn't work with scanners that use
4216: .Em REJECT .
4217: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4218: This can occur in a scanner which is reentered after a long-jump
1.16 jmc 4219: has jumped out
4220: .Pq or over
4221: the scanner's activation frame.
4222: Before reentering the scanner, use:
4223: .Pp
4224: .Dl yyrestart(yyin);
4225: .Pp
1.1 deraadt 4226: or, as noted above, switch to using the C++ scanner class.
1.16 jmc 4227: .It "too many start conditions in <> construct!"
4228: More start conditions than exist were listed in a <> construct
4229: (so at least one of them must have been listed twice).
4230: .El
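.Pp
For the
.Dq rule cannot be matched
warning, one common fix is to list the more specific rule first: when two rules match text of the same length, the rule appearing earlier in the input is chosen.
A minimal sketch, reusing the illustrative got_foo() and got_identifier() actions from the example in that entry:
.Bd -literal -offset indent
```lex
foo      got_foo();
[a-z]+   got_identifier();
```
.Ed
With this ordering
.Dq foo
is handled by the first rule, while longer input such as
.Dq food
still goes to the identifier rule, because the longest match takes precedence.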
4231: .Sh SEE ALSO
4232: .Xr awk 1 ,
4233: .Xr lex 1 ,
4234: .Xr sed 1 ,
4235: .Xr yacc 1
4236: .Pp
1.19 jmc 4237: "Lex \- A Lexical Analyzer Generator",
4238: .Pa /usr/share/doc/psd/16.lex/ .
1.16 jmc 4239: .Rs
4240: .%A John Levine
4241: .%A Tony Mason
4242: .%A Doug Brown
4243: .%B Lex & Yacc
4244: .%I O'Reilly and Associates
4245: .%N 2nd edition
4246: .Re
4247: .Rs
4248: .%A Alfred Aho
4249: .%A Ravi Sethi
4250: .%A Jeffrey Ullman
4251: .%B Compilers: Principles, Techniques and Tools
4252: .%I Addison-Wesley
4253: .%D 1986
4254: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4255: .Re
4256: .Sh AUTHORS
1.1 deraadt 4257: Vern Paxson, with the help of many ideas and much inspiration from
1.16 jmc 4258: Van Jacobson.
4259: Original version by Jef Poskanzer.
4260: The fast table representation is a partial implementation of a design done by
4261: Van Jacobson.
4262: The implementation was done by Kevin Gong and Vern Paxson.
4263: .Pp
1.1 deraadt 4264: Thanks to the many
1.16 jmc 4265: .Nm
1.1 deraadt 4266: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4267: Casey Leedom,
4268: Robert Abramovitz,
4269: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4270: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4271: Karl Berry, Peter A. Bigot, Simon Blanchard,
4272: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4273: Brian Clapper, J.T. Conklin,
4274: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4275: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4276: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4277: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4278: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4279: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4280: Jan Hajic, Charles Hemphill, NORO Hideo,
4281: Jarkko Hietaniemi, Scott Hofmann,
4282: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4283: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4284: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4285: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4286: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4287: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4288: David Loffredo, Mike Long,
4289: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4290: Bengt Martensson, Chris Metcalf,
4291: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4292: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4293: Richard Ohnemus, Karsten Pahnke,
1.16 jmc 4294: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4295: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4296: Frederic Raimbault, Pat Rankin, Rick Richardson,
4297: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4298: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4299: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4300: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
4301: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4302: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 jmc 4303: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4304: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4305: and those whose names have slipped my marginal mail-archiving skills
4306: but whose contributions are appreciated all the
1.1 deraadt 4307: same.
1.16 jmc 4308: .Pp
1.1 deraadt 4309: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4310: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4311: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4312: distribution headaches.
1.16 jmc 4313: .Pp
4314: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4315: to Benson Margulies and Fred Burke for C++ support;
4316: to Kent Williams and Tom Epperly for C++ class support;
4317: to Ove Ewerlid for support of NUL's;
4318: and to Eric Hughes for support of multiple buffers.
4319: .Pp
1.1 deraadt 4320: This work was primarily done when I was with the Real Time Systems Group
1.16 jmc 4321: at the Lawrence Berkeley Laboratory in Berkeley, CA.
4322: Many thanks to all there for the support I received.
4323: .Pp
4324: Send comments to
4325: .Aq vern@ee.lbl.gov .
4326: .Sh BUGS
4327: Some trailing context patterns cannot be properly matched and generate
4328: warning messages
4329: .Pq "dangerous trailing context" .
4330: These are patterns where the ending of the first part of the rule
4331: matches the beginning of the second part, such as
4332: .Qq zx*/xy* ,
4333: where the
4334: .Sq x*
4335: matches the
4336: .Sq x
4337: at the beginning of the trailing context.
4338: (Note that the POSIX draft states that the text matched by such patterns
4339: is undefined.)
4340: .Pp
4341: For some trailing context rules, parts which are actually fixed-length are
4342: not recognized as such, leading to the above-mentioned performance loss.
4343: In particular, parts using
4344: .Sq |\&
4345: or
4346: .Sq {n}
4347: (such as
4348: .Qq foo{3} )
4349: are always considered variable-length.
4350: .Pp
4351: Combining trailing context with the special
4352: .Sq |\&
4353: action can result in fixed trailing context being turned into
4354: the more expensive variable trailing context.
4355: This happens, for example, in the following:
4356: .Bd -literal -offset indent
4357: %%
4358: abc |
4359: xyz/def
4360: .Ed
4361: .Pp
4362: Use of
4363: .Fn unput
4364: invalidates yytext and yyleng, unless the
4365: .Dq %array
4366: directive
4367: or the
4368: .Fl l
4369: option has been used.
4370: .Pp
4371: Pattern-matching of NUL's is substantially slower than matching other
4372: characters.
4373: .Pp
4374: Dynamic resizing of the input buffer is slow, as it entails rescanning
4375: all the text matched so far by the current
4376: .Pq generally huge
4377: token.
4378: .Pp
4379: Due to both buffering of input and read-ahead,
4380: it is not possible to intermix calls to
4381: .Aq Pa stdio.h
4382: routines such as
4383: .Fn getchar ,
4384: with
4385: .Nm
4386: rules and expect it to work.
4387: Call
4388: .Fn input
4389: instead.
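.Pp
A minimal sketch of reading ahead from inside an action by way of the scanner's own buffer rather than stdio.
The rule itself is hypothetical (it discards a brace-delimited comment), and the loop assumes that
.Fn input
returns
.Dv EOF
at end of input, as in the 2.5 release described here; other releases may signal end of input differently.
.Bd -literal -offset indent
```lex
"{"     {
            int c;

            /* Consume characters through the scanner's buffer. */
            while ((c = input()) != '}' && c != EOF)
                ;
        }
```
.Ed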
4390: .Pp
4391: The total number of table entries listed by the
4392: .Fl v
4393: flag excludes the number of table entries needed to determine
4394: what rule has been matched.
4395: The number of entries is equal to the number of DFA states
4396: if the scanner does not use
4397: .Em REJECT ,
4398: and somewhat greater than the number of states if it does.
4399: .Pp
4400: .Em REJECT
4401: cannot be used with the
4402: .Fl f
4403: or
4404: .Fl F
4405: options.
4406: .Pp
4407: The
4408: .Nm
4409: internal algorithms need documentation.