Annotation of src/usr.bin/lex/flex.1, Revision 1.34
1.34 ! schwarze 1: .\" $OpenBSD: flex.1,v 1.33 2013/01/18 21:48:43 jmc Exp $
1.16 jmc 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 jmc 31: .\"
1.34 ! schwarze 32: .Dd $Mdocdate: January 18 2013 $
1.16 jmc 33: .Dt FLEX 1
34: .Os
35: .Sh NAME
36: .Nm flex
37: .Nd fast lexical analyzer generator
38: .Sh SYNOPSIS
39: .Nm
1.28 jmc 40: .Bk -words
1.31 jmc 41: .Op Fl 78BbdFfhIiLlnpsTtVvw+?
1.16 jmc 42: .Op Fl C Ns Op Cm aeFfmr
43: .Op Fl Fl help
44: .Op Fl Fl version
1.28 jmc 45: .Op Fl o Ns Ar output
46: .Op Fl P Ns Ar prefix
47: .Op Fl S Ns Ar skeleton
48: .Op Ar
49: .Ek
1.21 jmc 50: .Sh DESCRIPTION
51: .Nm
52: is a tool for generating
53: .Em scanners :
54: programs which recognize lexical patterns in text.
55: .Nm
56: reads the given input files, or its standard input if no file names are given,
57: for a description of a scanner to generate.
58: The description is in the form of pairs of regular expressions and C code,
59: called
60: .Em rules .
61: .Nm
62: generates as output a C source file,
63: .Pa lex.yy.c ,
64: which defines a routine
65: .Fn yylex .
66: This file is compiled and linked with the
67: .Fl lfl
68: library to produce an executable.
69: When the executable is run, it analyzes its input for occurrences
70: of the regular expressions.
71: Whenever it finds one, it executes the corresponding C code.
72: .Pp
1.16 jmc 73: The manual includes both tutorial and reference sections:
74: .Bl -ohang
75: .It Sy Some Simple Examples
76: .It Sy Format of the Input File
77: .It Sy Patterns
78: The extended regular expressions used by
79: .Nm .
80: .It Sy How the Input is Matched
81: The rules for determining what has been matched.
82: .It Sy Actions
83: How to specify what to do when a pattern is matched.
84: .It Sy The Generated Scanner
85: Details regarding the scanner that
86: .Nm
87: produces;
88: how to control the input source.
89: .It Sy Start Conditions
90: Introducing context into scanners, and managing
91: .Qq mini-scanners .
92: .It Sy Multiple Input Buffers
93: How to manipulate multiple input sources;
94: how to scan from strings instead of files.
95: .It Sy End-of-File Rules
96: Special rules for matching the end of the input.
97: .It Sy Miscellaneous Macros
98: A summary of macros available to the actions.
99: .It Sy Values Available to the User
100: A summary of values available to the actions.
101: .It Sy Interfacing with Yacc
102: Connecting flex scanners together with
103: .Xr yacc 1
104: parsers.
105: .It Sy Options
106: .Nm
107: command-line options, and the
108: .Dq %option
109: directive.
110: .It Sy Performance Considerations
111: How to make scanners go as fast as possible.
112: .It Sy Generating C++ Scanners
113: The
114: .Pq experimental
115: facility for generating C++ scanner classes.
116: .It Sy Incompatibilities with Lex and POSIX
117: How
118: .Nm
119: differs from AT&T lex and the
120: .Tn POSIX
121: lex standard.
122: .It Sy Files
123: Files used by
124: .Nm .
125: .It Sy Diagnostics
126: Those error messages produced by
127: .Nm
128: .Pq or scanners it generates
129: whose meanings might not be apparent.
130: .It Sy See Also
131: Other documentation, related tools.
132: .It Sy Authors
133: Includes contact information.
134: .It Sy Bugs
135: Known problems with
136: .Nm .
137: .El
138: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 139: First some simple examples to get the flavor of how one uses
1.16 jmc 140: .Nm .
1.1 deraadt 141: The following
1.16 jmc 142: .Nm
1.1 deraadt 143: input specifies a scanner which, whenever it encounters the string
1.16 jmc 144: .Qq username ,
145: replaces it with the user's login name:
146: .Bd -literal -offset indent
147: %%
148: username printf("%s", getlogin());
149: .Ed
150: .Pp
1.1 deraadt 151: By default, any text not matched by a
1.16 jmc 152: .Nm
153: scanner is copied to the output, so the net effect of this scanner is
154: to copy its input file to its output with each occurrence of
155: .Qq username
156: expanded.
157: In this input, there is just one rule.
158: .Qq username
159: is the
160: .Em pattern
161: and the
162: .Qq printf
163: is the
164: .Em action .
165: The
166: .Qq %%
167: marks the beginning of the rules.
168: .Pp
1.1 deraadt 169: Here's another simple example:
1.16 jmc 170: .Bd -literal -offset indent
1.20 pvalchev 171: %{
1.16 jmc 172: int num_lines = 0, num_chars = 0;
1.20 pvalchev 173: %}
1.1 deraadt 174:
1.16 jmc 175: %%
176: \en ++num_lines; ++num_chars;
177: \&. ++num_chars;
178:
179: %%
180: int main(void)
181: {
182: yylex();
183: printf("# of lines = %d, # of chars = %d\en",
184: num_lines, num_chars);
185: }
186: .Ed
187: .Pp
1.1 deraadt 188: This scanner counts the number of characters and the number
1.16 jmc 189: of lines in its input
190: (it produces no output other than the final report on the counts).
191: The first line declares two globals,
192: .Qq num_lines
193: and
194: .Qq num_chars ,
195: which are accessible both inside
196: .Fn yylex
1.1 deraadt 197: and in the
1.16 jmc 198: .Fn main
199: routine declared after the second
200: .Qq %% .
201: There are two rules, one which matches a newline
202: .Pq \&"\en\&"
203: and increments both the line count and the character count,
204: and one which matches any character other than a newline
205: (indicated by the
206: .Qq \&.
207: regular expression).
208: .Pp
1.1 deraadt 209: A somewhat more complicated example:
1.16 jmc 210: .Bd -literal -offset indent
211: /* scanner for a toy Pascal-like language */
1.1 deraadt 212:
1.16 jmc 213: %{
214: /* need this for the call to atof() below */
215: #include <stdlib.h>
216: %}
1.1 deraadt 217:
1.16 jmc 218: DIGIT [0-9]
219: ID [a-z][a-z0-9]*
1.1 deraadt 220:
1.16 jmc 221: %%
1.1 deraadt 222:
1.16 jmc 223: {DIGIT}+ {
224: printf("An integer: %s (%d)\en", yytext,
225: atoi(yytext));
226: }
1.1 deraadt 227:
1.16 jmc 228: {DIGIT}+"."{DIGIT}* {
229: printf("A float: %s (%g)\en", yytext,
230: atof(yytext));
231: }
1.1 deraadt 232:
1.16 jmc 233: if|then|begin|end|procedure|function {
234: printf("A keyword: %s\en", yytext);
235: }
1.1 deraadt 236:
1.16 jmc 237: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 238:
1.16 jmc 239: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 240:
1.16 jmc 241: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 242:
1.16 jmc 243: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 244:
1.16 jmc 245: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 246:
1.16 jmc 247: %%
1.1 deraadt 248:
1.16 jmc 249: int main(int argc, char *argv[])
250: {
251: ++argv; --argc; /* skip over program name */
252: if (argc > 0)
253: yyin = fopen(argv[0], "r");
1.1 deraadt 254: else
255: yyin = stdin;
1.7 aaron 256:
1.1 deraadt 257: yylex();
1.16 jmc 258: }
259: .Ed
260: .Pp
261: This is the beginning of a simple scanner for a language like Pascal.
262: It identifies different types of
263: .Em tokens
1.1 deraadt 264: and reports on what it has seen.
1.16 jmc 265: .Pp
266: The details of this example will be explained in the following sections.
267: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 268: The
1.16 jmc 269: .Nm
1.1 deraadt 270: input file consists of three sections, separated by a line with just
1.16 jmc 271: .Qq %%
1.1 deraadt 272: in it:
1.16 jmc 273: .Bd -unfilled -offset indent
274: definitions
275: %%
276: rules
277: %%
278: user code
279: .Ed
280: .Pp
1.1 deraadt 281: The
1.16 jmc 282: .Em definitions
1.1 deraadt 283: section contains declarations of simple
1.16 jmc 284: .Em name
1.1 deraadt 285: definitions to simplify the scanner specification, and declarations of
1.16 jmc 286: .Em start conditions ,
1.1 deraadt 287: which are explained in a later section.
1.16 jmc 288: .Pp
1.1 deraadt 289: Name definitions have the form:
1.16 jmc 290: .Pp
291: .D1 name definition
292: .Pp
293: The
294: .Qq name
295: is a word beginning with a letter or an underscore
296: .Pq Sq _
297: followed by zero or more letters, digits,
298: .Sq _ ,
299: or
300: .Sq -
301: .Pq dash .
1.8 aaron 302: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 303: following the name and to continue to the end of the line.
1.16 jmc 304: The definition can subsequently be referred to using
305: .Qq {name} ,
306: which will expand to
307: .Qq (definition) .
308: For example:
309: .Bd -literal -offset indent
310: DIGIT [0-9]
311: ID [a-z][a-z0-9]*
312: .Ed
313: .Pp
314: This defines
315: .Qq DIGIT
316: to be a regular expression which matches a single digit, and
317: .Qq ID
318: to be a regular expression which matches a letter
1.1 deraadt 319: followed by zero-or-more letters-or-digits.
320: A subsequent reference to
1.16 jmc 321: .Pp
322: .Dl {DIGIT}+"."{DIGIT}*
323: .Pp
1.1 deraadt 324: is identical to
1.16 jmc 325: .Pp
326: .Dl ([0-9])+"."([0-9])*
327: .Pp
328: and matches one-or-more digits followed by a
329: .Sq .\&
330: followed by zero-or-more digits.
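.Pp
The substitution described above can be sketched in plain C.
This is an illustration only, not the code
.Nm
itself uses:
the names expand_names and struct name_def are made up for the example,
and the sketch does not treat quoted strings or character classes specially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One name definition from the definitions section (illustrative type). */
struct name_def {
	const char *name;
	const char *definition;
};

/*
 * Copy pattern into out, replacing each {name} that has a definition
 * with (definition).  Anything in braces that is not a known name,
 * such as the repetition {2,5}, is copied unchanged.
 * Returns 0 on success, -1 if out is too small.
 */
int
expand_names(const char *pattern, const struct name_def *defs, size_t ndefs,
    char *out, size_t outsz)
{
	size_t used = 0;

	assert(pattern != NULL && out != NULL);
	if (outsz == 0)
		return -1;
	while (*pattern != 0) {
		const char *end = NULL;
		const char *def = NULL;
		size_t i, len;

		if (*pattern == '{' && (end = strchr(pattern, '}')) != NULL) {
			/* Length of the text between the braces. */
			len = (size_t)(end - pattern - 1);
			for (i = 0; i < ndefs; i++) {
				if (strlen(defs[i].name) == len &&
				    strncmp(defs[i].name, pattern + 1, len) == 0) {
					def = defs[i].definition;
					break;
				}
			}
		}
		if (def != NULL) {
			/* Room for "(", the definition, ")" and the NUL. */
			len = strlen(def);
			if (used + len + 2 >= outsz)
				return -1;
			out[used++] = '(';
			memcpy(out + used, def, len);
			used += len;
			out[used++] = ')';
			pattern = end + 1;
		} else {
			if (used + 1 >= outsz)
				return -1;
			out[used++] = *pattern++;
		}
	}
	out[used] = 0;
	return 0;
}
```

With DIGIT defined as [0-9], expanding {DIGIT}+"."{DIGIT}* this way yields
([0-9])+"."([0-9])*, the same text shown above.
The added parentheses are what keep a following operator such as + applied
to the whole definition rather than to its last element.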
331: .Pp
1.1 deraadt 332: The
1.16 jmc 333: .Em rules
1.1 deraadt 334: section of the
1.16 jmc 335: .Nm
1.1 deraadt 336: input contains a series of rules of the form:
1.16 jmc 337: .Pp
338: .D1 pattern action
339: .Pp
340: The pattern must be unindented and the action must begin
1.1 deraadt 341: on the same line.
1.16 jmc 342: .Pp
1.1 deraadt 343: See below for a further description of patterns and actions.
1.16 jmc 344: .Pp
1.1 deraadt 345: Finally, the user code section is simply copied to
1.16 jmc 346: .Pa lex.yy.c
1.1 deraadt 347: verbatim.
1.16 jmc 348: It is used for companion routines which call or are called by the scanner.
349: The presence of this section is optional;
1.1 deraadt 350: if it is missing, the second
1.16 jmc 351: .Qq %%
352: in the input file may be skipped too.
353: .Pp
354: In the definitions and rules sections, any indented text or text enclosed in
355: .Sq %{
1.1 deraadt 356: and
1.16 jmc 357: .Sq %}
358: is copied verbatim to the output
359: .Pq with the %{}'s removed .
1.1 deraadt 360: The %{}'s must appear unindented on lines by themselves.
1.16 jmc 361: .Pp
1.1 deraadt 362: In the rules section,
1.16 jmc 363: any indented or %{} text appearing before the first rule may be used to
364: declare variables which are local to the scanning routine and
365: .Pq after the declarations
1.1 deraadt 366: code which is to be executed whenever the scanning routine is entered.
367: Other indented or %{} text in the rule section is still copied to the output,
368: but its meaning is not well-defined and it may well cause compile-time
369: errors (this feature is present for
1.16 jmc 370: .Tn POSIX
1.1 deraadt 371: compliance; see below for other such features).
1.16 jmc 372: .Pp
373: In the definitions section
374: .Pq but not in the rules section ,
375: an unindented comment
376: (i.e., a line beginning with
377: .Qq /* )
378: is also copied verbatim to the output up to the next
379: .Qq */ .
380: .Sh PATTERNS
1.1 deraadt 381: The patterns in the input are written using an extended set of regular
1.16 jmc 382: expressions.
383: These are:
384: .Bl -tag -width "XXXXXXXX"
385: .It x
386: Match the character
387: .Sq x .
388: .It .\&
389: Any character
390: .Pq byte
391: except newline.
392: .It [xyz]
393: A
394: .Qq character class ;
395: in this case, the pattern matches either an
396: .Sq x ,
397: a
398: .Sq y ,
399: or a
400: .Sq z .
401: .It [abj-oZ]
402: A
403: .Qq character class
404: with a range in it; matches an
405: .Sq a ,
406: a
407: .Sq b ,
408: any letter from
409: .Sq j
410: through
411: .Sq o ,
412: or a
413: .Sq Z .
414: .It [^A-Z]
415: A
416: .Qq negated character class ,
417: i.e., any character but those in the class.
418: In this case, any character EXCEPT an uppercase letter.
419: .It [^A-Z\en]
420: Any character EXCEPT an uppercase letter or a newline.
421: .It r*
422: Zero or more r's, where
423: .Sq r
424: is any regular expression.
425: .It r+
426: One or more r's.
427: .It r?
428: Zero or one r's (that is,
429: .Qq an optional r ) .
430: .It r{2,5}
431: Anywhere from two to five r's.
432: .It r{2,}
433: Two or more r's.
434: .It r{4}
435: Exactly 4 r's.
436: .It {name}
437: The expansion of the
438: .Qq name
439: definition
440: .Pq see above .
441: .It \&"[xyz]\e\&"foo\&"
442: The literal string: [xyz]"foo.
443: .It \eX
444: If
445: .Sq X
446: is an
447: .Sq a ,
448: .Sq b ,
449: .Sq f ,
450: .Sq n ,
451: .Sq r ,
452: .Sq t ,
453: or
454: .Sq v ,
455: then the ANSI-C interpretation of
456: .Sq \eX .
457: Otherwise, a literal
458: .Sq X
459: (used to escape operators such as
460: .Sq * ) .
461: .It \e0
462: A NUL character
463: .Pq ASCII code 0 .
464: .It \e123
465: The character with octal value 123.
466: .It \ex2a
467: The character with hexadecimal value 2a.
468: .It (r)
469: Match an
470: .Sq r ;
471: parentheses are used to override precedence
472: .Pq see below .
473: .It rs
474: The regular expression
475: .Sq r
476: followed by the regular expression
477: .Sq s ;
478: called
479: .Qq concatenation .
480: .It r|s
481: Either an
482: .Sq r
483: or an
484: .Sq s .
485: .It r/s
486: An
487: .Sq r ,
488: but only if it is followed by an
489: .Sq s .
490: The text matched by
491: .Sq s
492: is included when determining whether this rule is the
493: .Qq longest match ,
494: but is then returned to the input before the action is executed.
495: So the action only sees the text matched by
496: .Sq r .
497: This type of pattern is called
498: .Qq trailing context .
499: (There are some combinations of r/s that
500: .Nm
501: cannot match correctly; see notes in the
502: .Sx BUGS
503: section below regarding
504: .Qq dangerous trailing context . )
505: .It ^r
506: An
507: .Sq r ,
508: but only at the beginning of a line
509: (i.e., just starting to scan, or right after a newline has been scanned).
510: .It r$
511: An
512: .Sq r ,
513: but only at the end of a line
514: .Pq i.e., just before a newline .
515: Equivalent to
516: .Qq r/\en .
517: .Pp
518: Note that
519: .Nm flex Ns 's
520: notion of
521: .Qq newline
522: is exactly whatever the C compiler used to compile
523: .Nm
524: interprets
525: .Sq \en
526: as.
527: .\" In particular, on some DOS systems you must either filter out \er's in the
528: .\" input yourself, or explicitly use r/\er\en for
529: .\" .Qq r$ .
530: .It <s>r
531: An
532: .Sq r ,
533: but only in start condition
534: .Sq s
535: .Pq see below for discussion of start conditions .
536: .It <s1,s2,s3>r
537: The same, but in any of start conditions s1, s2, or s3.
538: .It <*>r
539: An
540: .Sq r
541: in any start condition, even an exclusive one.
542: .It <<EOF>>
543: An end-of-file.
544: .It <s1,s2><<EOF>>
545: An end-of-file when in start condition s1 or s2.
546: .El
547: .Pp
1.1 deraadt 548: Note that inside of a character class, all regular expression operators
1.16 jmc 549: lose their special meaning except escape
550: .Pq Sq \e
551: and the character class operators,
552: .Sq - ,
553: .Sq ]\& ,
554: and, at the beginning of the class,
555: .Sq ^ .
556: .Pp
1.1 deraadt 557: The regular expressions listed above are grouped according to
558: precedence, from highest precedence at the top to lowest at the bottom.
1.16 jmc 559: Those grouped together have equal precedence.
560: For example,
561: .Pp
562: .D1 foo|bar*
563: .Pp
1.1 deraadt 564: is the same as
1.16 jmc 565: .Pp
566: .D1 (foo)|(ba(r*))
567: .Pp
568: since the
569: .Sq *
570: operator has higher precedence than concatenation,
571: and concatenation higher than alternation
572: .Pq Sq |\& .
573: This pattern therefore matches
574: .Em either
575: the string
576: .Qq foo
577: .Em or
578: the string
579: .Qq ba
580: followed by zero-or-more r's.
581: To match
582: .Qq foo
583: or zero-or-more "bar"'s,
584: use:
585: .Pp
586: .D1 foo|(bar)*
587: .Pp
1.1 deraadt 588: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 jmc 589: .Pp
590: .D1 (foo|bar)*
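.Pp
The three groupings above can be checked with a short C sketch.
It assumes POSIX
.Xr regcomp 3
as a stand-in: that is not the matcher
.Nm
generates, but its extended regular expressions give *, concatenation,
and | the same relative precedence.
The helper name matches_whole is made up for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if pattern matches all of text, 0 if it does not, and -1 if
 * the pattern is too long or fails to compile.  The pattern is wrapped
 * as ^(pattern)$ so that a partial match is not reported as success.
 */
int
matches_whole(const char *pattern, const char *text)
{
	char anchored[256];
	regex_t re;
	int n, rc;

	assert(pattern != NULL && text != NULL);
	n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return -1;
	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Under these assumptions, matches_whole("foo|bar*", "barbar") returns 0
because the * binds only to the final r,
while matches_whole("foo|(bar)*", "barbar") returns 1.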
591: .Pp
1.1 deraadt 592: In addition to characters and ranges of characters, character classes
593: can also contain character class
1.16 jmc 594: .Em expressions .
1.1 deraadt 595: These are expressions enclosed inside
1.16 jmc 596: .Sq [:
597: and
598: .Sq :]
599: delimiters (which themselves must appear between the
1.26 schwarze 600: .Sq \&[
1.1 deraadt 601: and
1.16 jmc 602: .Sq ]\&
603: of the
1.1 deraadt 604: character class; other elements may occur inside the character class, too).
605: The valid expressions are:
1.16 jmc 606: .Bd -unfilled -offset indent
607: [:alnum:] [:alpha:] [:blank:]
608: [:cntrl:] [:digit:] [:graph:]
609: [:lower:] [:print:] [:punct:]
610: [:space:] [:upper:] [:xdigit:]
611: .Ed
612: .Pp
1.1 deraadt 613: These expressions all designate a set of characters equivalent to
614: the corresponding standard C
1.16 jmc 615: .Fn isXXX
616: function.
617: For example, [:alnum:] designates those characters for which
618: .Xr isalnum 3
619: returns true \- i.e., any alphabetic or numeric character.
1.1 deraadt 620: Some systems don't provide
1.16 jmc 621: .Xr isblank 3 ,
622: so
623: .Nm
624: defines [:blank:] as a blank or a tab.
625: .Pp
1.1 deraadt 626: For example, the following character classes are all equivalent:
1.16 jmc 627: .Bd -unfilled -offset indent
628: [[:alnum:]]
629: [[:alpha:][:digit:]]
630: [[:alpha:]0-9]
631: [a-zA-Z0-9]
632: .Ed
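.Pp
Because each class expression corresponds to a C classification function,
the equivalence above can be sketched with
.In ctype.h .
The function in_class and its table are made up for the example,
and the result assumes the default C locale;
isblank is used for [:blank:], which requires a C99 library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Class expression names paired with their classification functions. */
static const struct {
	const char *name;
	int (*test)(int);
} classes[] = {
	{ "alnum", isalnum }, { "alpha", isalpha }, { "blank", isblank },
	{ "cntrl", iscntrl }, { "digit", isdigit }, { "graph", isgraph },
	{ "lower", islower }, { "print", isprint }, { "punct", ispunct },
	{ "space", isspace }, { "upper", isupper }, { "xdigit", isxdigit }
};

/*
 * Return 1 if byte c belongs to the named class expression, for
 * example in_class("alnum", 'x'), and 0 otherwise.  Unknown names
 * match nothing.
 */
int
in_class(const char *name, int c)
{
	size_t i;

	assert(name != NULL && c >= 0 && c <= 255);
	for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
		if (strcmp(classes[i].name, name) == 0)
			return classes[i].test(c) != 0;
	}
	return 0;
}
```

For every 7-bit character c, in_class("alnum", c) agrees with
in_class("alpha", c) || in_class("digit", c),
which is why [[:alnum:]] and [[:alpha:][:digit:]] describe the same set.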
633: .Pp
634: If the scanner is case-insensitive (the
635: .Fl i
636: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
637: .Pp
1.1 deraadt 638: Some notes on patterns:
1.16 jmc 639: .Bl -dash
640: .It
641: A negated character class such as the example
642: .Qq [^A-Z]
643: above will match a newline unless "\en"
644: .Pq or an equivalent escape sequence
645: is one of the characters explicitly present in the negated character class
646: (e.g.,
647: .Qq [^A-Z\en] ) .
648: This is unlike how many other regular expression tools treat negated character
649: classes, but unfortunately the inconsistency is historically entrenched.
650: Matching newlines means that a pattern like
651: .Qq [^"]*
652: can match the entire input unless there's another quote in the input.
653: .It
654: A rule can have at most one instance of trailing context
655: (the
656: .Sq /
657: operator or the
658: .Sq $
659: operator).
660: The start condition,
661: .Sq ^ ,
662: and
663: .Qq <<EOF>>
664: patterns can only occur at the beginning of a pattern and, like
665: .Sq /
666: and
667: .Sq $ ,
668: cannot be grouped inside parentheses.
669: A
670: .Sq ^
671: which does not occur at the beginning of a rule or a
672: .Sq $
673: which does not occur at the end of a rule loses its special properties
674: and is treated as a normal character.
675: .It
1.1 deraadt 676: The following are illegal:
1.16 jmc 677: .Bd -unfilled -offset indent
678: foo/bar$
679: <sc1>foo<sc2>bar
680: .Ed
681: .Pp
682: Note that the first of these can be written
683: .Qq foo/bar\en .
684: .It
685: The following will result in
686: .Sq $
687: or
688: .Sq ^
689: being treated as a normal character:
690: .Bd -unfilled -offset indent
691: foo|(bar$)
692: foo|^bar
693: .Ed
694: .Pp
695: If what's wanted is a
696: .Qq foo
697: or a bar-followed-by-a-newline, the following could be used
698: (the special
699: .Sq |\&
700: action is explained below):
701: .Bd -unfilled -offset indent
702: foo |
703: bar$ /* action goes here */
704: .Ed
705: .Pp
1.1 deraadt 706: A similar trick will work for matching a foo or a
707: bar-at-the-beginning-of-a-line.
1.16 jmc 708: .El
709: .Sh HOW THE INPUT IS MATCHED
710: When the generated scanner is run,
711: it analyzes its input looking for strings which match any of its patterns.
712: If it finds more than one match,
713: it takes the one matching the most text
714: (for trailing context rules, this includes the length of the trailing part,
715: even though it will then be returned to the input).
716: If it finds two or more matches of the same length,
717: the rule listed first in the
718: .Nm
1.1 deraadt 719: input file is chosen.
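.Pp
The selection rule above (longest match first, then earliest rule)
can be sketched in C.
To keep the example short, each rule is assumed to be a literal string
rather than a regular expression, and trailing context is ignored;
the function name pick_rule is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pick the rule for the text at input.  Returns the index of the
 * winning rule, or -1 if no rule matches; the number of characters
 * matched is stored in *matched.
 */
int
pick_rule(const char *const *rules, size_t nrules, const char *input,
    size_t *matched)
{
	int best = -1;
	size_t best_len = 0;
	size_t i;

	assert(rules != NULL && input != NULL && matched != NULL);
	for (i = 0; i < nrules; i++) {
		size_t len = strlen(rules[i]);

		if (len == 0 || strncmp(input, rules[i], len) != 0)
			continue;
		/* Strictly longer wins; on a tie the earlier rule stays. */
		if (best == -1 || len > best_len) {
			best = (int)i;
			best_len = len;
		}
	}
	*matched = best_len;
	return best;
}
```

Given the rules "if", "iff", "if", and "i" in that order,
the input "iffy" selects the second rule with three characters matched,
while the input "if x" selects the first rule:
the third rule matches the same two characters but is listed later.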
1.16 jmc 720: .Pp
1.1 deraadt 721: Once the match is determined, the text corresponding to the match
722: (called the
1.16 jmc 723: .Em token )
1.1 deraadt 724: is made available in the global character pointer
1.16 jmc 725: .Fa yytext ,
1.1 deraadt 726: and its length in the global integer
1.16 jmc 727: .Fa yyleng .
1.1 deraadt 728: The
1.16 jmc 729: .Em action
730: corresponding to the matched pattern is then executed
731: .Pq a more detailed description of actions follows ,
732: and then the remaining input is scanned for another match.
733: .Pp
734: If no match is found, then the default rule is executed:
735: the next character in the input is considered matched and
736: copied to the standard output.
737: Thus, the simplest legal
738: .Nm
1.1 deraadt 739: input is:
1.16 jmc 740: .Pp
741: .D1 %%
742: .Pp
743: which generates a scanner that simply copies its input
744: .Pq one character at a time
745: to its output.
746: .Pp
1.1 deraadt 747: Note that
1.16 jmc 748: .Fa yytext
749: can be defined in two different ways:
750: either as a character pointer or as a character array.
751: Which definition
752: .Nm
753: uses can be controlled by including one of the special directives
754: .Dq %pointer
755: or
756: .Dq %array
757: in the first
758: .Pq definitions
759: section of flex input.
760: The default is
761: .Dq %pointer ,
762: unless the
763: .Fl l
764: lex compatibility option is used, in which case
765: .Fa yytext
1.1 deraadt 766: will be an array.
767: The advantage of using
1.16 jmc 768: .Dq %pointer
1.1 deraadt 769: is substantially faster scanning and no buffer overflow when matching
1.16 jmc 770: very large tokens
771: .Pq unless not enough dynamic memory is available .
772: The disadvantage is that actions are restricted in how they can modify
773: .Fa yytext
774: .Pq see the next section ,
775: and calls to the
776: .Fn unput
1.10 deraadt 777: function destroy the present contents of
1.16 jmc 778: .Fa yytext ,
1.1 deraadt 779: which can be a considerable porting headache when moving between different
1.16 jmc 780: .Nm lex
1.1 deraadt 781: versions.
1.16 jmc 782: .Pp
1.1 deraadt 783: The advantage of
1.16 jmc 784: .Dq %array
785: is that
786: .Fa yytext
787: can be modified as much as wanted, and calls to
788: .Fn unput
1.1 deraadt 789: do not destroy
1.16 jmc 790: .Fa yytext
791: .Pq see below .
792: Furthermore, existing
793: .Nm lex
1.1 deraadt 794: programs sometimes access
1.16 jmc 795: .Fa yytext
1.1 deraadt 796: externally using declarations of the form:
1.16 jmc 797: .Pp
798: .D1 extern char yytext[];
799: .Pp
1.1 deraadt 800: This definition is erroneous when used with
1.16 jmc 801: .Dq %pointer ,
1.1 deraadt 802: but correct for
1.16 jmc 803: .Dq %array .
804: .Pp
805: .Dq %array
1.1 deraadt 806: defines
1.16 jmc 807: .Fa yytext
1.1 deraadt 808: to be an array of
1.16 jmc 809: .Dv YYLMAX
810: characters, which defaults to a fairly large value.
811: The size can be changed by simply #define'ing
812: .Dv YYLMAX
813: to a different value in the first section of
814: .Nm
815: input.
816: As mentioned above, with
817: .Dq %pointer
818: yytext grows dynamically to accommodate large tokens.
819: While this means a
820: .Dq %pointer
821: scanner can accommodate very large tokens
822: .Pq such as matching entire blocks of comments ,
823: bear in mind that each time the scanner must resize
824: .Fa yytext
1.1 deraadt 825: it also must rescan the entire token from the beginning, so matching such
826: tokens can prove slow.
1.16 jmc 827: .Fa yytext
828: presently does not dynamically grow if a call to
829: .Fn unput
1.1 deraadt 830: results in too much text being pushed back; instead, a run-time error results.
1.16 jmc 831: .Pp
832: Also note that
833: .Dq %array
834: cannot be used with C++ scanner classes
835: .Pq the c++ option; see below .
836: .Sh ACTIONS
837: Each pattern in a rule has a corresponding action,
838: which can be any arbitrary C statement.
839: The pattern ends at the first non-escaped whitespace character;
840: the remainder of the line is its action.
841: If the action is empty,
842: then when the pattern is matched the input token is simply discarded.
843: For example, here is the specification for a program
844: which deletes all occurrences of
845: .Qq zap me
846: from its input:
847: .Bd -literal -offset indent
848: %%
849: "zap me"
850: .Ed
851: .Pp
1.1 deraadt 852: (It will copy all other characters in the input to the output since
853: they will be matched by the default rule.)
1.16 jmc 854: .Pp
1.1 deraadt 855: Here is a program which compresses multiple blanks and tabs down to
856: a single blank, and throws away whitespace found at the end of a line:
1.16 jmc 857: .Bd -literal -offset indent
858: %%
859: [ \et]+ putchar(' ');
860: [ \et]+$ /* ignore this token */
861: .Ed
862: .Pp
863: If the action contains a
864: .Sq { ,
865: then the action spans till the balancing
866: .Sq }
1.1 deraadt 867: is found, and the action may cross multiple lines.
1.16 jmc 868: .Nm
1.1 deraadt 869: knows about C strings and comments and won't be fooled by braces found
870: within them, but also allows actions to begin with
1.16 jmc 871: .Sq %{
1.1 deraadt 872: and will consider the action to be all the text up to the next
1.16 jmc 873: .Sq %}
874: .Pq regardless of ordinary braces inside the action .
875: .Pp
876: An action consisting solely of a vertical bar
877: .Pq Sq |\&
878: means
879: .Qq same as the action for the next rule .
880: See below for an illustration.
881: .Pp
882: Actions can include arbitrary C code,
883: including return statements to return a value to whatever routine called
884: .Fn yylex .
1.1 deraadt 885: Each time
1.16 jmc 886: .Fn yylex
887: is called, it continues processing tokens from where it last left off
888: until it either reaches the end of the file or executes a return.
889: .Pp
1.1 deraadt 890: Actions are free to modify
1.16 jmc 891: .Fa yytext
892: except for lengthening it
893: (adding characters to its end \- these will overwrite later characters in the
894: input stream).
895: This, however, does not apply when using
896: .Dq %array
897: .Pq see above ;
898: in that case,
899: .Fa yytext
1.1 deraadt 900: may be freely modified in any way.
1.16 jmc 901: .Pp
1.1 deraadt 902: Actions are free to modify
1.16 jmc 903: .Fa yyleng
1.1 deraadt 904: except they should not do so if the action also includes use of
1.16 jmc 905: .Fn yymore
906: .Pq see below .
907: .Pp
1.1 deraadt 908: There are a number of special directives which can be included within
909: an action:
1.16 jmc 910: .Bl -tag -width Ds
911: .It ECHO
912: Copies
913: .Fa yytext
914: to the scanner's output.
915: .It BEGIN
916: Followed by the name of a start condition, places the scanner in the
917: corresponding start condition
918: .Pq see below .
919: .It REJECT
920: Directs the scanner to proceed on to the
921: .Qq second best
922: rule which matched the input
923: .Pq or a prefix of the input .
924: The rule is chosen as described above in
925: .Sx HOW THE INPUT IS MATCHED ,
926: and
927: .Fa yytext
1.1 deraadt 928: and
1.16 jmc 929: .Fa yyleng
1.1 deraadt 930: set up appropriately.
931: It may either be one which matched as much text
932: as the originally chosen rule but came later in the
1.16 jmc 933: .Nm
1.1 deraadt 934: input file, or one which matched less text.
935: For example, the following will both count the
1.16 jmc 936: words in the input and call the routine
937: .Fn special
938: whenever
939: .Qq frob
940: is seen:
941: .Bd -literal -offset indent
942:         int word_count = 0;
943: %%
944:
945: frob special(); REJECT;
946: [^ \et\en]+ ++word_count;
947: .Ed
948: .Pp
1.1 deraadt 949: Without the
1.16 jmc 950: .Em REJECT ,
951: any "frob"'s in the input would not be counted as words,
952: since the scanner normally executes only one action per token.
1.1 deraadt 953: Multiple
1.16 jmc 954: .Em REJECT Ns 's
955: are allowed,
956: each one finding the next best choice to the currently active rule.
957: For example, when the following scanner scans the token
958: .Qq abcd ,
959: it will write
960: .Qq abcdabcaba
961: to the output:
962: .Bd -literal -offset indent
963: %%
964: a |
965: ab |
966: abc |
967: abcd ECHO; REJECT;
968: \&.|\en /* eat up any unmatched character */
969: .Ed
970: .Pp
1.1 deraadt 971: (The first three rules share the fourth's action since they use
1.16 jmc 972: the special
973: .Sq |\&
974: action.)
975: .Em REJECT
1.1 deraadt 976: is a particularly expensive feature in terms of scanner performance;
1.16 jmc 977: if it is used in any of the scanner's actions it will slow down
978: all of the scanner's matching.
979: Furthermore,
980: .Em REJECT
1.1 deraadt 981: cannot be used with the
1.16 jmc 982: .Fl Cf
1.1 deraadt 983: or
1.16 jmc 984: .Fl CF
985: options
986: .Pq see below .
987: .Pp
1.1 deraadt 988: Note also that unlike the other special actions,
1.16 jmc 989: .Em REJECT
1.1 deraadt 990: is a
1.16 jmc 991: .Em branch ;
992: code immediately following it in the action will not be executed.
993: .It yymore()
994: Tells the scanner that the next time it matches a rule, the corresponding
995: token should be appended onto the current value of
996: .Fa yytext
997: rather than replacing it.
998: For example, given the input
999: .Qq mega-kludge
1000: the following will write
1001: .Qq mega-mega-kludge
1002: to the output:
1003: .Bd -literal -offset indent
1004: %%
1005: mega- ECHO; yymore();
1006: kludge ECHO;
1007: .Ed
1008: .Pp
1009: First
1010: .Qq mega-
1011: is matched and echoed to the output.
1012: Then
1013: .Qq kludge
1014: is matched, but the previous
1015: .Qq mega-
1016: is still hanging around at the beginning of
1017: .Fa yytext
1.1 deraadt 1018: so the
1.16 jmc 1019: .Em ECHO
1020: for the
1021: .Qq kludge
1022: rule will actually write
1023: .Qq mega-kludge .
1024: .Pp
1.1 deraadt 1025: Two notes regarding use of
1.16 jmc 1026: .Fn yymore :
1.1 deraadt 1027: First,
1.16 jmc 1028: .Fn yymore
1.1 deraadt 1029: depends on the value of
1.16 jmc 1030: .Fa yyleng
1031: correctly reflecting the size of the current token, so
1032: .Fa yyleng
1033: must not be modified when using
1034: .Fn yymore .
1.1 deraadt 1035: Second, the presence of
1.16 jmc 1036: .Fn yymore
1.1 deraadt 1037: in the scanner's action entails a minor performance penalty in the
1038: scanner's matching speed.
1.16 jmc 1039: .It yyless(n)
1040: Returns all but the first
1041: .Ar n
1.1 deraadt 1042: characters of the current token back to the input stream, where they
1043: will be rescanned when the scanner looks for the next match.
1.16 jmc 1044: .Fa yytext
1.1 deraadt 1045: and
1.16 jmc 1046: .Fa yyleng
1.1 deraadt 1047: are adjusted appropriately (e.g.,
1.16 jmc 1048: .Fa yyleng
1.1 deraadt 1049: will now be equal to
1.16 jmc 1050: .Ar n ) .
1051: For example, on the input
1052: .Qq foobar
1053: the following will write out
1054: .Qq foobarbar :
1055: .Bd -literal -offset indent
1056: %%
1057: foobar ECHO; yyless(3);
1058: [a-z]+ ECHO;
1059: .Ed
1060: .Pp
1.1 deraadt 1061: An argument of 0 to
1.16 jmc 1062: .Fn yyless
1063: will cause the entire current input string to be scanned again.
1064: Unless the way the scanner subsequently processes its input has been changed
1065: (using
1066: .Em BEGIN ,
1067: for example),
1068: this will result in an endless loop.
1069: .Pp
1.1 deraadt 1070: Note that
1.16 jmc 1071: .Fn yyless
1072: is a macro and can only be used in the
1073: .Nm
1074: input file, not from other source files.
1075: .It unput(c)
1076: Puts the character
1077: .Ar c
1078: back into the input stream.
1079: It will be the next character scanned.
1.1 deraadt 1080: The following action will take the current token and cause it
1081: to be rescanned enclosed in parentheses.
1.16 jmc 1082: .Bd -literal -offset indent
1083: {
1084: int i;
1085: char *yycopy;
1086:
1087: /* Copy yytext because unput() trashes yytext */
1088: if ((yycopy = strdup(yytext)) == NULL)
1089: err(1, NULL);
1090: unput(')');
1091: for (i = yyleng - 1; i >= 0; --i)
1092: unput(yycopy[i]);
1093: unput('(');
1094: free(yycopy);
1095: }
1096: .Ed
1097: .Pp
1.1 deraadt 1098: Note that since each
1.16 jmc 1099: .Fn unput
1100: puts the given character back at the beginning of the input stream,
1101: pushing back strings must be done back-to-front.
1102: .Pp
1.1 deraadt 1103: An important potential problem when using
1.16 jmc 1104: .Fn unput
1105: is that if using
1106: .Dq %pointer
1107: .Pq the default ,
1108: a call to
1109: .Fn unput
1110: destroys the contents of
1111: .Fa yytext ,
1.1 deraadt 1112: starting with its rightmost character and devouring one character to
1.16 jmc 1113: the left with each call.
1114: If the value of
1115: .Fa yytext
1116: should be preserved after a call to
1117: .Fn unput
1118: .Pq as in the above example ,
1119: it must either first be copied elsewhere, or the scanner must be built using
1120: .Dq %array
1121: instead (see
1122: .Sx HOW THE INPUT IS MATCHED ) .
1123: .Pp
1124: Finally, note that EOF cannot be put back
1.1 deraadt 1125: in an attempt to mark the input stream with an end-of-file.
1.16 jmc 1126: .It input()
1127: Reads the next character from the input stream.
1128: For example, the following is one way to eat up C comments:
1129: .Bd -literal -offset indent
1130: %%
1131: "/*" {
1132: int c;
1133:
1134: for (;;) {
1135: while ((c = input()) != '*' && c != EOF)
1136: ; /* eat up text of comment */
1137:
1138: if (c == '*') {
1139: while ((c = input()) == '*')
1140: ;
1141: if (c == '/')
1142: break; /* found the end */
1143: }
1144:
1145: if (c == EOF) {
1146: errx(1, "EOF in comment");
1.1 deraadt 1147: break;
1148: }
1.16 jmc 1149: }
1150: }
1151: .Ed
1152: .Pp
1153: (Note that if the scanner is compiled using C++, then
1154: .Fn input
1.1 deraadt 1155: is instead referred to as
1.16 jmc 1156: .Fn yyinput ,
1157: in order to avoid a name clash with the C++ stream by the name of input.)
1158: .It YY_FLUSH_BUFFER
1159: Flushes the scanner's internal buffer
1160: so that the next time the scanner attempts to match a token,
1161: it will first refill the buffer using
1162: .Dv YY_INPUT
1163: (see
1164: .Sx THE GENERATED SCANNER ,
1165: below).
1166: This action is a special case of the more general
1167: .Fn yy_flush_buffer
1168: function, described below in the section
1169: .Sx MULTIPLE INPUT BUFFERS .
1170: .It yyterminate()
1171: Can be used in lieu of a return statement in an action.
1172: It terminates the scanner and returns a 0 to the scanner's caller, indicating
1173: .Qq all done .
1.1 deraadt 1174: By default,
1.16 jmc 1175: .Fn yyterminate
1176: is also called when an end-of-file is encountered.
1177: It is a macro and may be redefined.
1178: .El
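.Pp
The back-to-front rule for unput() described above can be illustrated
outside of a scanner.
The following is a minimal sketch in plain C, not the code flex generates:
struct pushback and the pb_* functions are hypothetical names that model
the pending input as a string which grows to the left, so that the most
recently pushed character is the next one read.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pending scanner input; not flex's own buffer. */
struct pushback {
	char	buf[64];
	size_t	start;		/* index of the next character to be read */
};

static void
pb_init(struct pushback *pb)
{
	pb->start = sizeof(pb->buf) - 1;
	pb->buf[pb->start] = 0;		/* empty, NUL-terminated */
}

/* Like unput(c): c becomes the next character to be read. */
static int
pb_unput(struct pushback *pb, int c)
{
	if (pb->start == 0)
		return -1;		/* no room left */
	pb->buf[--pb->start] = (char)c;
	return 0;
}

/* Push back "(token)" so that it is reread in its original order. */
static int
pb_unput_parenthesized(struct pushback *pb, const char *token)
{
	size_t i;

	if (pb_unput(pb, ')') == -1)
		return -1;
	for (i = strlen(token); i > 0; i--)
		if (pb_unput(pb, token[i - 1]) == -1)
			return -1;
	return pb_unput(pb, '(');
}

/* The characters that would be read next, in reading order. */
static const char *
pb_pending(const struct pushback *pb)
{
	return pb->buf + pb->start;
}
```

Pushing the closing parenthesis first and the opening one last mirrors
the order used in the unput() example above.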
1179: .Sh THE GENERATED SCANNER
1.1 deraadt 1180: The output of
1.16 jmc 1181: .Nm
1.1 deraadt 1182: is the file
1.16 jmc 1183: .Pa lex.yy.c ,
1.1 deraadt 1184: which contains the scanning routine
1.16 jmc 1185: .Fn yylex ,
1186: a number of tables used by it for matching tokens,
1187: and a number of auxiliary routines and macros.
1188: By default,
1189: .Fn yylex
1.1 deraadt 1190: is declared as follows:
1.16 jmc 1191: .Bd -unfilled -offset indent
1192: int yylex()
1193: {
1194: ... various definitions and the actions in here ...
1195: }
1196: .Ed
1197: .Pp
1198: (If the environment supports function prototypes, then it will
1199: be "int yylex(void)".)
1200: This definition may be changed by defining the
1201: .Dv YY_DECL
1202: macro.
1203: For example:
1204: .Bd -literal -offset indent
1205: #define YY_DECL float lexscan(a, b) float a, b;
1206: .Ed
1207: .Pp
1208: would give the scanning routine the name
1209: .Em lexscan ,
1210: returning a float, and taking two floats as arguments.
1211: Note that if arguments are given to the scanning routine using a
1212: K&R-style/non-prototyped function declaration,
1213: the definition must be terminated with a semicolon
1214: .Pq Sq ;\& .
1215: .Pp
1.1 deraadt 1216: Whenever
1.16 jmc 1217: .Fn yylex
1.1 deraadt 1218: is called, it scans tokens from the global input file
1.16 jmc 1219: .Pa yyin
1220: .Pq which defaults to stdin .
1221: It continues until it either reaches an end-of-file
1222: .Pq at which point it returns the value 0
1223: or one of its actions executes a
1224: .Em return
1.1 deraadt 1225: statement.
1.16 jmc 1226: .Pp
1.1 deraadt 1227: If the scanner reaches an end-of-file, subsequent calls are undefined
1228: unless either
1.16 jmc 1229: .Em yyin
1230: is pointed at a new input file
1231: .Pq in which case scanning continues from that file ,
1232: or
1233: .Fn yyrestart
1.1 deraadt 1234: is called.
1.16 jmc 1235: .Fn yyrestart
1.1 deraadt 1236: takes one argument, a
1.16 jmc 1237: .Fa FILE *
1238: pointer (which can be nil, if
1239: .Dv YY_INPUT
1240: has been set up to scan from a source other than
1241: .Em yyin ) ,
1.1 deraadt 1242: and initializes
1.16 jmc 1243: .Em yyin
1244: for scanning from that file.
1245: Essentially there is no difference between just assigning
1246: .Em yyin
1.1 deraadt 1247: to a new input file or using
1.16 jmc 1248: .Fn yyrestart
1249: to do so; the latter is available for compatibility with previous versions of
1250: .Nm ,
1.1 deraadt 1251: and because it can be used to switch input files in the middle of scanning.
1.16 jmc 1252: It can also be used to throw away the current input buffer,
1253: by calling it with an argument of
1254: .Em yyin ;
1.1 deraadt 1255: but it is better to use
1.16 jmc 1256: .Dv YY_FLUSH_BUFFER
1257: .Pq see above .
1.1 deraadt 1258: Note that
1.16 jmc 1259: .Fn yyrestart
1260: does not reset the start condition to
1261: .Em INITIAL
1262: (see
1263: .Sx START CONDITIONS ,
1264: below).
1265: .Pp
1.1 deraadt 1266: If
1.16 jmc 1267: .Fn yylex
1.1 deraadt 1268: stops scanning due to executing a
1.16 jmc 1269: .Em return
1.1 deraadt 1270: statement in one of the actions, the scanner may then be called again and it
1271: will resume scanning where it left off.
1.16 jmc 1272: .Pp
1273: By default
1274: .Pq and for purposes of efficiency ,
1275: the scanner uses block-reads rather than simple
1276: .Xr getc 3
1.1 deraadt 1277: calls to read characters from
1.16 jmc 1278: .Em yyin .
1.1 deraadt 1279: The nature of how it gets its input can be controlled by defining the
1.16 jmc 1280: .Dv YY_INPUT
1.1 deraadt 1281: macro.
1.16 jmc 1282: .Dv YY_INPUT Ns 's
1283: calling sequence is
1284: .Qq YY_INPUT(buf,result,max_size) .
1285: Its action is to place up to
1286: .Dv max_size
1.1 deraadt 1287: characters in the character array
1.16 jmc 1288: .Em buf
1.1 deraadt 1289: and return in the integer variable
1.16 jmc 1290: .Em result
1291: either the number of characters read or the constant
1292: .Dv YY_NULL
1293: (0 on
1294: .Ux
1295: systems)
1296: to indicate
1297: .Dv EOF .
1298: The default
1299: .Dv YY_INPUT
1300: reads from the global file-pointer
1301: .Qq yyin .
1302: .Pp
1303: A sample definition of
1304: .Dv YY_INPUT
1305: .Pq in the definitions section of the input file :
1306: .Bd -unfilled -offset indent
1307: %{
1308: #define YY_INPUT(buf,result,max_size) \e
1309: { \e
1310: int c = getchar(); \e
1311: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1312: }
1313: %}
1314: .Ed
1315: .Pp
1.1 deraadt 1316: This definition will change the input processing to occur
1317: one character at a time.
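.Pp
The same calling contract can be met by a source that is not a file at all.
The following is a minimal sketch, assuming a hypothetical struct mem_source
and helper mem_read() that are not part of flex; it copies up to max_size
characters per call and yields 0, the value given above for YY_NULL on UNIX
systems, once the data is exhausted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; the names are illustrative only. */
struct mem_source {
	const char	*data;
	size_t		 len;
	size_t		 pos;	/* number of characters already handed out */
};

/*
 * Copy up to max_size characters into buf and return the number copied,
 * or 0 once the source is exhausted.
 */
static size_t
mem_read(struct mem_source *src, char *buf, size_t max_size)
{
	size_t n = src->len - src->pos;

	if (n > max_size)
		n = max_size;
	memcpy(buf, src->data + src->pos, n);
	src->pos += n;
	return n;
}
```

A scanner might then define YY_INPUT(buf,result,max_size) so that it
assigns the return value of mem_read() to result; the exact form of such
a definition is left as an assumption here.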
1.16 jmc 1318: .Pp
1319: When the scanner receives an end-of-file indication from
1320: .Dv YY_INPUT ,
1.1 deraadt 1321: it then checks the
1.16 jmc 1322: .Fn yywrap
1323: function.
1324: If
1325: .Fn yywrap
1326: returns false
1327: .Pq zero ,
1328: then it is assumed that the function has gone ahead and set up
1329: .Em yyin
1330: to point to another input file, and scanning continues.
1331: If it returns true
1332: .Pq non-zero ,
1333: then the scanner terminates, returning 0 to its caller.
1334: Note that in either case, the start condition remains unchanged;
1335: it does not revert to
1336: .Em INITIAL .
1337: .Pp
1.1 deraadt 1338: If you do not supply your own version of
1.16 jmc 1339: .Fn yywrap ,
1.1 deraadt 1340: then you must either use
1.16 jmc 1341: .Dq %option noyywrap
1.1 deraadt 1342: (in which case the scanner behaves as though
1.16 jmc 1343: .Fn yywrap
1.1 deraadt 1344: returned 1), or you must link with
1.16 jmc 1345: .Fl lfl
1.1 deraadt 1346: to obtain the default version of the routine, which always returns 1.
1.16 jmc 1347: .Pp
1.1 deraadt 1348: Three routines are available for scanning from in-memory buffers rather
1349: than files:
1.16 jmc 1350: .Fn yy_scan_string ,
1351: .Fn yy_scan_bytes ,
1.1 deraadt 1352: and
1.16 jmc 1353: .Fn yy_scan_buffer .
1354: See the discussion of them below in the section
1355: .Sx MULTIPLE INPUT BUFFERS .
1356: .Pp
1.1 deraadt 1357: The scanner writes its
1.16 jmc 1358: .Em ECHO
1.1 deraadt 1359: output to the
1.16 jmc 1360: .Em yyout
1361: global
1362: .Pq default, stdout ,
1363: which may be redefined by the user simply by assigning it to some other
1364: .Va FILE
1.1 deraadt 1365: pointer.
1.16 jmc 1366: .Sh START CONDITIONS
1367: .Nm
1368: provides a mechanism for conditionally activating rules.
1369: Any rule whose pattern is prefixed with
1370: .Qq Aq sc
1371: will only be active when the scanner is in the start condition named
1372: .Qq sc .
1373: For example,
1374: .Bd -literal -offset indent
1375: <STRING>[^"]* { /* eat up the string body ... */
1376: ...
1377: }
1378: .Ed
1379: .Pp
1380: will be active only when the scanner is in the
1381: .Qq STRING
1382: start condition, and
1383: .Bd -literal -offset indent
1384: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1385: ...
1386: }
1387: .Ed
1388: .Pp
1389: will be active only when the current start condition is either
1390: .Qq INITIAL ,
1391: .Qq STRING ,
1392: or
1393: .Qq QUOTE .
1394: .Pp
1395: Start conditions are declared in the definitions
1396: .Pq first
1397: section of the input using unindented lines beginning with either
1398: .Sq %s
1.1 deraadt 1399: or
1.16 jmc 1400: .Sq %x
1.1 deraadt 1401: followed by a list of names.
1402: The former declares
1.16 jmc 1403: .Em inclusive
1.1 deraadt 1404: start conditions, the latter
1.16 jmc 1405: .Em exclusive
1406: start conditions.
1407: A start condition is activated using the
1408: .Em BEGIN
1409: action.
1410: Until the next
1411: .Em BEGIN
1412: action is executed, rules with the given start condition will be active and
1.1 deraadt 1413: rules with other start conditions will be inactive.
1.16 jmc 1414: If the start condition is inclusive,
1.1 deraadt 1415: then rules with no start conditions at all will also be active.
1.16 jmc 1416: If it is exclusive,
1417: then only rules qualified with the start condition will be active.
1.1 deraadt 1418: A set of rules contingent on the same exclusive start condition
1419: describe a scanner which is independent of any of the other rules in the
1.16 jmc 1420: .Nm
1421: input.
1422: Because of this, exclusive start conditions make it easy to specify
1423: .Qq mini-scanners
1.1 deraadt 1424: which scan portions of the input that are syntactically different
1.16 jmc 1425: from the rest
1426: .Pq e.g., comments .
1427: .Pp
1.1 deraadt 1428: If the distinction between inclusive and exclusive start conditions
1429: is still a little vague, here's a simple example illustrating the
1.16 jmc 1430: connection between the two.
1431: The set of rules:
1432: .Bd -literal -offset indent
1433: %s example
1434: %%
1435:
1436: <example>foo do_something();
1437:
1438: bar something_else();
1439: .Ed
1440: .Pp
1.1 deraadt 1441: is equivalent to
1.16 jmc 1442: .Bd -literal -offset indent
1443: %x example
1444: %%
1445:
1446: <example>foo do_something();
1447:
1448: <INITIAL,example>bar something_else();
1449: .Ed
1450: .Pp
1.1 deraadt 1451: Without the
1.16 jmc 1452: .Aq INITIAL,example
1.1 deraadt 1453: qualifier, the
1.16 jmc 1454: .Dq bar
1455: pattern in the second example wouldn't be active
1456: .Pq i.e., couldn't match
1.1 deraadt 1457: when in start condition
1.16 jmc 1458: .Dq example .
1.1 deraadt 1459: If we just used
1.16 jmc 1460: .Aq example
1.1 deraadt 1461: to qualify
1.16 jmc 1462: .Dq bar ,
1.1 deraadt 1463: though, then it would only be active in
1.16 jmc 1464: .Dq example
1.1 deraadt 1465: and not in
1.16 jmc 1466: .Em INITIAL ,
1467: while in the first example it's active in both,
1468: because in the first example the
1469: .Dq example
1470: start condition is an inclusive
1471: .Pq Sq %s
1.1 deraadt 1472: start condition.
1.16 jmc 1473: .Pp
1.1 deraadt 1474: Also note that the special start-condition specifier
1.16 jmc 1475: .Sq Aq *
1476: matches every start condition.
1477: Thus, the above example could also have been written:
1478: .Bd -literal -offset indent
1479: %x example
1480: %%
1481:
1482: <example>foo do_something();
1483:
1484: <*>bar something_else();
1485: .Ed
1486: .Pp
1.1 deraadt 1487: The default rule (to
1.16 jmc 1488: .Em ECHO
1489: any unmatched character) remains active in start conditions.
1490: It is equivalent to:
1491: .Bd -literal -offset indent
1492: <*>.|\en ECHO;
1493: .Ed
1494: .Pp
1495: .Dq BEGIN(0)
1.1 deraadt 1496: returns to the original state where only the rules with
1.16 jmc 1497: no start conditions are active.
1498: This state can also be referred to as the start-condition
1499: .Em INITIAL ,
1500: so
1501: .Dq BEGIN(INITIAL)
1.1 deraadt 1502: is equivalent to
1.16 jmc 1503: .Dq BEGIN(0) .
1.1 deraadt 1504: (The parentheses around the start condition name are not required but
1505: are considered good style.)
1.16 jmc 1506: .Pp
1507: .Em BEGIN
1.1 deraadt 1508: actions can also be given as indented code at the beginning
1.16 jmc 1509: of the rules section.
1510: For example, the following will cause the scanner to enter the
1511: .Qq SPECIAL
1512: start condition whenever
1513: .Fn yylex
1.1 deraadt 1514: is called and the global variable
1.16 jmc 1515: .Fa enter_special
1.1 deraadt 1516: is true:
1.16 jmc 1517: .Bd -literal -offset indent
1518: int enter_special;
1.1 deraadt 1519:
1.16 jmc 1520: %x SPECIAL
1521: %%
1522: if (enter_special)
1.1 deraadt 1523: BEGIN(SPECIAL);
1524:
1.16 jmc 1525: <SPECIAL>blahblahblah
1526: \&...more rules follow...
1527: .Ed
1528: .Pp
1.1 deraadt 1529: To illustrate the uses of start conditions,
1530: here is a scanner which provides two different interpretations
1.16 jmc 1531: of a string like
1532: .Qq 123.456 .
1533: By default it will treat it as three tokens: the integer
1534: .Qq 123 ,
1535: a dot
1536: .Pq Sq .\& ,
1537: and the integer
1538: .Qq 456 .
1.1 deraadt 1539: But if the string is preceded earlier in the line by the string
1.16 jmc 1540: .Qq expect-floats
1541: it will treat it as a single token, the floating-point number 123.456:
1542: .Bd -literal -offset indent
1543: %{
1544: #include <stdlib.h>
1545: %}
1546: %s expect
1547:
1548: %%
1549: expect-floats BEGIN(expect);
1550:
1551: <expect>[0-9]+"."[0-9]+ {
1552: printf("found a float, = %f\en",
1553: atof(yytext));
1554: }
1555: <expect>\en {
1556: /*
1557: * That's the end of the line, so
1558: * we need another "expect-floats"
1559: * before we'll recognize any more
1560: * numbers.
1561: */
1562: BEGIN(INITIAL);
1563: }
1564:
1565: [0-9]+ {
1566: printf("found an integer, = %d\en",
1567: atoi(yytext));
1568: }
1569:
1570: "." printf("found a dot\en");
1571: .Ed
1572: .Pp
1573: Here is a scanner which recognizes
1574: .Pq and discards
1575: C comments while maintaining a count of the current input line:
1576: .Bd -literal -offset indent
1577: %x comment
1578: %%
1579: int line_num = 1;
1580:
1581: "/*" BEGIN(comment);
1582:
1583: <comment>[^*\en]* /* eat anything that's not a '*' */
1584: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1585: <comment>\en ++line_num;
1586: <comment>"*"+"/" BEGIN(INITIAL);
1587: .Ed
1588: .Pp
1.1 deraadt 1589: This scanner goes to a bit of trouble to match as much
1.16 jmc 1590: text as possible with each rule.
1591: In general, when attempting to write a high-speed scanner
1592: try to match as much as possible in each rule, as it's a big win.
1593: .Pp
1.10 deraadt 1594: Note that start-condition names are really integer values and
1.16 jmc 1595: can be stored as such.
1596: Thus, the above could be extended in the following fashion:
1597: .Bd -literal -offset indent
1598: %x comment foo
1599: %%
1600: int line_num = 1;
1601: int comment_caller;
1602:
1603: "/*" {
1604: comment_caller = INITIAL;
1605: BEGIN(comment);
1606: }
1607:
1608: \&...
1609:
1610: <foo>"/*" {
1611: comment_caller = foo;
1612: BEGIN(comment);
1613: }
1614:
1615: <comment>[^*\en]* /* eat anything that's not a '*' */
1616: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1617: <comment>\en ++line_num;
1618: <comment>"*"+"/" BEGIN(comment_caller);
1619: .Ed
1620: .Pp
1621: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1622: the integer-valued
1.16 jmc 1623: .Dv YY_START
1624: macro.
1625: For example, the above assignments to
1626: .Em comment_caller
1.1 deraadt 1627: could instead be written
1.16 jmc 1628: .Pp
1629: .Dl comment_caller = YY_START;
1630: .Pp
1.1 deraadt 1631: Flex provides
1.16 jmc 1632: .Dv YYSTATE
1.1 deraadt 1633: as an alias for
1.16 jmc 1634: .Dv YY_START
1.1 deraadt 1635: (since that is what's used by AT&T
1.16 jmc 1636: .Nm lex ) .
1637: .Pp
1638: Note that start conditions do not have their own name-space;
1639: %s's and %x's declare names in the same fashion as #define's.
1640: .Pp
1.1 deraadt 1641: Finally, here's an example of how to match C-style quoted strings using
1.16 jmc 1642: exclusive start conditions, including expanded escape sequences
1643: (but not including checking for a string that's too long):
1644: .Bd -literal -offset indent
1645: %x str
1646:
1647: %%
1648: #define MAX_STR_CONST 1024
1649: char string_buf[MAX_STR_CONST];
1650: char *string_buf_ptr;
1651:
1652: \e" string_buf_ptr = string_buf; BEGIN(str);
1653:
1654: <str>\e" { /* saw closing quote - all done */
1655: BEGIN(INITIAL);
1656: *string_buf_ptr = '\e0';
1657: /*
1658: * return string constant token type and
1659: * value to parser
1660: */
1661: }
1662:
1663: <str>\en {
1664: /* error - unterminated string constant */
1665: /* generate error message */
1666: }
1667:
1668: <str>\e\e[0-7]{1,3} {
1669: /* octal escape sequence */
1670: unsigned int result;
1671:
1672: (void) sscanf(yytext + 1, "%o", &result);
1673:
1674: if (result > 0xff) {
1675: /* error, constant is out-of-bounds */
1676: } else
1677: *string_buf_ptr++ = result;
1678: }
1679:
1680: <str>\e\e[0-9]+ {
1681: /*
1682: * generate error - bad escape sequence; something
1683: * like '\e48' or '\e0777777'
1684: */
1685: }
1686:
1687: <str>\e\en *string_buf_ptr++ = '\en';
1688: <str>\e\et *string_buf_ptr++ = '\et';
1689: <str>\e\er *string_buf_ptr++ = '\er';
1690: <str>\e\eb *string_buf_ptr++ = '\eb';
1691: <str>\e\ef *string_buf_ptr++ = '\ef';
1692:
1693: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1694:
1695: <str>[^\e\e\en\e"]+ {
1696: char *yptr = yytext;
1697:
1698: while (*yptr)
1699: *string_buf_ptr++ = *yptr++;
1700: }
1701: .Ed
1702: .Pp
1703: Often, such as in some of the examples above,
1704: a whole bunch of rules are all preceded by the same start condition(s).
1705: .Nm
1.1 deraadt 1706: makes this a little easier and cleaner by introducing a notion of
1707: start condition
1.16 jmc 1708: .Em scope .
1.1 deraadt 1709: A start condition scope is begun with:
1.16 jmc 1710: .Pp
1711: .Dl <SCs>{
1712: .Pp
1.1 deraadt 1713: where
1.16 jmc 1714: .Dq SCs
1715: is a list of one or more start conditions.
1716: Inside the start condition scope, every rule automatically has the prefix
1717: .Aq SCs
1.1 deraadt 1718: applied to it, until a
1.16 jmc 1719: .Sq }
1.1 deraadt 1720: which matches the initial
1.16 jmc 1721: .Sq { .
1.1 deraadt 1722: So, for example,
1.16 jmc 1723: .Bd -literal -offset indent
1724: <ESC>{
1725: "\e\en" return '\en';
1726: "\e\er" return '\er';
1727: "\e\ef" return '\ef';
1728: "\e\e0" return '\e0';
1729: }
1730: .Ed
1731: .Pp
1.1 deraadt 1732: is equivalent to:
1.16 jmc 1733: .Bd -literal -offset indent
1734: <ESC>"\e\en" return '\en';
1735: <ESC>"\e\er" return '\er';
1736: <ESC>"\e\ef" return '\ef';
1737: <ESC>"\e\e0" return '\e0';
1738: .Ed
1739: .Pp
1.1 deraadt 1740: Start condition scopes may be nested.
1.16 jmc 1741: .Pp
1.1 deraadt 1742: Three routines are available for manipulating stacks of start conditions:
1.16 jmc 1743: .Bl -tag -width Ds
1744: .It void yy_push_state(int new_state)
1745: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1746: stack and switches to
1.16 jmc 1747: .Fa new_state
1748: as though
1749: .Dq BEGIN new_state
1750: had been used
1751: .Pq recall that start condition names are also integers .
1752: .It void yy_pop_state()
1753: Pops the top of the stack and switches to it via
1754: .Em BEGIN .
1755: .It int yy_top_state()
1756: Returns the top of the stack without altering the stack's contents.
1757: .El
1758: .Pp
1.1 deraadt 1759: The start condition stack grows dynamically and so has no built-in
1.16 jmc 1760: size limitation.
1761: If memory is exhausted, program execution aborts.
1762: .Pp
1763: To use start condition stacks, scanners must include a
1764: .Dq %option stack
1765: directive (see
1766: .Sx OPTIONS
1767: below).
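.Pp
The behaviour of the three stack routines can be modelled in a few lines
of C.
The following is a minimal sketch and not the code flex generates:
struct sc_state and the sc_* functions are hypothetical names, start
conditions are treated as plain integers, and allocation failure is
reported to the caller instead of aborting the program.

```c
#include <stdlib.h>

/* Hypothetical model of a scanner's start-condition state. */
struct sc_state {
	int	 current;	/* plays the role of YY_START */
	int	*stack;		/* saved start conditions */
	size_t	 depth;		/* number of saved entries */
	size_t	 cap;		/* allocated entries */
};

/* Save the current condition, then switch to new_state. */
static int
sc_push(struct sc_state *s, int new_state)
{
	if (s->depth == s->cap) {
		size_t ncap = s->cap ? s->cap * 2 : 8;
		int *p = realloc(s->stack, ncap * sizeof(*p));

		if (p == NULL)
			return -1;
		s->stack = p;
		s->cap = ncap;
	}
	s->stack[s->depth++] = s->current;
	s->current = new_state;
	return 0;
}

/* Switch back to the most recently saved condition. */
static int
sc_pop(struct sc_state *s)
{
	if (s->depth == 0)
		return -1;
	s->current = s->stack[--s->depth];
	return 0;
}

/* Return the most recently saved condition without changing anything. */
static int
sc_top(const struct sc_state *s)
{
	return s->depth ? s->stack[s->depth - 1] : -1;
}
```

Starting from condition 0, pushing 1 and then 2 leaves 2 current with 1 on
top of the stack; two pops return to condition 0.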
1768: .Sh MULTIPLE INPUT BUFFERS
1769: Some scanners
1770: (such as those which support
1771: .Qq include
1772: files)
1773: require reading from several input streams.
1774: As
1775: .Nm
1.1 deraadt 1776: scanners do a large amount of buffering, one cannot control
1777: where the next input will be read from by simply writing a
1.16 jmc 1778: .Dv YY_INPUT
1.1 deraadt 1779: which is sensitive to the scanning context.
1.16 jmc 1780: .Dv YY_INPUT
1.1 deraadt 1781: is only called when the scanner reaches the end of its buffer, which
1.16 jmc 1782: may be a long time after scanning a statement such as an
1783: .Qq include
1.1 deraadt 1784: which requires switching the input source.
1.16 jmc 1785: .Pp
1.1 deraadt 1786: To negotiate these sorts of problems,
1.16 jmc 1787: .Nm
1.1 deraadt 1788: provides a mechanism for creating and switching between multiple
1.16 jmc 1789: input buffers.
1790: An input buffer is created by using:
1791: .Pp
1792: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1793: .Pp
1.1 deraadt 1794: which takes a
1.16 jmc 1795: .Fa FILE
1796: pointer and a
1797: .Fa size
1798: and creates a buffer associated with the given file and large enough to hold
1799: .Fa size
1.1 deraadt 1800: characters (when in doubt, use
1.16 jmc 1801: .Dv YY_BUF_SIZE
1802: for the size).
1803: It returns a
1804: .Dv YY_BUFFER_STATE
1805: handle, which may then be passed to other routines
1806: .Pq see below .
1807: The
1808: .Dv YY_BUFFER_STATE
1.1 deraadt 1809: type is a pointer to an opaque
1.16 jmc 1810: .Dq struct yy_buffer_state
1811: structure, so
1812: .Dv YY_BUFFER_STATE
1813: variables may be safely initialized to
1814: .Dq ((YY_BUFFER_STATE) 0)
1815: if desired, and the opaque structure can also be referred to in order to
1816: correctly declare input buffers in source files other than that of the scanner.
1817: Note that the
1818: .Fa FILE
1.1 deraadt 1819: pointer in the call to
1.16 jmc 1820: .Fn yy_create_buffer
1.1 deraadt 1821: is only used as the value of
1.16 jmc 1822: .Fa yyin
1.1 deraadt 1823: seen by
1.16 jmc 1824: .Dv YY_INPUT ;
1825: if
1826: .Dv YY_INPUT
1827: is redefined so that it no longer uses
1828: .Fa yyin ,
1829: then a nil
1830: .Fa FILE
1831: pointer can safely be passed to
1832: .Fn yy_create_buffer .
1833: To select a particular buffer to scan:
1834: .Pp
1835: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1836: .Pp
1837: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1838: come from
1.16 jmc 1839: .Fa new_buffer .
1.1 deraadt 1840: Note that
1.16 jmc 1841: .Fn yy_switch_to_buffer
1842: may be used by
1843: .Fn yywrap
1844: to set things up for continued scanning,
1845: instead of opening a new file and pointing
1846: .Fa yyin
1847: at it.
1848: Note also that switching input sources via either
1849: .Fn yy_switch_to_buffer
1850: or
1851: .Fn yywrap
1852: does not change the start condition.
1853: .Pp
1854: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1855: .Pp
1856: is used to reclaim the storage associated with a buffer.
1857: .Pf ( Fa buffer
1.1 deraadt 1858: can be nil, in which case the routine does nothing.)
1.16 jmc 1859: To clear the current contents of a buffer:
1860: .Pp
1861: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1862: .Pp
1.1 deraadt 1863: This function discards the buffer's contents,
1.16 jmc 1864: so the next time the scanner attempts to match a token from the buffer,
1865: it will first fill the buffer anew using
1866: .Dv YY_INPUT .
1867: .Pp
1868: .Fn yy_new_buffer
1.1 deraadt 1869: is an alias for
1.16 jmc 1870: .Fn yy_create_buffer ,
1.1 deraadt 1871: provided for compatibility with the C++ use of
1.16 jmc 1872: .Em new
1.1 deraadt 1873: and
1.16 jmc 1874: .Em delete
1.1 deraadt 1875: for creating and destroying dynamic objects.
1.16 jmc 1876: .Pp
1.1 deraadt 1877: Finally, the
1.16 jmc 1878: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1879: macro returns a
1.16 jmc 1880: .Dv YY_BUFFER_STATE
1.1 deraadt 1881: handle to the current buffer.
1.16 jmc 1882: .Pp
1.1 deraadt 1883: Here is an example of using these features for writing a scanner
1884: which expands include files (the
1.16 jmc 1885: .Aq Aq EOF
1.1 deraadt 1886: feature is discussed below):
1.16 jmc 1887: .Bd -literal -offset indent
1888: /*
1889: * the "incl" state is used for picking up the name
1890: * of an include file
1891: */
1892: %x incl
1893:
1894: %{
1895: #define MAX_INCLUDE_DEPTH 10
1896: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1897: int include_stack_ptr = 0;
1898: %}
1899:
1900: %%
1901: include BEGIN(incl);
1902:
1903: [a-z]+ ECHO;
1904: [^a-z\en]*\en? ECHO;
1905:
1906: <incl>[ \et]* /* eat the whitespace */
1907: <incl>[^ \et\en]+ { /* got the include file name */
1908: if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1909: errx(1, "Includes nested too deeply");
1910:
1911: include_stack[include_stack_ptr++] =
1912: YY_CURRENT_BUFFER;
1913:
1914: yyin = fopen(yytext, "r");
1915:
1916: if (yyin == NULL)
1917: err(1, NULL);
1.1 deraadt 1918:
1.16 jmc 1919: yy_switch_to_buffer(
1920: yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1921:
1.16 jmc 1922: BEGIN(INITIAL);
1923: }
1.1 deraadt 1924:
1.16 jmc 1925: <<EOF>> {
1926: if (--include_stack_ptr < 0)
1.1 deraadt 1927: yyterminate();
1.16 jmc 1928: else {
1929: yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1930: yy_switch_to_buffer(
1.16 jmc 1931: include_stack[include_stack_ptr]);
1932: }
1933: }
1934: .Ed
1935: .Pp
1.1 deraadt 1936: Three routines are available for setting up input buffers for
1.16 jmc 1937: scanning in-memory strings instead of files.
1938: All of them create a new input buffer for scanning the string,
1939: and return a corresponding
1940: .Dv YY_BUFFER_STATE
1941: handle (which should be deleted afterwards using
1942: .Fn yy_delete_buffer ) .
1943: They also switch to the new buffer using
1944: .Fn yy_switch_to_buffer ,
1.1 deraadt 1945: so the next call to
1.16 jmc 1946: .Fn yylex
1.1 deraadt 1947: will start scanning the string.
1.16 jmc 1948: .Bl -tag -width Ds
1949: .It yy_scan_string(const char *str)
1950: Scans a NUL-terminated string.
1951: .It yy_scan_bytes(const char *bytes, int len)
1952: Scans
1953: .Fa len
1954: bytes
1955: .Pq including possibly NUL's
1.1 deraadt 1956: starting at location
1.16 jmc 1957: .Fa bytes .
1958: .El
1959: .Pp
1960: Note that both of these functions create and scan a copy
1961: of the string or bytes.
1962: (This may be desirable, since
1963: .Fn yylex
1964: modifies the contents of the buffer it is scanning.)
1965: The copy can be avoided by using:
1966: .Bl -tag -width Ds
1967: .It yy_scan_buffer(char *base, yy_size_t size)
1968: Which scans the buffer starting at
1969: .Fa base ,
1.1 deraadt 1970: consisting of
1.16 jmc 1971: .Fa size
1972: bytes, the last two bytes of which must be
1973: .Dv YY_END_OF_BUFFER_CHAR
1974: .Pq ASCII NUL .
1975: These last two bytes are not scanned; thus, scanning consists of
1976: base[0] through base[size-3], inclusive.
1977: .Pp
1978: If
1979: .Fa base
1980: is not set up in this manner
1981: (i.e., without the final two
1982: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 1983: bytes), then
1.16 jmc 1984: .Fn yy_scan_buffer
1.1 deraadt 1985: returns a nil pointer instead of creating a new input buffer.
1.16 jmc 1986: .Pp
1.1 deraadt 1987: The type
1.16 jmc 1988: .Fa yy_size_t
1989: is an integral type to which an integer expression
1.1 deraadt 1990: reflecting the size of the buffer can be cast.
1.16 jmc 1991: .El
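.Pp
The buffer layout required by yy_scan_buffer() can be prepared as follows.
This is a minimal sketch: make_scan_buffer() is a hypothetical helper,
not part of flex, and it copies its input only to keep the example short.
A program that wants to avoid the copy would read its data directly into
a buffer with two spare bytes at the end.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd buffer holding the len bytes at src followed by
 * two NUL bytes, and store the total size (len + 2) in *sizep.
 * Returns NULL on failure.
 */
static char *
make_scan_buffer(const char *src, size_t len, size_t *sizep)
{
	char *base;

	if (len > (size_t)-1 - 2)
		return NULL;		/* len + 2 would overflow */
	if ((base = malloc(len + 2)) == NULL)
		return NULL;
	memcpy(base, src, len);
	base[len] = 0;			/* first end-of-buffer byte */
	base[len + 1] = 0;		/* second end-of-buffer byte */
	*sizep = len + 2;
	return base;
}
```

The returned pointer and the stored size would then be passed to
yy_scan_buffer() as base and size.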
1992: .Sh END-OF-FILE RULES
1993: The special rule
1994: .Qq Aq Aq EOF
1995: indicates actions which are to be taken when an end-of-file is encountered and
1996: .Fn yywrap
1997: returns non-zero
1998: .Pq i.e., indicates no further files to process .
1999: The action must finish by doing one of four things:
2000: .Bl -dash
2001: .It
2002: Assigning
2003: .Em yyin
2004: to a new input file
2005: (in previous versions of
2006: .Nm ,
2007: after doing the assignment, it was necessary to call the special action
2008: .Dv YY_NEW_FILE ;
2009: this is no longer necessary).
2010: .It
2011: Executing a
2012: .Em return
2013: statement.
2014: .It
2015: Executing the special
2016: .Fn yyterminate
2017: action.
2018: .It
2019: Switching to a new buffer using
2020: .Fn yy_switch_to_buffer
1.1 deraadt 2021: as shown in the example above.
1.16 jmc 2022: .El
2023: .Pp
2024: .Aq Aq EOF
2025: rules may not be used with other patterns;
2026: they may only be qualified with a list of start conditions.
2027: If an unqualified
2028: .Aq Aq EOF
2029: rule is given, it applies to all start conditions which do not already have
2030: .Aq Aq EOF
2031: actions.
2032: To specify an
2033: .Aq Aq EOF
2034: rule for only the initial start condition, use
2035: .Pp
2036: .Dl <INITIAL><<EOF>>
2037: .Pp
1.1 deraadt 2038: These rules are useful for catching things like unclosed comments.
2039: An example:
1.16 jmc 2040: .Bd -literal -offset indent
2041: %x quote
2042: %%
2043:
2044: \&...other rules for dealing with quotes...
2045:
2046: <quote><<EOF>> {
2047: error("unterminated quote");
2048: yyterminate();
2049: }
2050: <<EOF>> {
2051: if (*++filelist)
2052: yyin = fopen(*filelist, "r");
2053: else
2054: yyterminate();
2055: }
2056: .Ed
2057: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2058: The macro
1.16 jmc 2059: .Dv YY_USER_ACTION
1.1 deraadt 2060: can be defined to provide an action
1.16 jmc 2061: which is always executed prior to the matched rule's action.
2062: For example,
1.1 deraadt 2063: it could be #define'd to call a routine to convert yytext to lower-case.
2064: When
1.16 jmc 2065: .Dv YY_USER_ACTION
1.1 deraadt 2066: is invoked, the variable
1.16 jmc 2067: .Fa yy_act
2068: gives the number of the matched rule
2069: .Pq rules are numbered starting with 1 .
2070: For example, to profile how often each rule is matched,
2071: the following would do the trick:
2072: .Pp
2073: .Dl #define YY_USER_ACTION ++ctr[yy_act]
2074: .Pp
1.1 deraadt 2075: where
1.16 jmc 2076: .Fa ctr
2077: is an array to hold the counts for the different rules.
2078: Note that the macro
2079: .Dv YY_NUM_RULES
2080: gives the total number of rules
2081: (including the default rule, even if
2082: .Fl s
2083: is used),
1.1 deraadt 2084: so a correct declaration for
1.16 jmc 2085: .Fa ctr
1.1 deraadt 2086: is:
1.16 jmc 2087: .Pp
2088: .Dl int ctr[YY_NUM_RULES];
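.Pp
As a minimal sketch of the profiling idea, the fragment below
counts matches per rule and prints the totals at end-of-file.
The function
.Fn report
and the patterns are illustrative only.
The array is given one spare slot on the assumption that,
since rules are numbered starting with 1,
.Fa yy_act
may be as large as
.Dv YY_NUM_RULES ,
and the macro body ends in a semicolon so that it forms a complete
statement wherever the scanner expands it:
.Bd -literal -offset indent
%{
#include <stdio.h>

/* Hypothetical counters; slot 0 is unused. */
static int ctr[YY_NUM_RULES + 1];
static void report(void);

#define YY_USER_ACTION ++ctr[yy_act];
%}
%option noyywrap
%%
[0-9]+      ;
[a-z]+      ;
\&.|\en        ;
<<EOF>>     report(); yyterminate();
%%
/* Print how often each numbered rule was matched. */
static void
report(void)
{
	int i;

	for (i = 1; i <= YY_NUM_RULES; i++)
		printf("rule %d: %d\en", i, ctr[i]);
}
.Ed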
2089: .Pp
1.1 deraadt 2090: The macro
1.16 jmc 2091: .Dv YY_USER_INIT
1.1 deraadt 2092: may be defined to provide an action which is always executed before
1.16 jmc 2093: the first scan
2094: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2095: For example, it could be used to call a routine to read
2096: in a data table or open a logging file.
1.16 jmc 2097: .Pp
1.1 deraadt 2098: The macro
1.16 jmc 2099: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2100: can be used to control whether the current buffer is considered
1.16 jmc 2101: .Em interactive .
1.1 deraadt 2102: An interactive buffer is processed more slowly,
2103: but must be used when the scanner's input source is indeed
2104: interactive to avoid problems due to waiting to fill buffers
2105: (see the discussion of the
1.16 jmc 2106: .Fl I
2107: flag below).
2108: A non-zero value in the macro invocation marks the buffer as interactive,
2109: a zero value as non-interactive.
2110: Note that use of this macro overrides
2111: .Dq %option always-interactive
2112: or
2113: .Dq %option never-interactive
2114: (see
2115: .Sx OPTIONS
2116: below).
2117: .Fn yy_set_interactive
1.1 deraadt 2118: must be invoked prior to beginning to scan the buffer that is
1.16 jmc 2119: .Pq or is not
2120: to be considered interactive.
2121: .Pp
1.1 deraadt 2122: The macro
1.16 jmc 2123: .Dv yy_set_bol(at_bol)
1.1 deraadt 2124: can be used to control whether the current buffer's scanning
2125: context for the next token match is done as though at the
1.16 jmc 2126: beginning of a line.
2127: A non-zero macro argument makes rules anchored with
2128: .Sq ^
2129: active, while a zero argument makes
2130: .Sq ^
2131: rules inactive.
2132: .Pp
1.1 deraadt 2133: The macro
1.16 jmc 2134: .Dv YY_AT_BOL
2135: returns true if the next token scanned from the current buffer will have
2136: .Sq ^
2137: rules active, false otherwise.
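.Pp
A small sketch of how
.Fn yy_set_bol
might be used:
here a backslash-newline pair is treated as a line continuation,
so the text that follows it should not be matched by
.Sq ^
rules.
The token name
.Dv TOK_DIRECTIVE
is hypothetical.
.Bd -literal -offset indent
%%
\e\e\en        yy_set_bol(0);  /* continuation: not a new line */
^#[a-z]+    return TOK_DIRECTIVE;
.Ed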
2138: .Pp
1.1 deraadt 2139: In the generated scanner, the actions are all gathered in one large
2140: switch statement and separated using
1.16 jmc 2141: .Dv YY_BREAK ,
2142: which may be redefined.
2143: By default, it is simply a
2144: .Qq break ,
2145: to separate each rule's action from the following rules.
1.1 deraadt 2146: Redefining
1.16 jmc 2147: .Dv YY_BREAK
1.1 deraadt 2148: allows, for example, C++ users to
1.16 jmc 2149: .Dq #define YY_BREAK
2150: to do nothing
2151: (while being very careful that every rule ends with a
2152: .Qq break
2153: or a
2154: .Qq return ! )
2155: to avoid unreachable statement warnings in cases where, because a rule's
2156: action ends with
2157: .Dq return ,
2158: the
2159: .Dv YY_BREAK
1.1 deraadt 2160: is inaccessible.
1.16 jmc 2161: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2162: This section summarizes the various values available to the user
2163: in the rule actions.
1.16 jmc 2164: .Bl -tag -width Ds
2165: .It char *yytext
2166: Holds the text of the current token.
2167: It may be modified but not lengthened
2168: .Pq characters cannot be appended to the end .
2169: .Pp
1.1 deraadt 2170: If the special directive
1.16 jmc 2171: .Dq %array
1.1 deraadt 2172: appears in the first section of the scanner description, then
1.16 jmc 2173: .Fa yytext
1.1 deraadt 2174: is instead declared
1.16 jmc 2175: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2176: where
1.16 jmc 2177: .Dv YYLMAX
2178: is a macro definition that can be redefined in the first section
2179: to change the default value
2180: .Pq generally 8KB .
2181: Using
2182: .Dq %array
1.1 deraadt 2183: results in somewhat slower scanners, but the value of
1.16 jmc 2184: .Fa yytext
1.1 deraadt 2185: becomes immune to calls to
1.16 jmc 2186: .Fn input
1.1 deraadt 2187: and
1.16 jmc 2188: .Fn unput ,
1.1 deraadt 2189: which potentially destroy its value when
1.16 jmc 2190: .Fa yytext
2191: is a character pointer.
2192: The opposite of
2193: .Dq %array
1.1 deraadt 2194: is
1.16 jmc 2195: .Dq %pointer ,
1.1 deraadt 2196: which is the default.
1.16 jmc 2197: .Pp
2198: .Dq %array
2199: cannot be used when generating C++ scanner classes
1.1 deraadt 2200: (the
1.16 jmc 2201: .Fl +
1.1 deraadt 2202: flag).
1.16 jmc 2203: .It int yyleng
2204: Holds the length of the current token.
2205: .It FILE *yyin
2206: Is the file which by default
2207: .Nm
2208: reads from.
2209: It may be redefined, but doing so only makes sense before
2210: scanning begins or after an
2211: .Dv EOF
2212: has been encountered.
2213: Changing it in the midst of scanning will have unexpected results since
2214: .Nm
1.1 deraadt 2215: buffers its input; use
1.16 jmc 2216: .Fn yyrestart
1.1 deraadt 2217: instead.
2218: Once scanning terminates because an end-of-file
1.16 jmc 2219: has been seen,
2220: .Fa yyin
2221: can be assigned as the new input file
2222: and the scanner can be called again to continue scanning.
2223: .It void yyrestart(FILE *new_file)
2224: May be called to point
2225: .Fa yyin
2226: at the new input file.
2227: The switch-over to the new file is immediate
2228: .Pq any previously buffered-up input is lost .
2229: Note that calling
2230: .Fn yyrestart
1.1 deraadt 2231: with
1.16 jmc 2232: .Fa yyin
1.1 deraadt 2233: as an argument thus throws away the current input buffer and continues
2234: scanning the same input file.
1.16 jmc 2235: .It FILE *yyout
2236: Is the file to which
2237: .Em ECHO
2238: actions are done.
2239: It can be reassigned by the user.
2240: .It YY_CURRENT_BUFFER
2241: Returns a
2242: .Dv YY_BUFFER_STATE
1.1 deraadt 2243: handle to the current buffer.
1.16 jmc 2244: .It YY_START
2245: Returns an integer value corresponding to the current start condition.
2246: This value can subsequently be used with
2247: .Em BEGIN
1.1 deraadt 2248: to return to that start condition.
1.16 jmc 2249: .El
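.Pp
The following sketch shows one way
.Fn yyrestart
and
.Fa yyleng
might be combined to scan each file named on the command line.
It assumes a scanner built with
.Dq %option noyywrap ;
the counting rule and the variable
.Fa nchars
are purely illustrative.
.Bd -literal -offset indent
%{
#include <stdio.h>

static long nchars;	/* hypothetical running total */
%}
%option noyywrap
%%
[^ \et\en]+    nchars += yyleng;   /* count non-blank characters */
[ \et\en]+     ;                   /* skip whitespace */
%%
int
main(int argc, char *argv[])
{
	FILE *fp;
	int i;

	for (i = 1; i < argc; i++) {
		if ((fp = fopen(argv[i], "r")) == NULL) {
			perror(argv[i]);
			continue;
		}
		yyrestart(fp);	/* points yyin at fp */
		yylex();	/* returns at end-of-file */
		fclose(fp);
	}
	printf("%ld\en", nchars);
	return 0;
}
.Ed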
2250: .Sh INTERFACING WITH YACC
1.1 deraadt 2251: One of the main uses of
1.16 jmc 2252: .Nm
1.1 deraadt 2253: is as a companion to the
1.16 jmc 2254: .Xr yacc 1
1.1 deraadt 2255: parser-generator.
1.16 jmc 2256: yacc parsers expect to call a routine named
2257: .Fn yylex
2258: to find the next input token.
2259: The routine is supposed to return the type of the next token
2260: as well as putting any associated value in the global
1.17 jmc 2261: .Fa yylval ,
2262: which is defined externally,
2263: and can be a union or any other complex data structure.
1.1 deraadt 2264: To use
1.16 jmc 2265: .Nm
2266: with yacc, one specifies the
2267: .Fl d
2268: option to yacc to instruct it to generate the file
2269: .Pa y.tab.h
1.1 deraadt 2270: containing definitions of all the
1.16 jmc 2271: .Dq %tokens
2272: appearing in the yacc input.
2273: This file is then included in the
2274: .Nm
2275: scanner.
2276: For example, if one of the tokens is
2277: .Qq TOK_NUMBER ,
1.1 deraadt 2278: part of the scanner might look like:
1.16 jmc 2279: .Bd -literal -offset indent
2280: %{
2281: #include "y.tab.h"
2282: %}
2283:
2284: %%
2285:
2286: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
2287: .Ed
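.Pp
Assuming a grammar in a file called
.Pa parse.y
that supplies its own
.Fn main
and
.Fn yyerror ,
and a scanner in
.Pa scan.l
(both file names are hypothetical),
the pieces might be built along these lines:
.Bd -literal -offset indent
$ yacc -d parse.y       # writes y.tab.c and y.tab.h
$ flex scan.l           # writes lex.yy.c
$ cc -o prog y.tab.c lex.yy.c -lfl
.Ed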
2288: .Sh OPTIONS
2289: .Nm
1.1 deraadt 2290: has the following options:
1.16 jmc 2291: .Bl -tag -width Ds
2292: .It Fl 7
2293: Instructs
2294: .Nm
2295: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2296: characters in its input.
2297: The advantage of using
2298: .Fl 7
1.1 deraadt 2299: is that the scanner's tables can be up to half the size of those generated
2300: using the
1.16 jmc 2301: .Fl 8
2302: option
2303: .Pq see below .
2304: The disadvantage is that such scanners often hang
1.1 deraadt 2305: or crash if their input contains an 8-bit character.
1.16 jmc 2306: .Pp
2307: Note, however, that unless generating a scanner using the
2308: .Fl Cf
1.1 deraadt 2309: or
1.16 jmc 2310: .Fl CF
1.1 deraadt 2311: table compression options, use of
1.16 jmc 2312: .Fl 7
2313: will save only a small amount of table space,
2314: and make the scanner considerably less portable.
2315: .Nm flex Ns 's
2316: default behavior is to generate an 8-bit scanner unless
2317: .Fl Cf
2318: or
2319: .Fl CF
2320: is specified, in which case
2321: .Nm
2322: defaults to generating 7-bit scanners unless it was
2323: configured to generate 8-bit scanners
2324: (as will often be the case with non-USA sites).
2325: It is possible to tell whether
2326: .Nm
2327: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2328: .Fl v
2329: output as described below.
2330: .Pp
2331: Note that if
2332: .Fl Cfe
2333: or
2334: .Fl CFe
2335: is used
2336: (the table compression options, but also using equivalence classes as
2337: discussed below),
2338: .Nm
2339: still defaults to generating an 8-bit scanner,
2340: since usually with these compression options full 8-bit tables
1.1 deraadt 2341: are not much more expensive than 7-bit tables.
1.16 jmc 2342: .It Fl 8
2343: Instructs
2344: .Nm
1.1 deraadt 2345: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 jmc 2346: characters.
2347: This flag is only needed for scanners generated using
2348: .Fl Cf
1.1 deraadt 2349: or
1.16 jmc 2350: .Fl CF ,
2351: as otherwise
2352: .Nm
2353: defaults to generating an 8-bit scanner anyway.
2354: .Pp
1.1 deraadt 2355: See the discussion of
1.16 jmc 2356: .Fl 7
2357: above for
2358: .Nm flex Ns 's
2359: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2360: .It Fl B
2361: Instructs
2362: .Nm
2363: to generate a
2364: .Em batch
2365: scanner, the opposite of
2366: .Em interactive
2367: scanners generated by
2368: .Fl I
2369: .Pq see below .
2370: In general,
2371: .Fl B
2372: is used when the scanner will never be used interactively,
2373: and you want to squeeze a little more performance out of it.
2374: If the aim is instead to squeeze out a lot more performance,
2375: use the
2376: .Fl Cf
2377: or
2378: .Fl CF
2379: options
2380: .Pq discussed below ,
2381: which turn on
2382: .Fl B
2383: automatically anyway.
2384: .It Fl b
2385: Generate backing-up information to
2386: .Pa lex.backup .
2387: This is a list of scanner states which require backing up
2388: and the input characters on which they do so.
2389: By adding rules one can remove backing-up states.
2390: If all backing-up states are eliminated and
2391: .Fl Cf
2392: or
2393: .Fl CF
2394: is used, the generated scanner will run faster (see the
2395: .Fl p
2396: flag).
2397: Only users who wish to squeeze every last cycle out of their
2398: scanners need worry about this option.
2399: (See the section on
2400: .Sx PERFORMANCE CONSIDERATIONS
2401: below.)
2402: .It Fl C Ns Op Cm aeFfmr
2403: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2404: between small scanners and fast scanners.
1.16 jmc 2405: .Bl -tag -width Ds
2406: .It Fl Ca
2407: Instructs
2408: .Nm
2409: to trade off larger tables in the generated scanner for faster performance
2410: because the elements of the tables are better aligned for memory access
2411: and computation.
2412: On some
2413: .Tn RISC
2414: architectures, fetching and manipulating longwords is more efficient
2415: than with smaller-sized units such as shortwords.
2416: This option can double the size of the tables used by the scanner.
2417: .It Fl Ce
2418: Directs
2419: .Nm
1.1 deraadt 2420: to construct
1.16 jmc 2421: .Em equivalence classes ,
2422: i.e., sets of characters which have identical lexical properties
2423: (for example, if the only appearance of digits in the
2424: .Nm
1.1 deraadt 2425: input is in the character class
1.16 jmc 2426: .Qq [0-9]
2427: then the digits
2428: .Sq 0 ,
2429: .Sq 1 ,
2430: .Sq ... ,
2431: .Sq 9
2432: will all be put in the same equivalence class).
2433: Equivalence classes usually give dramatic reductions in the final
2434: table/object file sizes
2435: .Pq typically a factor of 2\-5
2436: and are pretty cheap performance-wise
2437: .Pq one array look-up per character scanned .
2438: .It Fl CF
2439: Specifies that the alternate fast scanner representation
2440: (described below under the
2441: .Fl F
2442: option)
2443: should be used.
2444: This option cannot be used with
2445: .Fl + .
2446: .It Fl Cf
2447: Specifies that the
2448: .Em full
2449: scanner tables should be generated \-
2450: .Nm
2451: should not compress the tables by taking advantage of
2452: similar transition functions for different states.
2453: .It Fl \&Cm
2454: Directs
2455: .Nm
1.1 deraadt 2456: to construct
1.16 jmc 2457: .Em meta-equivalence classes ,
2458: which are sets of equivalence classes
2459: (or characters, if equivalence classes are not being used)
2460: that are commonly used together.
2461: Meta-equivalence classes are often a big win when using compressed tables,
2462: but they have a moderate performance impact
2463: (one or two
2464: .Qq if
2465: tests and one array look-up per character scanned).
2466: .It Fl Cr
2467: Causes the generated scanner to
2468: .Em bypass
2469: use of the standard I/O library
2470: .Pq stdio
2471: for input.
2472: Instead of calling
2473: .Xr fread 3
1.1 deraadt 2474: or
1.16 jmc 2475: .Xr getc 3 ,
1.1 deraadt 2476: the scanner will use the
1.16 jmc 2477: .Xr read 2
2478: system call,
2479: resulting in a performance gain which varies from system to system,
2480: but in general is probably negligible unless
2481: .Fl Cf
1.1 deraadt 2482: or
1.16 jmc 2483: .Fl CF
2484: is being used.
1.1 deraadt 2485: Using
1.16 jmc 2486: .Fl Cr
2487: can cause strange behavior if, for example, the program reads from
2488: .Fa yyin
2489: using stdio prior to calling the scanner
2490: (because the scanner will miss whatever text previous reads left
2491: in the stdio input buffer).
2492: .Pp
2493: .Fl Cr
2494: has no effect if
2495: .Dv YY_INPUT
2496: is defined
2497: (see
2498: .Sx THE GENERATED SCANNER
2499: above).
2500: .El
2501: .Pp
1.1 deraadt 2502: A lone
1.16 jmc 2503: .Fl C
1.1 deraadt 2504: specifies that the scanner tables should be compressed but neither
2505: equivalence classes nor meta-equivalence classes should be used.
1.16 jmc 2506: .Pp
1.1 deraadt 2507: The options
1.16 jmc 2508: .Fl Cf
1.1 deraadt 2509: or
1.16 jmc 2510: .Fl CF
1.1 deraadt 2511: and
1.16 jmc 2512: .Fl \&Cm
2513: do not make sense together \- there is no opportunity for meta-equivalence
2514: classes if the table is not being compressed.
2515: Otherwise the options may be freely mixed, and are cumulative.
2516: .Pp
1.1 deraadt 2517: The default setting is
1.16 jmc 2518: .Fl Cem
1.1 deraadt 2519: which specifies that
1.16 jmc 2520: .Nm
2521: should generate equivalence classes and meta-equivalence classes.
2522: This setting provides the highest degree of table compression.
2523: It is possible to trade off faster-executing scanners at the cost of
2524: larger tables with the following generally being true:
2525: .Bd -unfilled -offset indent
2526: slowest & smallest
2527: -Cem
2528: -Cm
2529: -Ce
2530: -C
2531: -C{f,F}e
2532: -C{f,F}
2533: -C{f,F}a
2534: fastest & largest
2535: .Ed
2536: .Pp
1.1 deraadt 2537: Note that scanners with the smallest tables are usually generated and
1.16 jmc 2538: compiled the quickest,
2539: so during development the default,
2540: maximal compression, is usually best.
2541: .Pp
2542: .Fl Cfe
2543: is often a good compromise between speed and size for production scanners.
2544: .It Fl d
2545: Makes the generated scanner run in debug mode.
2546: Whenever a pattern is recognized and the global
2547: .Fa yy_flex_debug
2548: is non-zero
2549: .Pq which is the default ,
2550: the scanner will write to stderr a line of the form:
2551: .Pp
2552: .D1 --accepting rule at line 53 ("the matched text")
2553: .Pp
2554: The line number refers to the location of the rule in the file
2555: defining the scanner
2556: (i.e., the file that was fed to
2557: .Nm ) .
2558: Messages are also generated when the scanner backs up,
2559: accepts the default rule,
2560: reaches the end of its input buffer
2561: (or encounters a NUL;
2562: at this point, the two look the same as far as the scanner's concerned),
2563: or reaches an end-of-file.
2564: .It Fl F
2565: Specifies that the fast scanner table representation should be used
2566: .Pq and stdio bypassed .
2567: This representation is about as fast as the full table representation
2568: .Pq Fl f ,
2569: and for some sets of patterns will be considerably smaller
2570: .Pq and for others, larger .
2571: In general, if the pattern set contains both
2572: .Qq keywords
2573: and a catch-all,
2574: .Qq identifier
2575: rule, such as in the set:
2576: .Bd -unfilled -offset indent
2577: "case" return TOK_CASE;
2578: "switch" return TOK_SWITCH;
2579: \&...
2580: "default" return TOK_DEFAULT;
2581: [a-z]+ return TOK_ID;
2582: .Ed
2583: .Pp
2584: then it's better to use the full table representation.
2585: If only the
2586: .Qq identifier
2587: rule is present and a hash table or some such is used to detect the keywords,
2588: it's better to use
2589: .Fl F .
2590: .Pp
2591: This option is equivalent to
2592: .Fl CFr
2593: .Pq see above .
2594: It cannot be used with
2595: .Fl + .
2596: .It Fl f
2597: Specifies
2598: .Em fast scanner .
2599: No table compression is done and stdio is bypassed.
2600: The result is large but fast.
2601: This option is equivalent to
2602: .Fl Cfr
2603: .Pq see above .
2604: .It Fl h
2605: Generates a help summary of
2606: .Nm flex Ns 's
2607: options to stdout and then exits.
2608: .Fl ?\&
2609: and
2610: .Fl Fl help
2611: are synonyms for
2612: .Fl h .
2613: .It Fl I
2614: Instructs
2615: .Nm
2616: to generate an
2617: .Em interactive
2618: scanner.
2619: An interactive scanner is one that only looks ahead to decide
2620: what token has been matched if it absolutely must.
2621: It turns out that always looking one extra character ahead,
2622: even if the scanner has already seen enough text
2623: to disambiguate the current token, is a bit faster than
2624: only looking ahead when necessary.
2625: But scanners that always look ahead give dreadful interactive performance;
2626: for example, when a user types a newline,
2627: it is not recognized as a newline token until they enter
2628: .Em another
2629: token, which often means typing in another whole line.
2630: .Pp
2631: .Nm
2632: scanners default to
2633: .Em interactive
2634: unless
2635: .Fl Cf
2636: or
2637: .Fl CF
2638: table-compression options are specified
2639: .Pq see above .
2640: That's because if high performance is most important,
2641: one of these options should be used,
2642: so if they weren't,
2643: .Nm
1.24 sobrado 2644: assumes it is preferable to trade off a bit of run-time performance for
1.16 jmc 2645: intuitive interactive behavior.
2646: Note also that
2647: .Fl I
2648: cannot be used in conjunction with
2649: .Fl Cf
2650: or
2651: .Fl CF .
2652: Thus, this option is not really needed; it is on by default for all those
2653: cases in which it is allowed.
2654: .Pp
2655: A scanner can be forced to not be interactive by using
2656: .Fl B
2657: .Pq see above .
2658: .It Fl i
2659: Instructs
2660: .Nm
2661: to generate a case-insensitive scanner.
2662: The case of letters given in the
2663: .Nm
2664: input patterns will be ignored,
2665: and tokens in the input will be matched regardless of case.
2666: The matched text given in
2667: .Fa yytext
2668: will have the preserved case
2669: .Pq i.e., it will not be folded .
2670: .It Fl L
2671: Instructs
2672: .Nm
2673: not to generate
2674: .Dq #line
2675: directives.
2676: Without this option,
2677: .Nm
2678: peppers the generated scanner with #line directives so error messages
2679: in the actions will be correctly located with respect to either the original
2680: .Nm
2681: input file
2682: (if the errors are due to code in the input file),
2683: or
2684: .Pa lex.yy.c
2685: (if the errors are
2686: .Nm flex Ns 's
2687: fault \- these sorts of errors should be reported to the email address
2688: given below).
2689: .It Fl l
2690: Turns on maximum compatibility with the original AT&T
2691: .Nm lex
2692: implementation.
2693: Note that this does not mean full compatibility.
2694: Use of this option costs a considerable amount of performance,
2695: and it cannot be used with the
2696: .Fl + , f , F , Cf ,
2697: or
2698: .Fl CF
2699: options.
2700: For details on the compatibilities it provides, see the section
2701: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
2702: below.
2703: This option also results in the name
2704: .Dv YY_FLEX_LEX_COMPAT
2705: being #define'd in the generated scanner.
2706: .It Fl n
2707: Another do-nothing, deprecated option included only for
2708: .Tn POSIX
2709: compliance.
2710: .It Fl o Ns Ar output
2711: Directs
2712: .Nm
2713: to write the scanner to the file
2714: .Ar output
1.1 deraadt 2715: instead of
1.16 jmc 2716: .Pa lex.yy.c .
2717: If
2718: .Fl o
2719: is combined with the
2720: .Fl t
2721: option, then the scanner is written to stdout but its
2722: .Dq #line
2723: directives
2724: (see the
2725: .Fl L
2726: option above)
2727: refer to the file
2728: .Ar output .
2729: .It Fl P Ns Ar prefix
2730: Changes the default
2731: .Qq yy
1.1 deraadt 2732: prefix used by
1.16 jmc 2733: .Nm
1.6 aaron 2734: for all globally visible variable and function names to instead be
1.16 jmc 2735: .Ar prefix .
1.1 deraadt 2736: For example,
1.16 jmc 2737: .Fl P Ns Ar foo
1.1 deraadt 2738: changes the name of
1.16 jmc 2739: .Fa yytext
1.1 deraadt 2740: to
1.16 jmc 2741: .Fa footext .
1.1 deraadt 2742: It also changes the name of the default output file from
1.16 jmc 2743: .Pa lex.yy.c
1.1 deraadt 2744: to
1.16 jmc 2745: .Pa lex.foo.c .
1.1 deraadt 2746: Here are all of the names affected:
1.16 jmc 2747: .Bd -unfilled -offset indent
2748: yy_create_buffer
2749: yy_delete_buffer
2750: yy_flex_debug
2751: yy_init_buffer
2752: yy_flush_buffer
2753: yy_load_buffer_state
2754: yy_switch_to_buffer
2755: yyin
2756: yyleng
2757: yylex
2758: yylineno
2759: yyout
2760: yyrestart
2761: yytext
2762: yywrap
2763: .Ed
2764: .Pp
2765: (If using a C++ scanner, then only
2766: .Fa yywrap
1.1 deraadt 2767: and
1.16 jmc 2768: .Fa yyFlexLexer
1.1 deraadt 2769: are affected.)
1.16 jmc 2770: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2771: and functions using either version of their name; but externally, they
2772: have the modified name.
1.16 jmc 2773: .Pp
2774: This option allows multiple
2775: .Nm
2776: programs to be easily linked together into the same executable.
2777: Note, though, that using this option also renames
2778: .Fn yywrap ,
2779: so now either an
2780: .Pq appropriately named
2781: version of the routine for the scanner must be supplied, or
2782: .Dq %option noyywrap
2783: must be used, as linking with
2784: .Fl lfl
2785: no longer provides one by default.
2786: .It Fl p
2787: Generates a performance report to stderr.
2788: The report consists of comments regarding features of the
2789: .Nm
2790: input file which will cause a serious loss of performance in the resulting
2791: scanner.
2792: If the flag is specified twice,
2793: comments regarding features that lead to minor performance losses
2794: will also be reported.
2795: .Pp
2796: Note that the use of
2797: .Em REJECT ,
2798: .Dq %option yylineno ,
2799: and variable trailing context
2800: (see the
2801: .Sx BUGS
2802: section below)
2803: entails a substantial performance penalty; use of
2804: .Fn yymore ,
2805: the
2806: .Sq ^
2807: operator, and the
2808: .Fl I
2809: flag entail minor performance penalties.
2810: .It Fl S Ns Ar skeleton
2811: Overrides the default skeleton file from which
2812: .Nm
2813: constructs its scanners.
2814: This option is needed only for
2815: .Nm
1.1 deraadt 2816: maintenance or development.
1.16 jmc 2817: .It Fl s
2818: Causes the default rule
2819: .Pq that unmatched scanner input is echoed to stdout
2820: to be suppressed.
2821: If the scanner encounters input that does not
2822: match any of its rules, it aborts with an error.
2823: This option is useful for finding holes in a scanner's rule set.
2824: .It Fl T
2825: Makes
2826: .Nm
2827: run in
2828: .Em trace
2829: mode.
2830: It will generate a lot of messages to stderr concerning
2831: the form of the input and the resultant non-deterministic and deterministic
2832: finite automata.
2833: This option is mostly for use in maintaining
2834: .Nm .
2835: .It Fl t
2836: Instructs
2837: .Nm
2838: to write the scanner it generates to standard output instead of
2839: .Pa lex.yy.c .
2840: .It Fl V
2841: Prints the version number to stdout and exits.
2842: .Fl Fl version
2843: is a synonym for
2844: .Fl V .
2845: .It Fl v
2846: Specifies that
2847: .Nm
2848: should write to stderr
2849: a summary of statistics regarding the scanner it generates.
2850: Most of the statistics are meaningless to the casual
2851: .Nm
2852: user, but the first line identifies the version of
2853: .Nm
2854: (same as reported by
2855: .Fl V ) ,
2856: and the next line the flags used when generating the scanner,
2857: including those that are on by default.
2858: .It Fl w
2859: Suppresses warning messages.
2860: .It Fl +
2861: Specifies that
2862: .Nm
2863: should generate a C++ scanner class.
2864: See the section on
2865: .Sx GENERATING C++ SCANNERS
2866: below for details.
2867: .El
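.Pp
A few invocations, sketched with a hypothetical input file
.Pa scan.l ,
illustrate how the flags above combine:
.Bd -literal -offset indent
$ flex -b -p scan.l         # lex.backup plus a performance report
$ flex -Cfe -oscan.c scan.l # full tables with equivalence classes
$ flex -t scan.l > scan.c   # scanner written to standard output
$ flex -Pfoo scan.l         # writes lex.foo.c, renames yylex to foolex
.Ed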
2868: .Pp
2869: .Nm
1.1 deraadt 2870: also provides a mechanism for controlling options within the
1.16 jmc 2871: scanner specification itself, rather than from the
2872: .Nm
1.33 jmc 2873: command line.
1.1 deraadt 2874: This is done by including
1.16 jmc 2875: .Dq %option
1.1 deraadt 2876: directives in the first section of the scanner specification.
1.16 jmc 2877: Multiple options can be specified with a single
2878: .Dq %option
2879: directive, and multiple directives can appear in the first section of the
2880: .Nm
2881: input file.
2882: .Pp
2883: Most options are given simply as names, optionally preceded by the word
2884: .Qq no
2885: .Pq with no intervening whitespace
2886: to negate their meaning.
2887: A number are equivalent to
2888: .Nm
2889: flags or their negation:
2890: .Bd -unfilled -offset indent
2891: 7bit -7 option
2892: 8bit -8 option
2893: align -Ca option
2894: backup -b option
2895: batch -B option
2896: c++ -+ option
2897:
2898: caseful or
2899: case-sensitive opposite of -i (default)
2900:
2901: case-insensitive or
2902: caseless -i option
2903:
2904: debug -d option
2905: default opposite of -s option
2906: ecs -Ce option
2907: fast -F option
2908: full -f option
2909: interactive -I option
2910: lex-compat -l option
2911: meta-ecs -Cm option
2912: perf-report -p option
2913: read -Cr option
2914: stdout -t option
2915: verbose -v option
2916: warn opposite of -w option
2917: (use "%option nowarn" for -w)
2918:
2919: array equivalent to "%array"
2920: pointer equivalent to "%pointer" (default)
2921: .Ed
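.Pp
For instance, a definitions section might begin as follows
(the particular choice of options is only an illustration):
.Bd -literal -offset indent
%option 8bit nodefault
%option case-insensitive verbose
.Ed
.Pp
which, going by the table above, corresponds to running
.Nm
with the
.Fl 8 ,
.Fl s ,
.Fl i ,
and
.Fl v
flags.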
2922: .Pp
2923: Some %option's provide features otherwise not available:
2924: .Bl -tag -width Ds
2925: .It always-interactive
2926: Instructs
2927: .Nm
2928: to generate a scanner which always considers its input
2929: .Qq interactive .
2930: Normally, on each new input file the scanner calls
2931: .Fn isatty
2932: in an attempt to determine whether the scanner's input source is interactive
2933: and thus should be read a character at a time.
2934: When this option is used, however, no such call is made.
2935: .It main
2936: Directs
2937: .Nm
2938: to provide a default
2939: .Fn main
1.1 deraadt 2940: program for the scanner, which simply calls
1.16 jmc 2941: .Fn yylex .
1.1 deraadt 2942: This option implies
1.16 jmc 2943: .Dq noyywrap
2944: .Pq see below .
2945: .It never-interactive
2946: Instructs
2947: .Nm
2948: to generate a scanner which never considers its input
2949: .Qq interactive
2950: (again, no call made to
2951: .Fn isatty ) .
1.1 deraadt 2952: This is the opposite of
1.16 jmc 2953: .Dq always-interactive .
2954: .It stack
2955: Enables the use of start condition stacks
2956: (see
2957: .Sx START CONDITIONS
2958: above).
2959: .It stdinit
2960: If set (i.e.,
2961: .Dq %option stdinit ) ,
1.1 deraadt 2962: initializes
1.16 jmc 2963: .Fa yyin
1.1 deraadt 2964: and
1.16 jmc 2965: .Fa yyout
2966: to stdin and stdout, instead of the default of
2967: .Dq nil .
1.1 deraadt 2968: Some existing
1.16 jmc 2969: .Nm lex
2970: programs depend on this behavior, even though it is not compliant with ANSI C,
2971: which does not require stdin and stdout to be compile-time constant.
2972: .It yylineno
2973: Directs
2974: .Nm
1.1 deraadt 2975: to generate a scanner that maintains the number of the current line
2976: read from its input in the global variable
1.16 jmc 2977: .Fa yylineno .
1.1 deraadt 2978: This option is implied by
1.16 jmc 2979: .Dq %option lex-compat .
2980: .It yywrap
2981: If unset (i.e.,
2982: .Dq %option noyywrap ) ,
1.1 deraadt 2983: makes the scanner not call
1.16 jmc 2984: .Fn yywrap
2985: upon an end-of-file, but simply assume that there are no more files to scan
2986: (until the user points
2987: .Fa yyin
1.1 deraadt 2988: at a new file and calls
1.16 jmc 2989: .Fn yylex
1.1 deraadt 2990: again).
1.16 jmc 2991: .El
2992: .Pp
2993: .Nm
2994: scans rule actions to determine whether the
2995: .Em REJECT
2996: or
2997: .Fn yymore
2998: features are being used.
2999: The
3000: .Dq reject
1.1 deraadt 3001: and
1.16 jmc 3002: .Dq yymore
3003: options are available to override its decision as to whether the
1.1 deraadt 3004: features are used, either by setting them (e.g.,
1.16 jmc 3005: .Dq %option reject )
3006: to indicate the feature is indeed used,
3007: or unsetting them to indicate it actually is not used
1.1 deraadt 3008: (e.g.,
1.16 jmc 3009: .Dq %option noyymore ) .
3010: .Pp
3011: Three options take string-delimited values, offset with
3012: .Sq = :
3013: .Pp
3014: .D1 %option outfile="ABC"
3015: .Pp
1.1 deraadt 3016: is equivalent to
1.16 jmc 3017: .Fl o Ns Ar ABC ,
1.1 deraadt 3018: and
1.16 jmc 3019: .Pp
3020: .D1 %option prefix="XYZ"
3021: .Pp
1.1 deraadt 3022: is equivalent to
1.16 jmc 3023: .Fl P Ns Ar XYZ .
1.1 deraadt 3024: Finally,
1.16 jmc 3025: .Pp
3026: .D1 %option yyclass="foo"
3027: .Pp
3028: only applies when generating a C++ scanner
3029: .Pf ( Fl +
3030: option).
3031: It informs
3032: .Nm
3033: that
3034: .Dq foo
3035: has been derived as a subclass of yyFlexLexer, so
3036: .Nm
3037: will place actions in the member function
3038: .Dq foo::yylex()
1.1 deraadt 3039: instead of
1.16 jmc 3040: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3041: It also generates a
1.16 jmc 3042: .Dq yyFlexLexer::yylex()
1.1 deraadt 3043: member function that emits a run-time error (by invoking
1.16 jmc 3044: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3045: if called.
1.16 jmc 3046: See
3047: .Sx GENERATING C++ SCANNERS ,
3048: below, for additional information.
3049: .Pp
3050: A number of options are available for
1.32 jmc 3051: lint
1.16 jmc 3052: purists who want to suppress the appearance of unneeded routines
3053: in the generated scanner.
3054: Each of the following, if unset
1.1 deraadt 3055: (e.g.,
1.16 jmc 3056: .Dq %option nounput ) ,
3057: results in the corresponding routine not appearing in the generated scanner:
3058: .Bd -unfilled -offset indent
3059: input, unput
3060: yy_push_state, yy_pop_state, yy_top_state
3061: yy_scan_buffer, yy_scan_bytes, yy_scan_string
3062: .Ed
3063: .Pp
1.1 deraadt 3064: (though
1.16 jmc 3065: .Fn yy_push_state
3066: and friends won't appear anyway unless
3067: .Dq %option stack
3068: is being used).
3069: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3070: The main design goal of
1.16 jmc 3071: .Nm
3072: is that it generate high-performance scanners.
3073: It has been optimized for dealing well with large sets of rules.
3074: Aside from the effects on scanner speed of the table compression
3075: .Fl C
1.1 deraadt 3076: options outlined above,
1.16 jmc 3077: there are a number of options/actions which degrade performance.
3078: These are, from most expensive to least:
3079: .Bd -unfilled -offset indent
3080: REJECT
3081: %option yylineno
3082: arbitrary trailing context
3083:
3084: pattern sets that require backing up
3085: %array
3086: %option interactive
3087: %option always-interactive
3088:
3089: \&'^' beginning-of-line operator
3090: yymore()
3091: .Ed
3092: .Pp
3093: with the first three all being quite expensive
3094: and the last two being quite cheap.
3095: Note also that
3096: .Fn unput
3097: is implemented as a routine call that potentially does quite a bit of work,
3098: while
3099: .Fn yyless
3100: is a quite-cheap macro; so if just putting back some excess text,
3101: use
3102: .Fn yyless .
3103: .Pp
3104: .Em REJECT
1.1 deraadt 3105: should be avoided at all costs when performance is important.
3106: It is a particularly expensive option.
1.16 jmc 3107: .Pp
1.1 deraadt 3108: Getting rid of backing up is messy and often may be an enormous
1.16 jmc 3109: amount of work for a complicated scanner.
3110: In principal, one begins by using the
3111: .Fl b
1.1 deraadt 3112: flag to generate a
1.16 jmc 3113: .Pa lex.backup
3114: file.
3115: For example, on the input
3116: .Bd -literal -offset indent
3117: %%
3118: foo return TOK_KEYWORD;
3119: foobar return TOK_KEYWORD;
3120: .Ed
3121: .Pp
1.1 deraadt 3122: the file looks like:
1.16 jmc 3123: .Bd -literal -offset indent
3124: State #6 is non-accepting -
3125: associated rule line numbers:
3126: 2 3
3127: out-transitions: [ o ]
3128: jam-transitions: EOF [ \e001-n p-\e177 ]
3129:
3130: State #8 is non-accepting -
3131: associated rule line numbers:
3132: 3
3133: out-transitions: [ a ]
3134: jam-transitions: EOF [ \e001-` b-\e177 ]
3135:
3136: State #9 is non-accepting -
3137: associated rule line numbers:
3138: 3
3139: out-transitions: [ r ]
3140: jam-transitions: EOF [ \e001-q s-\e177 ]
3141:
3142: Compressed tables always back up.
3143: .Ed
3144: .Pp
1.1 deraadt 3145: The first few lines tell us that there's a scanner state in
1.16 jmc 3146: which it can make a transition on an
3147: .Sq o
3148: but not on any other character,
3149: and that in that state the currently scanned text does not match any rule.
3150: The state occurs when trying to match the rules found
1.1 deraadt 3151: at lines 2 and 3 in the input file.
1.16 jmc 3152: If the scanner is in that state and then reads something other than an
3153: .Sq o ,
3154: it will have to back up to find a rule which is matched.
3155: With a bit of headscratching one can see that this must be the
3156: state it's in when it has seen
3157: .Sq fo .
3158: When this has happened, if anything other than another
3159: .Sq o
3160: is seen, the scanner will have to back up to simply match the
3161: .Sq f
3162: .Pq by the default rule .
3163: .Pp
3164: The comment regarding State #8 indicates there's a problem when
3165: .Qq foob
3166: has been scanned.
3167: Indeed, on any character other than an
3168: .Sq a ,
3169: the scanner will have to back up to accept
3170: .Qq foo .
3171: Similarly, the comment for State #9 concerns when
3172: .Qq fooba
3173: has been scanned and an
3174: .Sq r
3175: does not follow.
3176: .Pp
1.1 deraadt 3177: The final comment reminds us that there's no point going to
1.16 jmc 3178: all the trouble of removing backing up from the rules unless we're using
3179: .Fl Cf
1.1 deraadt 3180: or
1.16 jmc 3181: .Fl CF ,
1.1 deraadt 3182: since there's no performance gain doing so with compressed scanners.
1.16 jmc 3183: .Pp
3184: The way to remove the backing up is to add
3185: .Qq error
3186: rules:
3187: .Bd -literal -offset indent
3188: %%
3189: foo return TOK_KEYWORD;
3190: foobar return TOK_KEYWORD;
3191:
3192: fooba |
3193: foob |
3194: fo {
3195: /* false alarm, not really a keyword */
3196: return TOK_ID;
3197: }
3198: .Ed
3199: .Pp
3200: Eliminating backing up among a list of keywords can also be done using a
3201: .Qq catch-all
3202: rule:
3203: .Bd -literal -offset indent
3204: %%
3205: foo return TOK_KEYWORD;
3206: foobar return TOK_KEYWORD;
3207:
3208: [a-z]+ return TOK_ID;
3209: .Ed
3210: .Pp
1.1 deraadt 3211: This is usually the best solution when appropriate.
1.16 jmc 3212: .Pp
1.1 deraadt 3213: Backing up messages tend to cascade.
1.16 jmc 3214: With a complicated set of rules it's not uncommon to get hundreds of messages.
3215: If one can decipher them, though,
3216: it often only takes a dozen or so rules to eliminate the backing up
3217: (though it's easy to make a mistake and have an error rule accidentally match
3218: a valid token; a possible future
3219: .Nm
1.1 deraadt 3220: feature will be to automatically add rules to eliminate backing up).
1.16 jmc 3221: .Pp
3222: It's important to keep in mind that the benefits of eliminating
3223: backing up are gained only if
3224: .Em every
3225: instance of backing up is eliminated.
3226: Leaving just one gains nothing.
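The backing-up behaviour described above can be modelled outside of flex. The following is a minimal sketch, not the generated scanner: it treats each rule as a literal string, assumes the default rule accepts any single character, and reports how many characters were read past the last accepting position. The names `MatchResult`, `is_live_prefix`, `is_rule`, and `scan_one_token` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct MatchResult {
    std::size_t length;     // length of the text finally matched
    std::size_t backed_up;  // characters read beyond the match and given back
};

// True if some rule begins with `text`, i.e. the scanner can still advance.
static bool is_live_prefix(const std::vector<std::string>& rules,
                           const std::string& text) {
    for (const auto& rule : rules)
        if (rule.compare(0, text.size(), text) == 0)
            return true;
    return false;
}

// True if `text` is exactly one of the literal rules.
static bool is_rule(const std::vector<std::string>& rules,
                    const std::string& text) {
    return std::find(rules.begin(), rules.end(), text) != rules.end();
}

// Scans one token from the start of `input` using longest-match semantics.
MatchResult scan_one_token(const std::vector<std::string>& rules,
                           const std::string& input) {
    if (input.empty())
        return MatchResult{0, 0};
    // Assumption: the default rule accepts any single character.
    std::size_t read = 1, accepted = 1;
    while (read < input.size() &&
           is_live_prefix(rules, input.substr(0, read + 1))) {
        ++read;
        if (is_rule(rules, input.substr(0, read)))
            accepted = read;  // remember the last accepting position
    }
    // Anything read after the last accepting position must be backed up over.
    return MatchResult{accepted, read - accepted};
}
```

With the rules `foo` and `foobar`, the input `foobaz` is matched as `foo` after backing up over two characters. Once the error rules `fooba`, `foob`, and `fo` are added, the same input is matched as `fooba` with nothing to back up over.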
3227: .Pp
3228: .Em Variable
3229: trailing context
3230: (where both the leading and trailing parts do not have a fixed length)
3231: entails almost the same performance loss as
3232: .Em REJECT
3233: .Pq i.e., substantial .
3234: So when possible a rule like:
3235: .Bd -literal -offset indent
3236: %%
3237: mouse|rat/(cat|dog) run();
3238: .Ed
3239: .Pp
1.1 deraadt 3240: is better written:
1.16 jmc 3241: .Bd -literal -offset indent
3242: %%
3243: mouse/cat|dog run();
3244: rat/cat|dog run();
3245: .Ed
3246: .Pp
1.1 deraadt 3247: or as
1.16 jmc 3248: .Bd -literal -offset indent
3249: %%
3250: mouse|rat/cat run();
3251: mouse|rat/dog run();
3252: .Ed
3253: .Pp
3254: Note that here the special
3255: .Sq |\&
3256: action does not provide any savings, and can even make things worse (see
3257: .Sx BUGS
3258: below).
3259: .Pp
1.1 deraadt 3260: Another area where the user can increase a scanner's performance
1.16 jmc 3261: .Pq and one that's easier to implement
3262: arises from the fact that the longer the tokens matched,
3263: the faster the scanner will run.
1.1 deraadt 3264: This is because with long tokens the processing of most input
1.16 jmc 3265: characters takes place in the
3266: .Pq short
3267: inner scanning loop, and does not often have to go through the additional work
3268: of setting up the scanning environment (e.g.,
3269: .Fa yytext )
3270: for the action.
3271: Recall the scanner for C comments:
3272: .Bd -literal -offset indent
3273: %x comment
3274: %%
3275: int line_num = 1;
3276:
3277: "/*" BEGIN(comment);
3278:
3279: <comment>[^*\en]*
3280: <comment>"*"+[^*/\en]*
3281: <comment>\en ++line_num;
3282: <comment>"*"+"/" BEGIN(INITIAL);
3283: .Ed
3284: .Pp
1.1 deraadt 3285: This could be sped up by writing it as:
1.16 jmc 3286: .Bd -literal -offset indent
3287: %x comment
3288: %%
3289: int line_num = 1;
3290:
3291: "/*" BEGIN(comment);
3292:
3293: <comment>[^*\en]*
3294: <comment>[^*\en]*\en ++line_num;
3295: <comment>"*"+[^*/\en]*
3296: <comment>"*"+[^*/\en]*\en ++line_num;
3297: <comment>"*"+"/" BEGIN(INITIAL);
3298: .Ed
3299: .Pp
3300: Now instead of each newline requiring the processing of another action,
3301: recognizing the newlines is
3302: .Qq distributed
3303: over the other rules to keep the matched text as long as possible.
3304: Note that adding rules does
3305: .Em not
3306: slow down the scanner!
3307: The speed of the scanner is independent of the number of rules or
3308: (modulo the considerations given at the beginning of this section)
3309: how complicated the rules are with regard to operators such as
3310: .Sq *
3311: and
3312: .Sq |\& .
3313: .Pp
3314: A final example in speeding up a scanner:
3315: scan through a file containing identifiers and keywords, one per line
3316: and with no other extraneous characters, and recognize all the keywords.
3317: A natural first approach is:
3318: .Bd -literal -offset indent
3319: %%
3320: asm |
3321: auto |
3322: break |
3323: \&... etc ...
3324: volatile |
3325: while /* it's a keyword */
3326:
3327: \&.|\en /* it's not a keyword */
3328: .Ed
3329: .Pp
1.1 deraadt 3330: To eliminate the back-tracking, introduce a catch-all rule:
1.16 jmc 3331: .Bd -literal -offset indent
3332: %%
3333: asm |
3334: auto |
3335: break |
3336: \&... etc ...
3337: volatile |
3338: while /* it's a keyword */
3339:
3340: [a-z]+ |
3341: \&.|\en /* it's not a keyword */
3342: .Ed
3343: .Pp
1.1 deraadt 3344: Now, if it's guaranteed that there's exactly one word per line,
3345: then we can reduce the total number of matches by a half by
1.16 jmc 3346: merging in the recognition of newlines with that of the other tokens:
3347: .Bd -literal -offset indent
3348: %%
3349: asm\en |
3350: auto\en |
3351: break\en |
3352: \&... etc ...
3353: volatile\en |
3354: while\en /* it's a keyword */
3355:
3356: [a-z]+\en |
3357: \&.|\en /* it's not a keyword */
3358: .Ed
3359: .Pp
3360: One has to be careful here,
3361: as we have now reintroduced backing up into the scanner.
3362: In particular, while we know that there will never be any characters
3363: in the input stream other than letters or newlines,
3364: .Nm
1.1 deraadt 3365: can't figure this out, and it will plan for possibly needing to back up
1.16 jmc 3366: when it has scanned a token like
3367: .Qq auto
3368: and then the next character is something other than a newline or a letter.
3369: Previously it would then just match the
3370: .Qq auto
3371: rule and be done, but now it has no
3372: .Qq auto
3373: rule, only an
3374: .Qq auto\en
3375: rule.
3376: To eliminate the possibility of backing up,
1.1 deraadt 3377: we could either duplicate all rules but without final newlines, or,
3378: since we never expect to encounter such an input and therefore don't
1.16 jmc 3379: care how it's classified, we can introduce one more catch-all rule,
3380: this one which doesn't include a newline:
3381: .Bd -literal -offset indent
3382: %%
3383: asm\en |
3384: auto\en |
3385: break\en |
3386: \&... etc ...
3387: volatile\en |
3388: while\en /* it's a keyword */
3389:
3390: [a-z]+\en |
3391: [a-z]+ |
3392: \&.|\en /* it's not a keyword */
3393: .Ed
3394: .Pp
1.1 deraadt 3395: Compiled with
1.16 jmc 3396: .Fl Cf ,
1.1 deraadt 3397: this is about as fast as one can get a
1.16 jmc 3398: .Nm
1.1 deraadt 3399: scanner to go for this particular problem.
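The claim that merging newlines into the word rules halves the number of matches can be checked with a small model. This is only a sketch under simplifying assumptions: it does not distinguish keywords from identifiers, it hard-codes the patterns `[a-z]+`, `[a-z]+\n`, and `.|\n`, and the function name `count_matches` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts how many rule matches (and therefore action invocations) are needed
// to consume `input`.
//   merge_newline == false: "[a-z]+" and ".|\n" are separate rules.
//   merge_newline == true:  "[a-z]+\n" is tried first, with "[a-z]+" and
//                           ".|\n" as fallbacks.
std::size_t count_matches(const std::string& input, bool merge_newline) {
    std::size_t pos = 0, matches = 0;
    while (pos < input.size()) {
        std::size_t end = pos;
        // Longest run of lower-case letters starting at pos.
        while (end < input.size() && input[end] >= 'a' && input[end] <= 'z')
            ++end;
        if (end == pos) {
            ++end;  // ".|\n": any single character
        } else if (merge_newline && end < input.size() && input[end] == '\n') {
            ++end;  // "[a-z]+\n": swallow the newline in the same match
        }
        pos = end;
        ++matches;
    }
    return matches;
}
```

For the three-line input `auto\nfoo\nwhile\n`, the model needs six matches with separate newline handling and three once the newline is merged into the word rules.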
1.16 jmc 3400: .Pp
1.1 deraadt 3401: A final note:
1.16 jmc 3402: .Nm
3403: is slow when matching NUL's,
3404: particularly when a token contains multiple NUL's.
3405: It's best to write rules which match short
1.1 deraadt 3406: amounts of text if it's anticipated that the text will often include NUL's.
1.16 jmc 3407: .Pp
1.1 deraadt 3408: Another final note regarding performance: as mentioned above in the section
1.16 jmc 3409: .Sx HOW THE INPUT IS MATCHED ,
3410: dynamically resizing
3411: .Fa yytext
1.1 deraadt 3412: to accommodate huge tokens is a slow process because it presently requires that
1.16 jmc 3413: the
3414: .Pq huge
3415: token be rescanned from the beginning.
3416: Thus if performance is vital, it is better to attempt to match
3417: .Qq large
3418: quantities of text but not
3419: .Qq huge
3420: quantities, where the cutoff between the two is at about 8K characters/token.
3421: .Sh GENERATING C++ SCANNERS
3422: .Nm
3423: provides two different ways to generate scanners for use with C++.
3424: The first way is to simply compile a scanner generated by
3425: .Nm
3426: using a C++ compiler instead of a C compiler.
3427: This should not generate any compilation errors
3428: (please report any found to the email address given in the
3429: .Sx AUTHORS
3430: section below).
3431: C++ code can then be used in rule actions instead of C code.
3432: Note that the default input source for scanners remains
3433: .Fa yyin ,
1.1 deraadt 3434: and default echoing is still done to
1.16 jmc 3435: .Fa yyout .
1.1 deraadt 3436: Both of these remain
1.16 jmc 3437: .Fa FILE *
3438: variables and not C++ streams.
3439: .Pp
3440: .Nm
3441: can also be used to generate a C++ scanner class, using the
3442: .Fl +
1.1 deraadt 3443: option (or, equivalently,
1.16 jmc 3444: .Dq %option c++ ) ,
3445: which is automatically specified if the name of the flex executable ends in a
3446: .Sq + ,
3447: such as
3448: .Nm flex++ .
3449: When using this option,
3450: .Nm
3451: defaults to generating the scanner to the file
3452: .Pa lex.yy.cc
1.1 deraadt 3453: instead of
1.16 jmc 3454: .Pa lex.yy.c .
1.1 deraadt 3455: The generated scanner includes the header file
1.16 jmc 3456: .Aq Pa g++/FlexLexer.h ,
1.1 deraadt 3457: which defines the interface to two C++ classes.
1.16 jmc 3458: .Pp
1.1 deraadt 3459: The first class,
1.16 jmc 3460: .Em FlexLexer ,
3461: provides an abstract base class defining the general scanner class interface.
3462: It provides the following member functions:
3463: .Bl -tag -width Ds
3464: .It const char* YYText()
3465: Returns the text of the most recently matched token, the equivalent of
3466: .Fa yytext .
3467: .It int YYLeng()
3468: Returns the length of the most recently matched token, the equivalent of
3469: .Fa yyleng .
3470: .It int lineno() const
3471: Returns the current input line number
1.1 deraadt 3472: (see
1.16 jmc 3473: .Dq %option yylineno ) ,
3474: or 1 if
3475: .Dq %option yylineno
1.1 deraadt 3476: was not used.
1.16 jmc 3477: .It void set_debug(int flag)
3478: Sets the debugging flag for the scanner, equivalent to assigning to
3479: .Fa yy_flex_debug
3480: (see the
3481: .Sx OPTIONS
3482: section above).
3483: Note that the scanner must be built using
3484: .Dq %option debug
1.1 deraadt 3485: to include debugging information in it.
1.16 jmc 3486: .It int debug() const
3487: Returns the current setting of the debugging flag.
3488: .El
3489: .Pp
1.1 deraadt 3490: Also provided are member functions equivalent to
1.16 jmc 3491: .Fn yy_switch_to_buffer ,
3492: .Fn yy_create_buffer
1.1 deraadt 3493: (though the first argument is an
1.18 espie 3494: .Fa std::istream*
1.1 deraadt 3495: object pointer and not a
1.16 jmc 3496: .Fa FILE* ) ,
3497: .Fn yy_flush_buffer ,
3498: .Fn yy_delete_buffer ,
1.1 deraadt 3499: and
1.16 jmc 3500: .Fn yyrestart
1.10 deraadt 3501: (again, the first argument is an
1.18 espie 3502: .Fa std::istream*
1.1 deraadt 3503: object pointer).
1.16 jmc 3504: .Pp
1.1 deraadt 3505: The second class defined in
1.16 jmc 3506: .Aq Pa g++/FlexLexer.h
1.1 deraadt 3507: is
1.16 jmc 3508: .Fa yyFlexLexer ,
1.1 deraadt 3509: which is derived from
1.16 jmc 3510: .Fa FlexLexer .
1.1 deraadt 3511: It defines the following additional member functions:
1.16 jmc 3512: .Bl -tag -width Ds
1.18 espie 3513: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16 jmc 3514: Constructs a
3515: .Fa yyFlexLexer
3516: object using the given streams for input and output.
3517: If not specified, the streams default to
3518: .Fa cin
1.1 deraadt 3519: and
1.16 jmc 3520: .Fa cout ,
1.1 deraadt 3521: respectively.
1.16 jmc 3522: .It virtual int yylex()
3523: Performs the same role as
3524: .Fn yylex
1.1 deraadt 3525: does for ordinary flex scanners: it scans the input stream, consuming
1.16 jmc 3526: tokens, until a rule's action returns a value.
3527: If subclass
3528: .Sq S
3529: is derived from
3530: .Fa yyFlexLexer ,
3531: in order to access the member functions and variables of
3532: .Sq S
1.1 deraadt 3533: inside
1.16 jmc 3534: .Fn yylex ,
3535: use
3536: .Dq %option yyclass="S"
1.1 deraadt 3537: to inform
1.16 jmc 3538: .Nm
3539: that the
3540: .Sq S
3541: subclass will be used instead of
3542: .Fa yyFlexLexer .
1.1 deraadt 3543: In this case, rather than generating
1.16 jmc 3544: .Dq yyFlexLexer::yylex() ,
3545: .Nm
1.1 deraadt 3546: generates
1.16 jmc 3547: .Dq S::yylex()
1.1 deraadt 3548: (and also generates a dummy
1.16 jmc 3549: .Dq yyFlexLexer::yylex()
1.1 deraadt 3550: that calls
1.16 jmc 3551: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3552: if called).
1.18 espie 3553: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16 jmc 3554: Reassigns
3555: .Fa yyin
1.1 deraadt 3556: to
1.16 jmc 3557: .Fa new_in
3558: .Pq if non-nil
1.1 deraadt 3559: and
1.16 jmc 3560: .Fa yyout
1.1 deraadt 3561: to
1.16 jmc 3562: .Fa new_out
3563: .Pq ditto ,
3564: deleting the previous input buffer if
3565: .Fa yyin
1.1 deraadt 3566: is reassigned.
1.18 espie 3567: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16 jmc 3568: First switches the input streams via
3569: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3570: and then returns the value of
1.16 jmc 3571: .Fn yylex .
3572: .El
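The effect of `%option yyclass="S"` on virtual dispatch can be sketched with stand-in classes. This is an assumption-laden model, not the contents of FlexLexer.h: `ModelFlexLexer` is a hypothetical substitute for `yyFlexLexer`, the body of `S::yylex()` is a placeholder for what flex would generate, and `LexerError()` throws here (rather than writing to `cerr` and exiting) so that the behaviour can be observed.

```cpp
#include <stdexcept>

// Hypothetical stand-in for yyFlexLexer; NOT the real class.
class ModelFlexLexer {
public:
    virtual ~ModelFlexLexer() {}
    // Models the dummy yylex() that only reports an error if it is reached.
    virtual int yylex() {
        LexerError("yylex() called on the base class");
        return 0;
    }
protected:
    // Model only: throws instead of printing and exiting.
    virtual void LexerError(const char* msg) {
        throw std::runtime_error(msg);
    }
};

// Models the subclass named by %option yyclass="S".
class S : public ModelFlexLexer {
public:
    S() : tokens_seen(0), tokens_left(3) {}
    // Placeholder for the generated S::yylex(); it can use S's own members.
    int yylex() override {
        if (tokens_left == 0)
            return 0;  // end of input
        --tokens_left;
        ++tokens_seen;
        return 1;      // some token code
    }
    int tokens_seen;
private:
    int tokens_left;
};

// Drives any lexer through the base-class interface until it returns 0.
int drain(ModelFlexLexer& lexer) {
    int count = 0;
    while (lexer.yylex() != 0)
        ++count;
    return count;
}
```

Calling `drain()` on an `S` object reaches `S::yylex()` through the virtual call, while calling `yylex()` on a plain `ModelFlexLexer` ends up in the error path.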
3573: .Pp
1.1 deraadt 3574: In addition,
1.16 jmc 3575: .Fa yyFlexLexer
3576: defines the following protected virtual functions which can be redefined
1.1 deraadt 3577: in derived classes to tailor the scanner:
1.16 jmc 3578: .Bl -tag -width Ds
3579: .It virtual int LexerInput(char* buf, int max_size)
3580: Reads up to
3581: .Fa max_size
1.1 deraadt 3582: characters into
1.16 jmc 3583: .Fa buf
3584: and returns the number of characters read.
3585: To indicate end-of-input, return 0 characters.
3586: Note that
3587: .Qq interactive
3588: scanners (see the
3589: .Fl B
1.1 deraadt 3590: and
1.16 jmc 3591: .Fl I
1.1 deraadt 3592: flags) define the macro
1.16 jmc 3593: .Dv YY_INTERACTIVE .
3594: If
3595: .Fn LexerInput
3596: has been redefined, and it's necessary to take different actions depending on
3597: whether or not the scanner might be scanning an interactive input source,
3598: it's possible to test for the presence of this name via
3599: .Dq #ifdef .
3600: .It virtual void LexerOutput(const char* buf, int size)
3601: Writes out
3602: .Fa size
1.1 deraadt 3603: characters from the buffer
1.16 jmc 3604: .Fa buf ,
3605: which, while NUL-terminated, may also contain
3606: .Qq internal
3607: NUL's if the scanner's rules can match text with NUL's in them.
3608: .It virtual void LexerError(const char* msg)
3609: Reports a fatal error message.
3610: The default version of this function writes the message to the stream
3611: .Fa cerr
1.1 deraadt 3612: and exits.
1.16 jmc 3613: .El
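The `LexerInput()` contract described above (fill `buf` with at most `max_size` characters, return the count, return 0 at end of input) can be illustrated with a standalone class that reads from a string. The class `StringInput` is hypothetical; in a real scanner the same logic would live in a class derived from `yyFlexLexer` that overrides `LexerInput()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in that follows the LexerInput() contract without
// depending on FlexLexer.h.
class StringInput {
public:
    explicit StringInput(const std::string& text) : text_(text), offset_(0) {}

    // Copies up to max_size characters into buf and returns how many were
    // copied; a return value of 0 signals end of input.
    int LexerInput(char* buf, int max_size) {
        if (max_size <= 0)
            return 0;
        std::size_t remaining = text_.size() - offset_;
        std::size_t n = std::min(remaining,
                                 static_cast<std::size_t>(max_size));
        std::memcpy(buf, text_.data() + offset_, n);
        offset_ += n;
        return static_cast<int>(n);
    }

private:
    std::string text_;
    std::size_t offset_;  // position of the next unread character
};
```

Repeated calls hand out the text in chunks no larger than `max_size` and then return 0 once the string is exhausted.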
3614: .Pp
1.1 deraadt 3615: Note that a
1.16 jmc 3616: .Fa yyFlexLexer
3617: object contains its entire scanning state.
3618: Thus such objects can be used to create reentrant scanners.
3619: Multiple instances of the same
3620: .Fa yyFlexLexer
3621: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3622: in the same program using the
1.16 jmc 3623: .Fl P
1.1 deraadt 3624: option discussed above.
1.16 jmc 3625: .Pp
1.1 deraadt 3626: Finally, note that the
1.16 jmc 3627: .Dq %array
3628: feature is not available to C++ scanner classes;
3629: .Dq %pointer
3630: must be used
3631: .Pq the default .
3632: .Pp
1.1 deraadt 3633: Here is an example of a simple C++ scanner:
1.16 jmc 3634: .Bd -literal -offset indent
3635: // An example of using the flex C++ scanner class.
1.1 deraadt 3636:
1.16 jmc 3637: %{
3638: #include <errno.h>
3639: int mylineno = 0;
3640: %}
1.1 deraadt 3641:
1.16 jmc 3642: string \e"[^\en"]+\e"
1.1 deraadt 3643:
1.16 jmc 3644: ws [ \et]+
1.1 deraadt 3645:
1.16 jmc 3646: alpha [A-Za-z]
3647: dig [0-9]
3648: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
3649: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
3650: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
3651: number {num1}|{num2}
1.1 deraadt 3652:
1.16 jmc 3653: %%
1.1 deraadt 3654:
1.16 jmc 3655: {ws} /* skip blanks and tabs */
1.1 deraadt 3656:
1.16 jmc 3657: "/*" {
3658: int c;
1.1 deraadt 3659:
1.16 jmc 3660: while ((c = yyinput()) != 0) {
3661: if(c == '\en')
1.1 deraadt 3662: ++mylineno;
1.16 jmc 3663: else if(c == '*') {
3664: if ((c = yyinput()) == '/')
1.1 deraadt 3665: break;
3666: else
3667: unput(c);
3668: }
1.16 jmc 3669: }
3670: }
1.1 deraadt 3671:
1.16 jmc 3672: {number} std::cout << "number " << YYText() << '\en';
1.1 deraadt 3673:
1.16 jmc 3674: \en mylineno++;
1.1 deraadt 3675:
1.16 jmc 3676: {name} std::cout << "name " << YYText() << '\en';
1.1 deraadt 3677:
1.16 jmc 3678: {string} std::cout << "string " << YYText() << '\en';
3679:
3680: %%
3681:
3682: int main(int /* argc */, char** /* argv */)
3683: {
3684: FlexLexer* lexer = new yyFlexLexer;
3685: while(lexer->yylex() != 0)
3686: ;
3687: return 0;
3688: }
3689: .Ed
3690: .Pp
3691: To create multiple
3692: .Pq different
3693: lexer classes, use the
3694: .Fl P
3695: flag
3696: (or the
3697: .Dq prefix=
3698: option)
3699: to rename each
3700: .Fa yyFlexLexer
1.1 deraadt 3701: to some other
1.16 jmc 3702: .Fa xxFlexLexer .
3703: .Aq Pa g++/FlexLexer.h
3704: can then be included in other sources once per lexer class, first renaming
3705: .Fa yyFlexLexer
1.1 deraadt 3706: as follows:
1.16 jmc 3707: .Bd -literal -offset indent
3708: #undef yyFlexLexer
3709: #define yyFlexLexer xxFlexLexer
3710: #include <g++/FlexLexer.h>
3711:
3712: #undef yyFlexLexer
3713: #define yyFlexLexer zzFlexLexer
3714: #include <g++/FlexLexer.h>
3715: .Ed
3716: .Pp
3717: This assumes, for example, that
3718: .Dq %option prefix="xx"
3719: is used for one scanner and
3720: .Dq %option prefix="zz"
3721: is used for the other.
3722: .Pp
3723: .Sy IMPORTANT :
3724: the present form of the scanning class is experimental
1.7 aaron 3725: and may change considerably between major releases.
1.16 jmc 3726: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
3727: .Nm
1.25 sobrado 3728: is a rewrite of the
3729: .At
1.16 jmc 3730: .Nm lex
3731: tool
3732: (the two implementations do not share any code, though),
3733: with some extensions and incompatibilities, both of which are of concern
3734: to those who wish to write scanners acceptable to either implementation.
3735: .Nm
3736: is fully compliant with the
3737: .Tn POSIX
3738: .Nm lex
1.1 deraadt 3739: specification, except that when using
1.16 jmc 3740: .Dq %pointer
3741: .Pq the default ,
3742: a call to
3743: .Fn unput
1.1 deraadt 3744: destroys the contents of
1.16 jmc 3745: .Fa yytext ,
3746: which is counter to the
3747: .Tn POSIX
3748: specification.
3749: .Pp
3750: In this section we discuss all of the known areas of incompatibility between
3751: .Nm ,
3752: AT&T
3753: .Nm lex ,
3754: and the
3755: .Tn POSIX
3756: specification.
3757: .Pp
3758: .Nm flex Ns 's
3759: .Fl l
1.1 deraadt 3760: option turns on maximum compatibility with the original AT&T
1.16 jmc 3761: .Nm lex
1.1 deraadt 3762: implementation, at the cost of a major loss in the generated scanner's
1.16 jmc 3763: performance.
3764: We note below which incompatibilities can be overcome using the
3765: .Fl l
1.1 deraadt 3766: option.
1.16 jmc 3767: .Pp
3768: .Nm
1.1 deraadt 3769: is fully compatible with
1.16 jmc 3770: .Nm lex
1.1 deraadt 3771: with the following exceptions:
1.16 jmc 3772: .Bl -dash
3773: .It
1.1 deraadt 3774: The undocumented
1.16 jmc 3775: .Nm lex
1.1 deraadt 3776: scanner internal variable
1.16 jmc 3777: .Fa yylineno
1.1 deraadt 3778: is not supported unless
1.16 jmc 3779: .Fl l
1.1 deraadt 3780: or
1.16 jmc 3781: .Dq %option yylineno
1.1 deraadt 3782: is used.
1.16 jmc 3783: .Pp
3784: .Fa yylineno
1.1 deraadt 3785: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 jmc 3786: .Pq single global variable
3787: basis.
3788: .Pp
3789: .Fa yylineno
3790: is not part of the
3791: .Tn POSIX
3792: specification.
3793: .It
1.1 deraadt 3794: The
1.16 jmc 3795: .Fn input
1.1 deraadt 3796: routine is not redefinable, though it may be called to read characters
1.16 jmc 3797: following whatever has been matched by a rule.
3798: If
3799: .Fn input
3800: encounters an end-of-file, the normal
3801: .Fn yywrap
3802: processing is done.
3803: A
3804: .Dq real
3805: end-of-file is returned by
3806: .Fn input
1.1 deraadt 3807: as
1.16 jmc 3808: .Dv EOF .
3809: .Pp
1.1 deraadt 3810: Input is instead controlled by defining the
1.16 jmc 3811: .Dv YY_INPUT
1.1 deraadt 3812: macro.
1.16 jmc 3813: .Pp
1.1 deraadt 3814: The
1.16 jmc 3815: .Nm
1.1 deraadt 3816: restriction that
1.16 jmc 3817: .Fn input
3818: cannot be redefined is in accordance with the
3819: .Tn POSIX
3820: specification, which simply does not specify any way of controlling the
1.1 deraadt 3821: scanner's input other than by making an initial assignment to
1.16 jmc 3822: .Fa yyin .
3823: .It
1.1 deraadt 3824: The
1.16 jmc 3825: .Fn unput
3826: routine is not redefinable.
3827: This restriction is in accordance with
3828: .Tn POSIX .
3829: .It
3830: .Nm
1.1 deraadt 3831: scanners are not as reentrant as
1.16 jmc 3832: .Nm lex
3833: scanners.
3834: In particular, if a scanner is interactive and
3835: an interrupt handler long-jumps out of the scanner,
3836: and the scanner is subsequently called again,
3837: the following error message may be displayed:
3838: .Pp
3839: .D1 fatal flex scanner internal error--end of buffer missed
3840: .Pp
1.1 deraadt 3841: To reenter the scanner, first use
1.16 jmc 3842: .Pp
3843: .Dl yyrestart(yyin);
3844: .Pp
3845: Note that this call will throw away any buffered input;
3846: usually this isn't a problem with an interactive scanner.
3847: .Pp
3848: Also note that flex C++ scanner classes are reentrant,
3849: so if using C++ is an option, they should be used instead.
3850: See
3851: .Sx GENERATING C++ SCANNERS
3852: above for details.
3853: .It
3854: .Fn output
1.1 deraadt 3855: is not supported.
3856: Output from the
1.16 jmc 3857: .Em ECHO
1.1 deraadt 3858: macro is done to the file-pointer
1.16 jmc 3859: .Fa yyout
3860: .Pq default stdout .
3861: .Pp
3862: .Fn output
3863: is not part of the
3864: .Tn POSIX
3865: specification.
3866: .It
3867: .Nm lex
3868: does not support exclusive start conditions
3869: .Pq %x ,
3870: though they are in the
3871: .Tn POSIX
3872: specification.
3873: .It
1.1 deraadt 3874: When definitions are expanded,
1.16 jmc 3875: .Nm
1.1 deraadt 3876: encloses them in parentheses.
1.16 jmc 3877: With
3878: .Nm lex ,
3879: the following:
3880: .Bd -literal -offset indent
3881: NAME [A-Z][A-Z0-9]*
3882: %%
3883: foo{NAME}? printf("Found it\en");
3884: %%
3885: .Ed
3886: .Pp
3887: will not match the string
3888: .Qq foo
3889: because when the macro is expanded the rule is equivalent to
3890: .Qq foo[A-Z][A-Z0-9]*?
3891: and the precedence is such that the
3892: .Sq ?\&
3893: is associated with
3894: .Qq [A-Z0-9]* .
3895: With
3896: .Nm ,
1.1 deraadt 3897: the rule will be expanded to
1.16 jmc 3898: .Qq foo([A-Z][A-Z0-9]*)?
3899: and so the string
3900: .Qq foo
3901: will match.
3902: .Pp
1.1 deraadt 3903: Note that if the definition begins with
1.16 jmc 3904: .Sq ^
1.1 deraadt 3905: or ends with
1.16 jmc 3906: .Sq $
3907: then it is not expanded with parentheses, to allow these operators to appear in
3908: definitions without losing their special meanings.
3909: But the
3910: .Sq Aq s ,
3911: .Sq / ,
1.1 deraadt 3912: and
1.16 jmc 3913: .Aq Aq EOF
1.1 deraadt 3914: operators cannot be used in a
1.16 jmc 3915: .Nm
1.1 deraadt 3916: definition.
1.16 jmc 3917: .Pp
1.1 deraadt 3918: Using
1.16 jmc 3919: .Fl l
1.1 deraadt 3920: results in the
1.16 jmc 3921: .Nm lex
1.1 deraadt 3922: behavior of no parentheses around the definition.
1.16 jmc 3923: .Pp
3924: The
3925: .Tn POSIX
3926: specification is that the definition be enclosed in parentheses.
3927: .It
1.1 deraadt 3928: Some implementations of
1.16 jmc 3929: .Nm lex
3930: allow a rule's action to begin on a separate line,
3931: if the rule's pattern has trailing whitespace:
3932: .Bd -literal -offset indent
3933: %%
3934: foo|bar<space here>
3935: { foobar_action(); }
3936: .Ed
3937: .Pp
3938: .Nm
1.1 deraadt 3939: does not support this feature.
1.16 jmc 3940: .It
1.1 deraadt 3941: The
1.16 jmc 3942: .Nm lex
3943: .Sq %r
3944: .Pq generate a Ratfor scanner
3945: option is not supported.
3946: It is not part of the
3947: .Tn POSIX
3948: specification.
3949: .It
1.1 deraadt 3950: After a call to
1.16 jmc 3951: .Fn unput ,
3952: .Fa yytext
3953: is undefined until the next token is matched,
3954: unless the scanner was built using
3955: .Dq %array .
1.1 deraadt 3956: This is not the case with
1.16 jmc 3957: .Nm lex
3958: or the
3959: .Tn POSIX
3960: specification.
3961: The
3962: .Fl l
1.1 deraadt 3963: option does away with this incompatibility.
1.16 jmc 3964: .It
1.1 deraadt 3965: The precedence of the
1.16 jmc 3966: .Sq {}
3967: .Pq numeric range
3968: operator is different.
3969: .Nm lex
3970: interprets
3971: .Qq abc{1,3}
3972: as match one, two, or three occurrences of
3973: .Sq abc ,
3974: whereas
3975: .Nm
3976: interprets it as match
3977: .Sq ab
3978: followed by one, two, or three occurrences of
3979: .Sq c .
3980: The latter is in agreement with the
3981: .Tn POSIX
3982: specification.
3983: .It
1.1 deraadt 3984: The precedence of the
1.16 jmc 3985: .Sq ^
1.1 deraadt 3986: operator is different.
1.16 jmc 3987: .Nm lex
3988: interprets
3989: .Qq ^foo|bar
3990: as match either
3991: .Sq foo
3992: at the beginning of a line, or
3993: .Sq bar
3994: anywhere, whereas
3995: .Nm
3996: interprets it as match either
3997: .Sq foo
3998: or
3999: .Sq bar
4000: if they come at the beginning of a line.
4001: The latter is in agreement with the
4002: .Tn POSIX
4003: specification.
4004: .It
1.1 deraadt 4005: The special table-size declarations such as
1.16 jmc 4006: .Sq %a
1.1 deraadt 4007: supported by
1.16 jmc 4008: .Nm lex
1.1 deraadt 4009: are not required by
1.16 jmc 4010: .Nm
1.1 deraadt 4011: scanners;
1.16 jmc 4012: .Nm
1.1 deraadt 4013: ignores them.
1.16 jmc 4014: .It
1.1 deraadt 4015: The name
1.16 jmc 4016: .Dv FLEX_SCANNER
1.1 deraadt 4017: is #define'd so scanners may be written for use with either
1.16 jmc 4018: .Nm
1.1 deraadt 4019: or
1.16 jmc 4020: .Nm lex .
1.1 deraadt 4021: Scanners also include
1.16 jmc 4022: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4023: and
1.16 jmc 4024: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4025: indicating which version of
1.16 jmc 4026: .Nm
1.1 deraadt 4027: generated the scanner
1.16 jmc 4028: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4029: respectively).
1.16 jmc 4030: .El
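Three of the precedence differences listed above (parenthesized definitions, the `{}` range operator, and the `^` operator) can be reproduced with `std::regex` by writing out each interpretation explicitly. This is a sketch only: the patterns use ECMAScript syntax rather than lex syntax, the grouping has been added by hand to mimic each tool's reading as described in the list, and the constant and function names are hypothetical.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern`.
bool whole_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// True if `pattern` matches somewhere in `text`.
bool found_in(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

// foo{NAME}? with NAME defined as [A-Z][A-Z0-9]*
// lex reading: textual expansion, so '?' applies only to [A-Z0-9]*.
const char* const LEX_DEFINITION  = "foo[A-Z](?:[A-Z0-9]*)?";
// flex reading: the definition is parenthesized before '?' is applied.
const char* const FLEX_DEFINITION = "foo(?:[A-Z][A-Z0-9]*)?";

// abc{1,3}
const char* const LEX_RANGE  = "(?:abc){1,3}";  // repeat the whole "abc"
const char* const FLEX_RANGE = "abc{1,3}";      // repeat only the 'c'

// ^foo|bar
const char* const LEX_ANCHOR  = "(?:^foo)|bar";  // '^' binds to foo only
const char* const FLEX_ANCHOR = "^(?:foo|bar)";  // '^' applies to both
```

Under the lex reading `foo` alone does not match `foo{NAME}?`, while under the flex reading it does. Likewise `abcabc` only matches the lex reading of `abc{1,3}`, `abcc` only the flex reading, and a `bar` in the middle of a line is only found by the lex reading of `^foo|bar`.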
4031: .Pp
1.1 deraadt 4032: The following
1.16 jmc 4033: .Nm
1.1 deraadt 4034: features are not included in
1.16 jmc 4035: .Nm lex
4036: or the
4037: .Tn POSIX
4038: specification:
4039: .Bd -unfilled -offset indent
4040: C++ scanners
4041: %option
4042: start condition scopes
4043: start condition stacks
4044: interactive/non-interactive scanners
4045: yy_scan_string() and friends
4046: yyterminate()
4047: yy_set_interactive()
4048: yy_set_bol()
4049: YY_AT_BOL()
4050: <<EOF>>
4051: <*>
4052: YY_DECL
4053: YY_START
4054: YY_USER_ACTION
4055: YY_USER_INIT
4056: #line directives
4057: %{}'s around actions
4058: multiple actions on a line
4059: .Ed
4060: .Pp
4061: plus almost all of the
4062: .Nm
4063: flags.
1.1 deraadt 4064: The last feature in the list refers to the fact that with
1.16 jmc 4065: .Nm
4066: multiple actions can be placed on the same line,
4067: separated with semicolons, while with
4068: .Nm lex ,
1.1 deraadt 4069: the following
1.16 jmc 4070: .Pp
4071: .Dl foo handle_foo(); ++num_foos_seen;
4072: .Pp
4073: is
4074: .Pq rather surprisingly
4075: truncated to
4076: .Pp
4077: .Dl foo handle_foo();
4078: .Pp
4079: .Nm
4080: does not truncate the action.
4081: Actions that are not enclosed in braces
4082: are simply terminated at the end of the line.
4083: .Sh FILES
4084: .Bl -tag -width "<g++/FlexLexer.h>"
4085: .It flex.skl
4086: Skeleton scanner.
4087: This file is only used when building flex, not when
4088: .Nm
4089: executes.
4090: .It lex.backup
4091: Backing-up information for the
4092: .Fl b
4093: flag (called
4094: .Pa lex.bck
4095: on some systems).
4096: .It lex.yy.c
4097: Generated scanner
4098: (called
4099: .Pa lexyy.c
4100: on some systems).
4101: .It lex.yy.cc
4102: Generated C++ scanner class, when using
4103: .Fl + .
4104: .It Aq g++/FlexLexer.h
4105: Header file defining the C++ scanner base class,
4106: .Fa FlexLexer ,
4107: and its derived class,
4108: .Fa yyFlexLexer .
4109: .It /usr/lib/libl.*
4110: .Nm
4111: libraries.
4112: The
4113: .Pa /usr/lib/libfl.*\&
4114: libraries are links to these.
4115: Scanners must be linked using either
4116: .Fl \&ll
4117: or
4118: .Fl lfl .
4119: .El
1.29 jmc 4120: .Sh EXIT STATUS
4121: .Ex -std flex
1.16 jmc 4122: .Sh DIAGNOSTICS
4123: .Bl -diag
4124: .It warning, rule cannot be matched
4125: Indicates that the given rule cannot be matched because it follows other rules
4126: that will always match the same text as it.
4127: For example, in the following
4128: .Dq foo
4129: cannot be matched because it comes after an identifier
4130: .Qq catch-all
4131: rule:
4132: .Bd -literal -offset indent
4133: [a-z]+ got_identifier();
4134: foo got_foo();
4135: .Ed
4136: .Pp
1.1 deraadt 4137: Using
1.16 jmc 4138: .Em REJECT
1.1 deraadt 4139: in a scanner suppresses this warning.
1.16 jmc 4140: .It "warning, \-s option given but default rule can be matched"
4141: Means that it is possible
4142: .Pq perhaps only in a particular start condition
4143: that the default rule
4144: .Pq match any single character
4145: is the only one that will match a particular input.
4146: Since
4147: .Fl s
1.1 deraadt 4148: was given, presumably this is not intended.
1.16 jmc 4149: .It reject_used_but_not_detected undefined
4150: .It yymore_used_but_not_detected undefined
4151: These errors can occur at compile time.
4152: They indicate that the scanner uses
4153: .Em REJECT
1.1 deraadt 4154: or
1.16 jmc 4155: .Fn yymore
1.1 deraadt 4156: but that
1.16 jmc 4157: .Nm
1.1 deraadt 4158: failed to notice the fact, meaning that
1.16 jmc 4159: .Nm
1.1 deraadt 4160: scanned the first two sections looking for occurrences of these actions
1.16 jmc 4161: and failed to find any, but somehow they snuck in
4162: .Pq via an #include file, for example .
4163: Use
4164: .Dq %option reject
4165: or
4166: .Dq %option yymore
4167: to indicate to
4168: .Nm
4169: that these features are really needed.
4170: .It flex scanner jammed
4171: A scanner compiled with
4172: .Fl s
4173: has encountered an input string which wasn't matched by any of its rules.
4174: This error can also occur due to internal problems.
4175: .It token too large, exceeds YYLMAX
4176: The scanner uses
4177: .Dq %array
1.1 deraadt 4178: and one of its rules matched a string longer than the
1.16 jmc 4179: .Dv YYLMAX
4180: constant
4181: .Pq 8K bytes by default .
4182: The value can be increased by #define'ing
4183: .Dv YYLMAX
4184: in the definitions section of
4185: .Nm
1.1 deraadt 4186: input.
1.16 jmc 4187: .It "scanner requires \-8 flag to use the character 'x'"
4188: The scanner specification includes recognizing the 8-bit character
4189: .Sq x
4190: and the
4191: .Fl 8
4192: flag was not specified, and defaulted to 7-bit because the
4193: .Fl Cf
4194: or
4195: .Fl CF
4196: table compression options were used.
4197: See the discussion of the
4198: .Fl 7
1.1 deraadt 4199: flag for details.
1.16 jmc 4200: .It flex scanner push-back overflow
4201: unput() was used to push back so much text that the scanner's buffer
4202: could not hold both the pushed-back text and the current token in
4203: .Fa yytext .
4204: Ideally the scanner should dynamically resize the buffer in this case,
4205: but at present it does not.
4206: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4207: The scanner was working on matching an extremely large token and needed
4208: to expand the input buffer.
4209: This doesn't work with scanners that use
4210: .Em REJECT .
4211: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4212: This can occur in a scanner which is reentered after a long-jump
1.16 jmc 4213: has jumped out
4214: .Pq or over
4215: the scanner's activation frame.
4216: Before reentering the scanner, use:
4217: .Pp
4218: .Dl yyrestart(yyin);
4219: .Pp
1.1 deraadt 4220: or, as noted above, switch to using the C++ scanner class.
1.16 jmc 4221: .It "too many start conditions in <> construct!"
4222: More start conditions than exist were listed in a <> construct
4223: (so at least one of them must have been listed twice).
4224: .El
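.Pp
Two of the remedies described in the list above are made in the
definitions section of the input.
The following is only a sketch: it raises
.Dv YYLMAX
for a scanner that uses
.Dq %array
and declares
.Dq %option yymore
so that
.Nm
knows the feature is needed.
The value 16384, the single rule, and the
.Fn main
function are arbitrary assumptions chosen for illustration.

```lex
%{
#include <stdio.h>

/*
 * Raise the %array token limit above the 8K default.
 * 16384 is an arbitrary value chosen for illustration.
 */
#define YYLMAX 16384
%}
%array
%option yymore
%%
[a-z]+    printf("word: %s\n", yytext);
(.|\n)    ;
%%
int
main(void)
{
    yylex();
    return 0;
}
```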
4225: .Sh SEE ALSO
4226: .Xr awk 1 ,
4227: .Xr sed 1 ,
4228: .Xr yacc 1
4229: .Rs
4230: .%A John Levine
4231: .%A Tony Mason
4232: .%A Doug Brown
4233: .%B Lex & Yacc
4234: .%I O'Reilly and Associates
4235: .%N 2nd edition
4236: .Re
4237: .Rs
4238: .%A Alfred Aho
4239: .%A Ravi Sethi
4240: .%A Jeffrey Ullman
4241: .%B Compilers: Principles, Techniques and Tools
4242: .%I Addison-Wesley
4243: .%D 1986
4244: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4245: .Re
1.23 jmc 4246: .Sh STANDARDS
4247: The
4248: .Nm lex
4249: utility is compliant with the
4250: .St -p1003.1-2008
4251: specification,
4252: though its presence is optional.
4253: .Pp
4254: The flags
1.31 jmc 4255: .Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
1.23 jmc 4256: .Op Fl -help ,
4257: and
4258: .Op Fl -version
4259: are extensions to that specification.
1.16 jmc 4260: .Sh AUTHORS
1.1 deraadt 4261: Vern Paxson, with the help of many ideas and much inspiration from
1.16 jmc 4262: Van Jacobson.
4263: Original version by Jef Poskanzer.
4264: The fast table representation is a partial implementation of a design done by
4265: Van Jacobson.
4266: The implementation was done by Kevin Gong and Vern Paxson.
4267: .Pp
1.1 deraadt 4268: Thanks to the many
1.16 jmc 4269: .Nm
1.1 deraadt 4270: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4271: Casey Leedom,
4272: Robert Abramovitz,
4273: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4274: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4275: Karl Berry, Peter A. Bigot, Simon Blanchard,
4276: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4277: Brian Clapper, J.T. Conklin,
4278: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4279: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4280: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4281: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4282: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4283: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4284: Jan Hajic, Charles Hemphill, NORO Hideo,
4285: Jarkko Hietaniemi, Scott Hofmann,
4286: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4287: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4288: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4289: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4290: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4291: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4292: David Loffredo, Mike Long,
4293: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4294: Bengt Martensson, Chris Metcalf,
4295: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4296: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4297: Richard Ohnemus, Karsten Pahnke,
1.16 jmc 4298: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4299: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4300: Frederic Raimbault, Pat Rankin, Rick Richardson,
4301: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4302: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4303: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4304: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
4305: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4306: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 jmc 4307: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4308: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4309: and those whose names have slipped my marginal mail-archiving skills
4310: but whose contributions are appreciated all the
1.1 deraadt 4311: same.
1.16 jmc 4312: .Pp
1.1 deraadt 4313: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4314: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4315: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4316: distribution headaches.
1.16 jmc 4317: .Pp
4318: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4319: to Benson Margulies and Fred Burke for C++ support;
4320: to Kent Williams and Tom Epperly for C++ class support;
4321: to Ove Ewerlid for support of NUL's;
4322: and to Eric Hughes for support of multiple buffers.
4323: .Pp
1.1 deraadt 4324: This work was primarily done when I was with the Real Time Systems Group
1.16 jmc 4325: at the Lawrence Berkeley Laboratory in Berkeley, CA.
4326: Many thanks to all there for the support I received.
4327: .Pp
4328: Send comments to
1.34 ! schwarze 4329: .Aq Mt vern@ee.lbl.gov .
1.16 jmc 4330: .Sh BUGS
4331: Some trailing context patterns cannot be properly matched and generate
4332: warning messages
4333: .Pq "dangerous trailing context" .
4334: These are patterns where the ending of the first part of the rule
4335: matches the beginning of the second part, such as
4336: .Qq zx*/xy* ,
4337: where the
4338: .Sq x*
4339: matches the
4340: .Sq x
4341: at the beginning of the trailing context.
4342: (Note that the POSIX draft states that the text matched by such patterns
4343: is undefined.)
4344: .Pp
4345: For some trailing context rules, parts which are actually fixed-length are
4346: not recognized as such, leading to the above-mentioned performance loss.
4347: In particular, parts using
4348: .Sq |\&
4349: or
4350: .Sq {n}
4351: (such as
4352: .Qq foo{3} )
4353: are always considered variable-length.
4354: .Pp
4355: Combining trailing context with the special
4356: .Sq |\&
4357: action can result in fixed trailing context being turned into
4358: the more expensive variable trailing context.
4359: For example, this happens in the following:
4360: .Bd -literal -offset indent
4361: %%
4362: abc |
4363: xyz/def
4364: .Ed
4365: .Pp
4366: Use of
4367: .Fn unput
4368: invalidates yytext and yyleng, unless the
4369: .Dq %array
4370: directive
4371: or the
4372: .Fl l
4373: option has been used.
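.Pp
One way to stay clear of this problem, sketched here under assumptions,
is to copy the token and its length before the first call to
.Fn unput
and to work only from the copy afterwards.
The patterns, the angle-bracket wrapping, and the
.Fn main
function are hypothetical and serve only to make the example complete.

```lex
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"<"[0-9]+">"    printf("wrapped: %s\n", yytext);
[0-9]+          {
                    /*
                     * Save the token and its length first: after unput()
                     * neither yytext nor yyleng can be relied on.
                     */
                    char *copy = strdup(yytext);
                    int i, len = yyleng;

                    if (copy == NULL)
                        exit(1);
                    /* Push back "<digits>", last character first. */
                    unput('>');
                    for (i = len - 1; i >= 0; --i)
                        unput(copy[i]);
                    unput('<');
                    free(copy);
                }
(.|\n)          ;
%%
int
main(void)
{
    yylex();
    return 0;
}
```

The pushed-back text is rescanned and matched by the first rule,
which is longer than any match of the second, so the scanner does not loop.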
4374: .Pp
4375: Pattern-matching of NUL's is substantially slower than matching other
4376: characters.
4377: .Pp
4378: Dynamic resizing of the input buffer is slow, as it entails rescanning
4379: all the text matched so far by the current
4380: .Pq generally huge
4381: token.
4382: .Pp
4383: Due to both buffering of input and read-ahead,
4384: it is not possible to intermix calls to
4385: .Aq Pa stdio.h
4386: routines, such as
4387: .Fn getchar ,
4388: with
4389: .Nm
4390: rules and expect it to work.
4391: Call
4392: .Fn input
4393: instead.
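.Pp
As an illustration, the sketch below skips C-style comments by reading
their bodies with
.Fn input ,
so that every character comes from the scanner's own buffer.
The comment syntax and the
.Fn main
function are assumptions chosen for the example;
text outside comments is echoed by the default rule.

```lex
%{
#include <stdio.h>
%}
%%
"/*"    {
            int c, prev = 0;

            /*
             * input() reads from the scanner's buffer; getchar() would
             * bypass it.  End of file is assumed to be reported as EOF,
             * or as 0 by some versions, so the loop stops on either.
             */
            while ((c = input()) > 0) {
                if (prev == '*' && c == '/')
                    break;
                prev = c;
            }
        }
%%
int
main(void)
{
    yylex();
    return 0;
}
```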
4394: .Pp
4395: The total number of table entries listed by the
4396: .Fl v
4397: flag excludes the number of table entries needed to determine
4398: what rule has been matched.
4399: The number of entries is equal to the number of DFA states
4400: if the scanner does not use
4401: .Em REJECT ,
4402: and somewhat greater than the number of states if it does.
4403: .Pp
4404: .Em REJECT
4405: cannot be used with the
4406: .Fl f
4407: or
4408: .Fl F
4409: options.
4410: .Pp
4411: The
4412: .Nm
4413: internal algorithms need documentation.