Annotation of src/usr.bin/lex/flex.1, Revision 1.18
1.18 ! espie 1: .\" $OpenBSD: flex.1,v 1.17 2003/12/09 12:44:17 jmc Exp $
1.16 jmc 2: .\"
1.12 jmc 3: .\" Copyright (c) 1990 The Regents of the University of California.
4: .\" All rights reserved.
1.2 deraadt 5: .\"
1.12 jmc 6: .\" This code is derived from software contributed to Berkeley by
7: .\" Vern Paxson.
8: .\"
9: .\" The United States Government has rights in this work pursuant
10: .\" to contract no. DE-AC03-76SF00098 between the United States
11: .\" Department of Energy and the University of California.
12: .\"
13: .\" Redistribution and use in source and binary forms, with or without
1.13 millert 14: .\" modification, are permitted provided that the following conditions
15: .\" are met:
16: .\"
17: .\" 1. Redistributions of source code must retain the above copyright
18: .\" notice, this list of conditions and the following disclaimer.
19: .\" 2. Redistributions in binary form must reproduce the above copyright
20: .\" notice, this list of conditions and the following disclaimer in the
21: .\" documentation and/or other materials provided with the distribution.
22: .\"
23: .\" Neither the name of the University nor the names of its contributors
24: .\" may be used to endorse or promote products derived from this software
25: .\" without specific prior written permission.
26: .\"
27: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30: .\" PURPOSE.
1.16 jmc 31: .\"
32: .Dd April 1, 1995
33: .Dt FLEX 1
34: .Os
35: .Sh NAME
36: .Nm flex
37: .Nd fast lexical analyzer generator
38: .Sh SYNOPSIS
39: .Nm
40: .Op Fl 78BbcdFfhIiLlnpsTtVvw+?
41: .Op Fl C Ns Op Cm aeFfmr
42: .Op Fl Fl help
43: .Op Fl Fl version
44: .Sm off
45: .Op Fl o Ar output
46: .Op Fl P Ar prefix
47: .Op Fl S Ar skeleton
48: .Op Ar filename ...
49: .Sm on
50: .Sh OVERVIEW
1.1 deraadt 51: This manual describes
1.16 jmc 52: .Nm ,
53: a tool for generating programs that perform pattern-matching on text.
54: The manual includes both tutorial and reference sections:
55: .Bl -ohang
56: .It Sy Description
57: A brief overview of the tool.
58: .It Sy Some Simple Examples
59: .It Sy Format of the Input File
60: .It Sy Patterns
61: The extended regular expressions used by
62: .Nm .
63: .It Sy How the Input is Matched
64: The rules for determining what has been matched.
65: .It Sy Actions
66: How to specify what to do when a pattern is matched.
67: .It Sy The Generated Scanner
68: Details regarding the scanner that
69: .Nm
70: produces;
71: how to control the input source.
72: .It Sy Start Conditions
73: Introducing context into scanners, and managing
74: .Qq mini-scanners .
75: .It Sy Multiple Input Buffers
76: How to manipulate multiple input sources;
77: how to scan from strings instead of files.
78: .It Sy End-of-File Rules
79: Special rules for matching the end of the input.
80: .It Sy Miscellaneous Macros
81: A summary of macros available to the actions.
82: .It Sy Values Available to the User
83: A summary of values available to the actions.
84: .It Sy Interfacing with Yacc
85: Connecting flex scanners together with
86: .Xr yacc 1
87: parsers.
88: .It Sy Options
89: .Nm
90: command-line options, and the
91: .Dq %option
92: directive.
93: .It Sy Performance Considerations
94: How to make scanners go as fast as possible.
95: .It Sy Generating C++ Scanners
96: The
97: .Pq experimental
98: facility for generating C++ scanner classes.
99: .It Sy Incompatibilities with Lex and POSIX
100: How
101: .Nm
102: differs from AT&T lex and the
103: .Tn POSIX
104: lex standard.
105: .It Sy Files
106: Files used by
107: .Nm .
108: .It Sy Diagnostics
109: Those error messages produced by
110: .Nm
111: .Pq or scanners it generates
112: whose meanings might not be apparent.
113: .It Sy See Also
114: Other documentation, related tools.
115: .It Sy Authors
116: Includes contact information.
117: .It Sy Bugs
118: Known problems with
119: .Nm .
120: .El
121: .Sh DESCRIPTION
122: .Nm
1.1 deraadt 123: is a tool for generating
1.16 jmc 124: .Em scanners :
1.9 millert 125: programs which recognize lexical patterns in text.
1.16 jmc 126: .Nm
127: reads the given input files, or its standard input if no file names are given,
128: for a description of a scanner to generate.
129: The description is in the form of pairs of regular expressions and C code,
130: called
131: .Em rules .
132: .Nm
1.1 deraadt 133: generates as output a C source file,
1.16 jmc 134: .Pa lex.yy.c ,
1.1 deraadt 135: which defines a routine
1.16 jmc 136: .Fn yylex .
1.1 deraadt 137: This file is compiled and linked with the
1.16 jmc 138: .Fl lfl
139: library to produce an executable.
140: When the executable is run, it analyzes its input for occurrences
141: of the regular expressions.
142: Whenever it finds one, it executes the corresponding C code.
143: .Sh SOME SIMPLE EXAMPLES
1.1 deraadt 144: Here are some simple examples that give the flavor of how to use
1.16 jmc 145: .Nm .
1.1 deraadt 146: The following
1.16 jmc 147: .Nm
1.1 deraadt 148: input specifies a scanner which, whenever it encounters the string
1.16 jmc 149: .Qq username ,
150: replaces it with the user's login name:
151: .Bd -literal -offset indent
152: %%
153: username printf("%s", getlogin());
154: .Ed
155: .Pp
1.1 deraadt 156: By default, any text not matched by a
1.16 jmc 157: .Nm
158: scanner is copied to the output, so the net effect of this scanner is
159: to copy its input file to its output with each occurrence of
160: .Qq username
161: expanded.
162: In this input, there is just one rule.
163: .Qq username
164: is the
165: .Em pattern
166: and the
167: .Qq printf
168: is the
169: .Em action .
170: The
171: .Qq %%
172: marks the beginning of the rules.
173: .Pp
1.1 deraadt 174: Here's another simple example:
1.16 jmc 175: .Bd -literal -offset indent
176:     int num_lines = 0, num_chars = 0;
1.1 deraadt 177:
1.16 jmc 178: %%
179: \en      ++num_lines; ++num_chars;
180: \&.       ++num_chars;
181:
182: %%
183: int main(void)
184: {
185:     yylex();
186:     printf("# of lines = %d, # of chars = %d\en",
187:         num_lines, num_chars);
188: }
189: .Ed
190: .Pp
1.1 deraadt 191: This scanner counts the number of characters and the number
1.16 jmc 192: of lines in its input
193: (it produces no output other than the final report on the counts).
194: The first line declares two globals,
195: .Qq num_lines
196: and
197: .Qq num_chars ,
198: which are accessible both inside
199: .Fn yylex
1.1 deraadt 200: and in the
1.16 jmc 201: .Fn main
202: routine declared after the second
203: .Qq %% .
204: There are two rules, one which matches a newline
205: .Pq \&"\en\&"
206: and increments both the line count and the character count,
207: and one which matches any character other than a newline
208: (indicated by the
209: .Qq \&.
210: regular expression).
211: .Pp
1.1 deraadt 212: A somewhat more complicated example:
1.16 jmc 213: .Bd -literal -offset indent
214: /* scanner for a toy Pascal-like language */
1.1 deraadt 215:
1.16 jmc 216: %{
217: /* need this for the calls to atoi() and atof() below */
218: #include <stdlib.h>
219: %}
1.1 deraadt 220:
1.16 jmc 221: DIGIT [0-9]
222: ID [a-z][a-z0-9]*
1.1 deraadt 223:
1.16 jmc 224: %%
1.1 deraadt 225:
1.16 jmc 226: {DIGIT}+    {
227:             printf("An integer: %s (%d)\en", yytext,
228:                 atoi(yytext));
229:             }
1.1 deraadt 230:
1.16 jmc 231: {DIGIT}+"."{DIGIT}*    {
232:             printf("A float: %s (%g)\en", yytext,
233:                 atof(yytext));
234:             }
1.1 deraadt 235:
1.16 jmc 236: if|then|begin|end|procedure|function    {
237:             printf("A keyword: %s\en", yytext);
238:             }
1.1 deraadt 239:
1.16 jmc 240: {ID} printf("An identifier: %s\en", yytext);
1.1 deraadt 241:
1.16 jmc 242: "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
1.1 deraadt 243:
1.16 jmc 244: "{"[^}\en]*"}" /* eat up one-line comments */
1.1 deraadt 245:
1.16 jmc 246: [ \et\en]+ /* eat up whitespace */
1.1 deraadt 247:
1.16 jmc 248: \&. printf("Unrecognized character: %s\en", yytext);
1.1 deraadt 249:
1.16 jmc 250: %%
1.1 deraadt 251:
1.16 jmc 252: int main(int argc, char *argv[])
253: {
254:     ++argv; --argc;  /* skip over program name */
255:     if (argc > 0)
256:         yyin = fopen(argv[0], "r");
1.1 deraadt 257:     else
258:         yyin = stdin;
1.7 aaron 259:
1.1 deraadt 260:     yylex();
1.16 jmc 261: }
262: .Ed
263: .Pp
264: This is the beginning of a simple scanner for a language like Pascal.
265: It identifies different types of
266: .Em tokens
1.1 deraadt 267: and reports on what it has seen.
1.16 jmc 268: .Pp
269: The details of this example will be explained in the following sections.
270: .Sh FORMAT OF THE INPUT FILE
1.1 deraadt 271: The
1.16 jmc 272: .Nm
1.1 deraadt 273: input file consists of three sections, separated by a line with just
1.16 jmc 274: .Qq %%
1.1 deraadt 275: in it:
1.16 jmc 276: .Bd -unfilled -offset indent
277: definitions
278: %%
279: rules
280: %%
281: user code
282: .Ed
283: .Pp
1.1 deraadt 284: The
1.16 jmc 285: .Em definitions
1.1 deraadt 286: section contains declarations of simple
1.16 jmc 287: .Em name
1.1 deraadt 288: definitions to simplify the scanner specification, and declarations of
1.16 jmc 289: .Em start conditions ,
1.1 deraadt 290: which are explained in a later section.
1.16 jmc 291: .Pp
1.1 deraadt 292: Name definitions have the form:
1.16 jmc 293: .Pp
294: .D1 name definition
295: .Pp
296: The
297: .Qq name
298: is a word beginning with a letter or an underscore
299: .Pq Sq _
300: followed by zero or more letters, digits,
301: .Sq _ ,
302: or
303: .Sq -
304: .Pq dash .
1.8 aaron 305: The definition is taken to begin at the first non-whitespace character
1.1 deraadt 306: following the name and to continue to the end of the line.
1.16 jmc 307: The definition can subsequently be referred to using
308: .Qq {name} ,
309: which will expand to
310: .Qq (definition) .
311: For example:
312: .Bd -literal -offset indent
313: DIGIT [0-9]
314: ID [a-z][a-z0-9]*
315: .Ed
316: .Pp
317: This defines
318: .Qq DIGIT
319: to be a regular expression which matches a single digit, and
320: .Qq ID
321: to be a regular expression which matches a lowercase letter
1.1 deraadt 322: followed by zero or more lowercase letters or digits.
323: A subsequent reference to
1.16 jmc 324: .Pp
325: .Dl {DIGIT}+"."{DIGIT}*
326: .Pp
1.1 deraadt 327: is identical to
1.16 jmc 328: .Pp
329: .Dl ([0-9])+"."([0-9])*
330: .Pp
331: and matches one-or-more digits followed by a
332: .Sq .\&
333: followed by zero-or-more digits.
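.Pp
The substitution described above can be modelled in a few lines of C.
This is only an illustrative sketch and says nothing about how
.Nm
is implemented internally:
the helper name expand_name and its buffer handling are hypothetical,
and it knows about a single definition and no quoting rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of flex: copy `pattern` into `out`,
 * replacing every "{name}" with "(definition)".
 * Returns 0 on success, -1 if `out` (of size `outsz`) is too small.
 */
static int
expand_name(const char *pattern, const char *name, const char *definition,
    char *out, size_t outsz)
{
    size_t nlen = strlen(name), dlen = strlen(definition), used = 0;

    assert(outsz > 0);
    while (*pattern != '\0') {
        /* A reference looks like '{' name '}'. */
        if (pattern[0] == '{' &&
            strncmp(pattern + 1, name, nlen) == 0 &&
            pattern[1 + nlen] == '}') {
            /* Need room for '(' + definition + ')' + NUL. */
            if (used + dlen + 2 >= outsz)
                return -1;
            out[used++] = '(';
            memcpy(out + used, definition, dlen);
            used += dlen;
            out[used++] = ')';
            pattern += nlen + 2;
        } else {
            /* Need room for one character + NUL. */
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The added parentheses are what keep a following operator such as
.Sq +
bound to the whole definition rather than to its last element.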
334: .Pp
1.1 deraadt 335: The
1.16 jmc 336: .Em rules
1.1 deraadt 337: section of the
1.16 jmc 338: .Nm
1.1 deraadt 339: input contains a series of rules of the form:
1.16 jmc 340: .Pp
341: .D1 pattern action
342: .Pp
343: The pattern must be unindented and the action must begin
1.1 deraadt 344: on the same line.
1.16 jmc 345: .Pp
1.1 deraadt 346: See below for a further description of patterns and actions.
1.16 jmc 347: .Pp
1.1 deraadt 348: Finally, the user code section is simply copied to
1.16 jmc 349: .Pa lex.yy.c
1.1 deraadt 350: verbatim.
1.16 jmc 351: It is used for companion routines which call or are called by the scanner.
352: The presence of this section is optional;
1.1 deraadt 353: if it is missing, the second
1.16 jmc 354: .Qq %%
355: in the input file may be skipped too.
356: .Pp
357: In the definitions and rules sections, any indented text or text enclosed in
358: .Sq %{
1.1 deraadt 359: and
1.16 jmc 360: .Sq %}
361: is copied verbatim to the output
362: .Pq with the %{}'s removed .
1.1 deraadt 363: The %{}'s must appear unindented on lines by themselves.
1.16 jmc 364: .Pp
1.1 deraadt 365: In the rules section,
1.16 jmc 366: any indented or %{} text appearing before the first rule may be used to
367: declare variables which are local to the scanning routine and
368: .Pq after the declarations
1.1 deraadt 369: code which is to be executed whenever the scanning routine is entered.
370: Other indented or %{} text in the rule section is still copied to the output,
371: but its meaning is not well-defined and it may well cause compile-time
372: errors (this feature is present for
1.16 jmc 373: .Tn POSIX
1.1 deraadt 374: compliance; see below for other such features).
1.16 jmc 375: .Pp
376: In the definitions section
377: .Pq but not in the rules section ,
378: an unindented comment
379: (i.e., a line beginning with
380: .Qq /* )
381: is also copied verbatim to the output up to the next
382: .Qq */ .
383: .Sh PATTERNS
1.1 deraadt 384: The patterns in the input are written using an extended set of regular
1.16 jmc 385: expressions.
386: These are:
387: .Bl -tag -width "XXXXXXXX"
388: .It x
389: Match the character
390: .Sq x .
391: .It .\&
392: Any character
393: .Pq byte
394: except newline.
395: .It [xyz]
396: A
397: .Qq character class ;
398: in this case, the pattern matches either an
399: .Sq x ,
400: a
401: .Sq y ,
402: or a
403: .Sq z .
404: .It [abj-oZ]
405: A
406: .Qq character class
407: with a range in it; matches an
408: .Sq a ,
409: a
410: .Sq b ,
411: any letter from
412: .Sq j
413: through
414: .Sq o ,
415: or a
416: .Sq Z .
417: .It [^A-Z]
418: A
419: .Qq negated character class ,
420: i.e., any character but those in the class.
421: In this case, any character EXCEPT an uppercase letter.
422: .It [^A-Z\en]
423: Any character EXCEPT an uppercase letter or a newline.
424: .It r*
425: Zero or more r's, where
426: .Sq r
427: is any regular expression.
428: .It r+
429: One or more r's.
430: .It r?
431: Zero or one r's (that is,
432: .Qq an optional r ) .
433: .It r{2,5}
434: Anywhere from two to five r's.
435: .It r{2,}
436: Two or more r's.
437: .It r{4}
438: Exactly 4 r's.
439: .It {name}
440: The expansion of the
441: .Qq name
442: definition
443: .Pq see above .
444: .It \&"[xyz]\e\&"foo\&"
445: The literal string: [xyz]"foo.
446: .It \eX
447: If
448: .Sq X
449: is an
450: .Sq a ,
451: .Sq b ,
452: .Sq f ,
453: .Sq n ,
454: .Sq r ,
455: .Sq t ,
456: or
457: .Sq v ,
458: then the ANSI-C interpretation of
459: .Sq \eX .
460: Otherwise, a literal
461: .Sq X
462: (used to escape operators such as
463: .Sq * ) .
464: .It \e0
465: A NUL character
466: .Pq ASCII code 0 .
467: .It \e123
468: The character with octal value 123.
469: .It \ex2a
470: The character with hexadecimal value 2a.
471: .It (r)
472: Match an
473: .Sq r ;
474: parentheses are used to override precedence
475: .Pq see below .
476: .It rs
477: The regular expression
478: .Sq r
479: followed by the regular expression
480: .Sq s ;
481: called
482: .Qq concatenation .
483: .It r|s
484: Either an
485: .Sq r
486: or an
487: .Sq s .
488: .It r/s
489: An
490: .Sq r ,
491: but only if it is followed by an
492: .Sq s .
493: The text matched by
494: .Sq s
495: is included when determining whether this rule is the
496: .Qq longest match ,
497: but is then returned to the input before the action is executed.
498: So the action only sees the text matched by
499: .Sq r .
500: This type of pattern is called
501: .Qq trailing context .
502: (There are some combinations of r/s that
503: .Nm
504: cannot match correctly; see notes in the
505: .Sx BUGS
506: section below regarding
507: .Qq dangerous trailing context . )
508: .It ^r
509: An
510: .Sq r ,
511: but only at the beginning of a line
512: (i.e., just starting to scan, or right after a newline has been scanned).
513: .It r$
514: An
515: .Sq r ,
516: but only at the end of a line
517: .Pq i.e., just before a newline .
518: Equivalent to
519: .Qq r/\en .
520: .Pp
521: Note that
522: .Nm flex Ns 's
523: notion of
524: .Qq newline
525: is exactly whatever the C compiler used to compile
526: .Nm
527: interprets
528: .Sq \en
529: as.
530: .\" In particular, on some DOS systems you must either filter out \er's in the
531: .\" input yourself, or explicitly use r/\er\en for
532: .\" .Qq r$ .
533: .It <s>r
534: An
535: .Sq r ,
536: but only in start condition
537: .Sq s
538: .Pq see below for discussion of start conditions .
539: .It <s1,s2,s3>r
540: The same, but in any of start conditions s1, s2, or s3.
541: .It <*>r
542: An
543: .Sq r
544: in any start condition, even an exclusive one.
545: .It <<EOF>>
546: An end-of-file.
547: .It <s1,s2><<EOF>>
548: An end-of-file when in start condition s1 or s2.
549: .El
550: .Pp
1.1 deraadt 551: Note that inside of a character class, all regular expression operators
1.16 jmc 552: lose their special meaning except escape
553: .Pq Sq \e
554: and the character class operators,
555: .Sq - ,
556: .Sq ]\& ,
557: and, at the beginning of the class,
558: .Sq ^ .
559: .Pp
1.1 deraadt 560: The regular expressions listed above are grouped according to
561: precedence, from highest precedence at the top to lowest at the bottom.
1.16 jmc 562: Those grouped together have equal precedence.
563: For example,
564: .Pp
565: .D1 foo|bar*
566: .Pp
1.1 deraadt 567: is the same as
1.16 jmc 568: .Pp
569: .D1 (foo)|(ba(r*))
570: .Pp
571: since the
572: .Sq *
573: operator has higher precedence than concatenation,
574: and concatenation higher than alternation
575: .Pq Sq |\& .
576: This pattern therefore matches
577: .Em either
578: the string
579: .Qq foo
580: .Em or
581: the string
582: .Qq ba
583: followed by zero-or-more r's.
584: To match
585: .Qq foo
586: or zero-or-more "bar"'s,
587: use:
588: .Pp
589: .D1 foo|(bar)*
590: .Pp
1.1 deraadt 591: and to match zero-or-more "foo"'s-or-"bar"'s:
1.16 jmc 592: .Pp
593: .D1 (foo|bar)*
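.Pp
The three groupings above can be checked outside of
.Nm
with the POSIX
.Xr regcomp 3
interface, whose extended regular expressions give
.Sq * ,
concatenation, and
.Sq |\&
the same relative precedence.
The following is a sketch under that assumption;
the helper name matches_whole is hypothetical,
and the explicit ^( ... )$ anchors are added only so that
a whole string is tested rather than a substring.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `text` matches the POSIX extended
 * regular expression `pattern`, 0 if not.  The caller supplies its own
 * ^( ... )$ anchors to force a whole-string match.
 */
static int
matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);    /* the patterns used here are known to compile */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper,
.Qq barbar
is rejected by ^(foo|bar*)$ but accepted by ^(foo|(bar)*)$,
and
.Qq foobarfoo
is accepted only by ^((foo|bar)*)$.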
594: .Pp
1.1 deraadt 595: In addition to characters and ranges of characters, character classes
596: can also contain character class
1.16 jmc 597: .Em expressions .
1.1 deraadt 598: These are expressions enclosed inside
1.16 jmc 599: .Sq [:
600: and
601: .Sq :]
602: delimiters (which themselves must appear between the
603: .Sq [
1.1 deraadt 604: and
1.16 jmc 605: .Sq ]\&
606: of the
1.1 deraadt 607: character class; other elements may occur inside the character class, too).
608: The valid expressions are:
1.16 jmc 609: .Bd -unfilled -offset indent
610: [:alnum:] [:alpha:] [:blank:]
611: [:cntrl:] [:digit:] [:graph:]
612: [:lower:] [:print:] [:punct:]
613: [:space:] [:upper:] [:xdigit:]
614: .Ed
615: .Pp
1.1 deraadt 616: These expressions all designate a set of characters equivalent to
617: the corresponding standard C
1.16 jmc 618: .Fn isXXX
619: function.
620: For example, [:alnum:] designates those characters for which
621: .Xr isalnum 3
622: returns true \- i.e., any alphabetic or numeric character.
1.1 deraadt 623: Some systems don't provide
1.16 jmc 624: .Xr isblank 3 ,
625: so
626: .Nm
627: defines [:blank:] as a blank or a tab.
628: .Pp
1.1 deraadt 629: For example, the following character classes are all equivalent:
1.16 jmc 630: .Bd -unfilled -offset indent
631: [[:alnum:]]
632: [[:alpha:][:digit:]]
633: [[:alpha:]0-9]
634: [a-zA-Z0-9]
635: .Ed
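.Pp
Because each expression is defined in terms of the matching
.Fn isXXX
function, the equivalence can be sketched directly in C.
The check below is an illustration only:
the function name alnum_spellings_agree is hypothetical,
it looks at 7-bit characters in the default
.Qq C
locale, and the explicit ranges assume an ASCII character set.

```c
#include <ctype.h>

/*
 * Hypothetical check, not part of flex: return 1 if, for every 7-bit
 * character, the four spellings of the class select the same characters.
 */
static int
alnum_spellings_agree(void)
{
    int c;

    for (c = 0; c < 128; c++) {
        /* [[:alnum:]] */
        int cls_alnum = isalnum(c) != 0;
        /* [[:alpha:][:digit:]] */
        int cls_alpha_digit = isalpha(c) || isdigit(c);
        /* [[:alpha:]0-9] */
        int cls_alpha_range = isalpha(c) || (c >= '0' && c <= '9');
        /* [a-zA-Z0-9] */
        int cls_ranges = (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9');

        if (cls_alnum != cls_alpha_digit ||
            cls_alnum != cls_alpha_range ||
            cls_alnum != cls_ranges)
            return 0;
    }
    return 1;
}
```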
636: .Pp
637: If the scanner is case-insensitive (the
638: .Fl i
639: flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
640: .Pp
1.1 deraadt 641: Some notes on patterns:
1.16 jmc 642: .Bl -dash
643: .It
644: A negated character class such as the example
645: .Qq [^A-Z]
646: above will match a newline unless "\en"
647: .Pq or an equivalent escape sequence
648: is one of the characters explicitly present in the negated character class
649: (e.g.,
650: .Qq [^A-Z\en] ) .
651: This is unlike how many other regular expression tools treat negated character
652: classes, but unfortunately the inconsistency is historically entrenched.
653: Matching newlines means that a pattern like
654: .Qq [^"]*
655: can match the entire input unless there's another quote in the input.
656: .It
657: A rule can have at most one instance of trailing context
658: (the
659: .Sq /
660: operator or the
661: .Sq $
662: operator).
663: The start condition,
664: .Sq ^ ,
665: and
666: .Qq <<EOF>>
667: patterns can only occur at the beginning of a pattern and, like
668: .Sq /
669: and
670: .Sq $ ,
671: cannot be grouped inside parentheses.
672: A
673: .Sq ^
674: which does not occur at the beginning of a rule or a
675: .Sq $
676: which does not occur at the end of a rule loses its special properties
677: and is treated as a normal character.
678: .It
1.1 deraadt 679: The following are illegal:
1.16 jmc 680: .Bd -unfilled -offset indent
681: foo/bar$
682: <sc1>foo<sc2>bar
683: .Ed
684: .Pp
685: Note that the first of these can be written
686: .Qq foo/bar\en .
687: .It
688: The following will result in
689: .Sq $
690: or
691: .Sq ^
692: being treated as a normal character:
693: .Bd -unfilled -offset indent
694: foo|(bar$)
695: foo|^bar
696: .Ed
697: .Pp
698: If what's wanted is a
699: .Qq foo
700: or a bar-followed-by-a-newline, the following could be used
701: (the special
702: .Sq |\&
703: action is explained below):
704: .Bd -unfilled -offset indent
705: foo |
706: bar$ /* action goes here */
707: .Ed
708: .Pp
1.1 deraadt 709: A similar trick will work for matching a foo or a
710: bar-at-the-beginning-of-a-line.
1.16 jmc 711: .El
712: .Sh HOW THE INPUT IS MATCHED
713: When the generated scanner is run,
714: it analyzes its input looking for strings which match any of its patterns.
715: If it finds more than one match,
716: it takes the one matching the most text
717: (for trailing context rules, this includes the length of the trailing part,
718: even though it will then be returned to the input).
719: If it finds two or more matches of the same length,
720: the rule listed first in the
721: .Nm
1.1 deraadt 722: input file is chosen.
1.16 jmc 723: .Pp
1.1 deraadt 724: Once the match is determined, the text corresponding to the match
725: (called the
1.16 jmc 726: .Em token )
1.1 deraadt 727: is made available in the global character pointer
1.16 jmc 728: .Fa yytext ,
1.1 deraadt 729: and its length in the global integer
1.16 jmc 730: .Fa yyleng .
1.1 deraadt 731: The
1.16 jmc 732: .Em action
733: corresponding to the matched pattern is then executed
734: .Pq a more detailed description of actions follows ,
735: and then the remaining input is scanned for another match.
736: .Pp
737: If no match is found, then the default rule is executed:
738: the next character in the input is considered matched and
739: copied to the standard output.
740: Thus, the simplest legal
741: .Nm
1.1 deraadt 742: input is:
1.16 jmc 743: .Pp
744: .D1 %%
745: .Pp
746: which generates a scanner that simply copies its input
747: .Pq one character at a time
748: to its output.
749: .Pp
1.1 deraadt 750: Note that
1.16 jmc 751: .Fa yytext
752: can be defined in two different ways:
753: either as a character pointer or as a character array.
754: Which definition
755: .Nm
756: uses can be controlled by including one of the special directives
757: .Dq %pointer
758: or
759: .Dq %array
760: in the first
761: .Pq definitions
762: section of flex input.
763: The default is
764: .Dq %pointer ,
765: unless the
766: .Fl l
767: lex compatibility option is used, in which case
768: .Fa yytext
1.1 deraadt 769: will be an array.
770: The advantage of using
1.16 jmc 771: .Dq %pointer
1.1 deraadt 772: is substantially faster scanning and no buffer overflow when matching
1.16 jmc 773: very large tokens
774: .Pq unless the scanner runs out of dynamic memory .
775: The disadvantage is that actions are restricted in how they can modify
776: .Fa yytext
777: .Pq see the next section ,
778: and calls to the
779: .Fn unput
1.10 deraadt 780: function destroy the present contents of
1.16 jmc 781: .Fa yytext ,
1.1 deraadt 782: which can be a considerable porting headache when moving between different
1.16 jmc 783: .Nm lex
1.1 deraadt 784: versions.
1.16 jmc 785: .Pp
1.1 deraadt 786: The advantage of
1.16 jmc 787: .Dq %array
788: is that
789: .Fa yytext
790: can be modified as much as wanted, and calls to
791: .Fn unput
1.1 deraadt 792: do not destroy
1.16 jmc 793: .Fa yytext
794: .Pq see below .
795: Furthermore, existing
796: .Nm lex
1.1 deraadt 797: programs sometimes access
1.16 jmc 798: .Fa yytext
1.1 deraadt 799: externally using declarations of the form:
1.16 jmc 800: .Pp
801: .D1 extern char yytext[];
802: .Pp
1.1 deraadt 803: This definition is erroneous when used with
1.16 jmc 804: .Dq %pointer ,
1.1 deraadt 805: but correct for
1.16 jmc 806: .Dq %array .
807: .Pp
808: .Dq %array
1.1 deraadt 809: defines
1.16 jmc 810: .Fa yytext
1.1 deraadt 811: to be an array of
1.16 jmc 812: .Dv YYLMAX
813: characters, which defaults to a fairly large value.
814: The size can be changed by simply #define'ing
815: .Dv YYLMAX
816: to a different value in the first section of
817: .Nm
818: input.
819: As mentioned above, with
820: .Dq %pointer
821: yytext grows dynamically to accommodate large tokens.
822: While this means a
823: .Dq %pointer
824: scanner can accommodate very large tokens
825: .Pq such as matching entire blocks of comments ,
826: bear in mind that each time the scanner must resize
827: .Fa yytext
1.1 deraadt 828: it also must rescan the entire token from the beginning, so matching such
829: tokens can prove slow.
1.16 jmc 830: .Fa yytext
831: presently does not dynamically grow if a call to
832: .Fn unput
1.1 deraadt 833: results in too much text being pushed back; instead, a run-time error results.
1.16 jmc 834: .Pp
835: Also note that
836: .Dq %array
837: cannot be used with C++ scanner classes
838: .Pq the c++ option; see below .
839: .Sh ACTIONS
840: Each pattern in a rule has a corresponding action,
841: which can be any arbitrary C statement.
842: The pattern ends at the first non-escaped whitespace character;
843: the remainder of the line is its action.
844: If the action is empty,
845: then when the pattern is matched the input token is simply discarded.
846: For example, here is the specification for a program
847: which deletes all occurrences of
848: .Qq zap me
849: from its input:
850: .Bd -literal -offset indent
851: %%
852: "zap me"
853: .Ed
854: .Pp
1.1 deraadt 855: (It will copy all other characters in the input to the output since
856: they will be matched by the default rule.)
1.16 jmc 857: .Pp
1.1 deraadt 858: Here is a program which compresses multiple blanks and tabs down to
859: a single blank, and throws away whitespace found at the end of a line:
1.16 jmc 860: .Bd -literal -offset indent
861: %%
862: [ \et]+ putchar(' ');
863: [ \et]+$ /* ignore this token */
864: .Ed
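.Pp
The effect of these two rules, together with the default rule,
can be written out by hand in C.
The function below is a hypothetical model for illustration,
not code generated by
.Nm ;
the name squeeze_blanks is an assumption, and it works on a string
rather than on the scanner's input stream.

```c
#include <stddef.h>

/*
 * Hypothetical C model of the two-rule scanner above.  Copies `in` to
 * `out`, which must have room for strlen(in) + 1 bytes, and returns the
 * number of bytes written, not counting the terminating NUL.
 */
static size_t
squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, as [ \t]+ would. */
            while (*in == ' ' || *in == '\t')
                in++;
            /* A run directly before a newline is the [ \t]+$ case. */
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            /* Everything else falls through to the default rule. */
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

In the scanner itself the second rule wins at the end of a line because
its trailing context, the newline, counts towards the length of the match.
A run of blanks at the very end of the input with no newline after it
is still reduced to a single blank.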
865: .Pp
866: If the action contains a
867: .Sq { ,
868: then the action spans till the balancing
869: .Sq }
1.1 deraadt 870: is found, and the action may cross multiple lines.
1.16 jmc 871: .Nm
1.1 deraadt 872: knows about C strings and comments and won't be fooled by braces found
873: within them, but also allows actions to begin with
1.16 jmc 874: .Sq %{
1.1 deraadt 875: and will consider the action to be all the text up to the next
1.16 jmc 876: .Sq %}
877: .Pq regardless of ordinary braces inside the action .
878: .Pp
879: An action consisting solely of a vertical bar
880: .Pq Sq |\&
881: means
882: .Qq same as the action for the next rule .
883: See below for an illustration.
884: .Pp
885: Actions can include arbitrary C code,
886: including return statements to return a value to whatever routine called
887: .Fn yylex .
1.1 deraadt 888: Each time
1.16 jmc 889: .Fn yylex
890: is called, it continues processing tokens from where it last left off
891: until it either reaches the end of the file or executes a return.
892: .Pp
1.1 deraadt 893: Actions are free to modify
1.16 jmc 894: .Fa yytext
895: except for lengthening it
896: (adding characters to its end \- these will overwrite later characters in the
897: input stream).
898: This, however, does not apply when using
899: .Dq %array
900: .Pq see above ;
901: in that case,
902: .Fa yytext
1.1 deraadt 903: may be freely modified in any way.
1.16 jmc 904: .Pp
1.1 deraadt 905: Actions are free to modify
1.16 jmc 906: .Fa yyleng
1.1 deraadt 907: except they should not do so if the action also includes use of
1.16 jmc 908: .Fn yymore
909: .Pq see below .
910: .Pp
1.1 deraadt 911: There are a number of special directives which can be included within
912: an action:
1.16 jmc 913: .Bl -tag -width Ds
914: .It ECHO
915: Copies
916: .Fa yytext
917: to the scanner's output.
918: .It BEGIN
919: Followed by the name of a start condition, places the scanner in the
920: corresponding start condition
921: .Pq see below .
922: .It REJECT
923: Directs the scanner to proceed on to the
924: .Qq second best
925: rule which matched the input
926: .Pq or a prefix of the input .
927: The rule is chosen as described above in
928: .Sx HOW THE INPUT IS MATCHED ,
929: and
930: .Fa yytext
1.1 deraadt 931: and
1.16 jmc 932: .Fa yyleng
1.1 deraadt 933: set up appropriately.
934: It may either be one which matched as much text
935: as the originally chosen rule but came later in the
1.16 jmc 936: .Nm
1.1 deraadt 937: input file, or one which matched less text.
938: For example, the following will both count the
1.16 jmc 939: words in the input and call the routine
940: .Fn special
941: whenever
942: .Qq frob
943: is seen:
944: .Bd -literal -offset indent
945:     int word_count = 0;
946: %%
947:
948: frob special(); REJECT;
949: [^ \et\en]+ ++word_count;
950: .Ed
951: .Pp
1.1 deraadt 952: Without the
1.16 jmc 953: .Em REJECT ,
954: any "frob"'s in the input would not be counted as words,
955: since the scanner normally executes only one action per token.
1.1 deraadt 956: Multiple
1.16 jmc 957: .Em REJECT Ns 's
958: are allowed,
959: each one finding the next best choice to the currently active rule.
960: For example, when the following scanner scans the token
961: .Qq abcd ,
962: it will write
963: .Qq abcdabcaba
964: to the output:
965: .Bd -literal -offset indent
966: %%
967: a |
968: ab |
969: abc |
970: abcd ECHO; REJECT;
971: \&.|\en /* eat up any unmatched character */
972: .Ed
973: .Pp
1.1 deraadt 974: (The first three rules share the fourth's action since they use
1.16 jmc 975: the special
976: .Sq |\&
977: action.)
978: .Em REJECT
1.1 deraadt 979: is a particularly expensive feature in terms of scanner performance;
1.16 jmc 980: if it is used in any of the scanner's actions it will slow down
981: all of the scanner's matching.
982: Furthermore,
983: .Em REJECT
1.1 deraadt 984: cannot be used with the
1.16 jmc 985: .Fl Cf
1.1 deraadt 986: or
1.16 jmc 987: .Fl CF
988: options
989: .Pq see below .
990: .Pp
1.1 deraadt 991: Note also that unlike the other special actions,
1.16 jmc 992: .Em REJECT
1.1 deraadt 993: is a
1.16 jmc 994: .Em branch ;
995: code immediately following it in the action will not be executed.
996: .It yymore()
997: Tells the scanner that the next time it matches a rule, the corresponding
998: token should be appended onto the current value of
999: .Fa yytext
1000: rather than replacing it.
1001: For example, given the input
1002: .Qq mega-kludge
1003: the following will write
1004: .Qq mega-mega-kludge
1005: to the output:
1006: .Bd -literal -offset indent
1007: %%
1008: mega- ECHO; yymore();
1009: kludge ECHO;
1010: .Ed
1011: .Pp
1012: First
1013: .Qq mega-
1014: is matched and echoed to the output.
1015: Then
1016: .Qq kludge
1017: is matched, but the previous
1018: .Qq mega-
1019: is still hanging around at the beginning of
1020: .Fa yytext
1.1 deraadt 1021: so the
1.16 jmc 1022: .Em ECHO
1023: for the
1024: .Qq kludge
1025: rule will actually write
1026: .Qq mega-kludge .
1027: .Pp
1.1 deraadt 1028: Two notes regarding use of
1.16 jmc 1029: .Fn yymore :
1.1 deraadt 1030: First,
1.16 jmc 1031: .Fn yymore
1.1 deraadt 1032: depends on the value of
1.16 jmc 1033: .Fa yyleng
1034: correctly reflecting the size of the current token, so
1035: .Fa yyleng
1036: must not be modified when using
1037: .Fn yymore .
1.1 deraadt 1038: Second, the presence of
1.16 jmc 1039: .Fn yymore
1.1 deraadt 1040: in the scanner's action entails a minor performance penalty in the
1041: scanner's matching speed.
1.16 jmc 1042: .It yyless(n)
1043: Returns all but the first
1044: .Ar n
1.1 deraadt 1045: characters of the current token back to the input stream, where they
1046: will be rescanned when the scanner looks for the next match.
1.16 jmc 1047: .Fa yytext
1.1 deraadt 1048: and
1.16 jmc 1049: .Fa yyleng
1.1 deraadt 1050: are adjusted appropriately (e.g.,
1.16 jmc 1051: .Fa yyleng
1.1 deraadt 1052: will now be equal to
1.16 jmc 1053: .Ar n ) .
1054: For example, on the input
1055: .Qq foobar
1056: the following will write out
1057: .Qq foobarbar :
1058: .Bd -literal -offset indent
1059: %%
1060: foobar ECHO; yyless(3);
1061: [a-z]+ ECHO;
1062: .Ed
1063: .Pp
1.1 deraadt 1064: An argument of 0 to
1.16 jmc 1065: .Fn yyless
1066: will cause the entire current token to be scanned again.
1067: Unless the way the scanner subsequently processes its input has been changed
1068: (using
1069: .Em BEGIN ,
1070: for example),
1071: this will result in an endless loop.
1072: .Pp
1.1 deraadt 1073: Note that
1.16 jmc 1074: .Fn yyless
1075: is a macro and can only be used in the
1076: .Nm
1077: input file, not from other source files.
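.Pp
The effect on the token and on the rescan position can be sketched in plain C.
This is only an illustration under stated assumptions:
the token structure and the token_less function are hypothetical and are not
part of
.Nm .

```c
#include <stddef.h>

/* Hypothetical stand-in for yytext and yyleng. */
struct token {
    const char *text;   /* start of the matched text */
    size_t len;         /* number of matched characters */
};

/*
 * Keep the first n characters of the token and return a pointer to
 * the place where scanning would resume.
 */
const char *
token_less(struct token *t, size_t n)
{
    if (n < t->len)
        t->len = n;
    return t->text + t->len;
}
```

With the token
.Qq foobar
and an argument of 3, the token shrinks to
.Qq foo
and scanning resumes at
.Qq bar .
With an argument of 0 scanning resumes at the start of the same text,
which is the situation that can loop forever.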
1078: .It unput(c)
1079: Puts the character
1080: .Ar c
1081: back into the input stream.
1082: It will be the next character scanned.
1.1 deraadt 1083: The following action will take the current token and cause it
1084: to be rescanned enclosed in parentheses.
1.16 jmc 1085: .Bd -literal -offset indent
1086: {
1087:     int i;
1088:     char *yycopy;
1089:
1090:     /* Copy yytext because unput() trashes yytext */
1091:     if ((yycopy = strdup(yytext)) == NULL)
1092:         err(1, NULL);
1093:     unput(')');
1094:     for (i = yyleng - 1; i >= 0; --i)
1095:         unput(yycopy[i]);
1096:     unput('(');
1097:     free(yycopy);
1098: }
1099: .Ed
1100: .Pp
1.1 deraadt 1101: Note that since each
1.16 jmc 1102: .Fn unput
1103: puts the given character back at the beginning of the input stream,
1104: pushing back strings must be done back-to-front.
1105: .Pp
1.1 deraadt 1106: An important potential problem when using
1.16 jmc 1107: .Fn unput
1108: is that if using
1109: .Dq %pointer
1110: .Pq the default ,
1111: a call to
1112: .Fn unput
1113: destroys the contents of
1114: .Fa yytext ,
1.1 deraadt 1115: starting with its rightmost character and devouring one character to
1.16 jmc 1116: the left with each call.
1117: If the value of
1118: .Fa yytext
1119: should be preserved after a call to
1120: .Fn unput
1121: .Pq as in the above example ,
1122: it must either first be copied elsewhere, or the scanner must be built using
1123: .Dq %array
1124: instead (see
1125: .Sx HOW THE INPUT IS MATCHED ) .
1126: .Pp
1127: Finally, note that EOF cannot be put back
1.1 deraadt 1128: in an attempt to mark the input stream with an end-of-file.
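.Pp
The back-to-front rule follows from the pushed-back characters behaving
like a stack.
A minimal sketch in plain C, assuming a hypothetical pushback structure that
is separate from the scanner's real buffer
(pb_unput, pb_input, and pb_unput_string are illustrative names only):

```c
#include <stddef.h>

#define PUSHBACK_MAX 64     /* arbitrary capacity for this sketch */

struct pushback {
    char buf[PUSHBACK_MAX];
    size_t n;               /* number of characters pushed back */
};

/* Push one character; it becomes the next one read. */
int
pb_unput(struct pushback *pb, char c)
{
    if (pb->n >= PUSHBACK_MAX)
        return -1;
    pb->buf[pb->n++] = c;
    return 0;
}

/* Read the most recently pushed character, or -1 if there is none. */
int
pb_input(struct pushback *pb)
{
    return pb->n > 0 ? (unsigned char)pb->buf[--pb->n] : -1;
}

/* Push a whole string so that it is read back in its original order. */
int
pb_unput_string(struct pushback *pb, const char *s, size_t len)
{
    while (len > 0) {
        if (pb_unput(pb, s[--len]) != 0)
            return -1;
    }
    return 0;
}
```

Pushing the closing parenthesis first, then the text from its last character
to its first, then the opening parenthesis yields the text enclosed in
parentheses when it is read back.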
1.16 jmc 1129: .It input()
1130: Reads the next character from the input stream.
1131: For example, the following is one way to eat up C comments:
1132: .Bd -literal -offset indent
1133: %%
1134: "/*" {
1135:     int c;
1136:
1137:     for (;;) {
1138:         while ((c = input()) != '*' && c != EOF)
1139:             ; /* eat up text of comment */
1140:
1141:         if (c == '*') {
1142:             while ((c = input()) == '*')
1143:                 ;
1144:             if (c == '/')
1145:                 break; /* found the end */
1146:         }
1147:
1148:         if (c == EOF) {
1149:             errx(1, "EOF in comment");
1.1 deraadt 1150:             break;
1151:         }
1.16 jmc 1152:     }
1153: }
1154: .Ed
1155: .Pp
1156: (Note that if the scanner is compiled using C++, then
1157: .Fn input
1.1 deraadt 1158: is instead referred to as
1.16 jmc 1159: .Fn yyinput ,
1160: in order to avoid a name clash with the C++ stream by the name of input.)
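.Pp
The loop in the example can be tried against an in-memory string.
The following is a sketch only:
the reader structure and next_char function are hypothetical replacements for
the scanner's input, and an error return takes the place of the call to errx.

```c
#include <stdio.h>

/* Hypothetical character source: a cursor into a NUL-terminated string. */
struct reader {
    const char *p;
};

/* Return the next character, or EOF at the end of the string. */
int
next_char(struct reader *r)
{
    return *r->p ? (unsigned char)*r->p++ : EOF;
}

/*
 * Consume text up to and including the end of a C comment; the opening
 * delimiter is assumed to have been read already.
 * Returns 0 when the end was found, -1 if the input ran out first.
 */
int
skip_comment(struct reader *r)
{
    int c;

    for (;;) {
        while ((c = next_char(r)) != '*' && c != EOF)
            ;           /* eat up text of comment */

        if (c == '*') {
            while ((c = next_char(r)) == '*')
                ;       /* a run of stars may precede the slash */
            if (c == '/')
                return 0;
        }

        if (c == EOF)
            return -1;
    }
}
```

The inner loop over repeated stars matters:
without it, a comment ending in two or more stars followed by a slash
would not be recognized as closed.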
1161: .It YY_FLUSH_BUFFER
1162: Flushes the scanner's internal buffer
1163: so that the next time the scanner attempts to match a token,
1164: it will first refill the buffer using
1165: .Dv YY_INPUT
1166: (see
1167: .Sx THE GENERATED SCANNER ,
1168: below).
1169: This action is a special case of the more general
1170: .Fn yy_flush_buffer
1171: function, described below in the section
1172: .Sx MULTIPLE INPUT BUFFERS .
1173: .It yyterminate()
1174: Can be used in lieu of a return statement in an action.
1175: It terminates the scanner and returns a 0 to the scanner's caller, indicating
1176: .Qq all done .
1.1 deraadt 1177: By default,
1.16 jmc 1178: .Fn yyterminate
1179: is also called when an end-of-file is encountered.
1180: It is a macro and may be redefined.
1181: .El
1182: .Sh THE GENERATED SCANNER
1.1 deraadt 1183: The output of
1.16 jmc 1184: .Nm
1.1 deraadt 1185: is the file
1.16 jmc 1186: .Pa lex.yy.c ,
1.1 deraadt 1187: which contains the scanning routine
1.16 jmc 1188: .Fn yylex ,
1189: a number of tables used by it for matching tokens,
1190: and a number of auxiliary routines and macros.
1191: By default,
1192: .Fn yylex
1.1 deraadt 1193: is declared as follows:
1.16 jmc 1194: .Bd -unfilled -offset indent
1195: int yylex()
1196: {
1197: ... various definitions and the actions in here ...
1198: }
1199: .Ed
1200: .Pp
1201: (If the environment supports function prototypes, then it will
1202: be "int yylex(void)".)
1203: This definition may be changed by defining the
1204: .Dv YY_DECL
1205: macro.
1206: For example:
1207: .Bd -literal -offset indent
1208: #define YY_DECL float lexscan(a, b) float a, b;
1209: .Ed
1210: .Pp
1211: would give the scanning routine the name
1212: .Em lexscan ,
1213: returning a float, and taking two floats as arguments.
1214: Note that if arguments are given to the scanning routine using a
1215: K&R-style/non-prototyped function declaration,
1216: the definition must be terminated with a semicolon
1217: .Pq Sq ;\& .
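.Pp
Where the compiler accepts function prototypes,
the same routine can presumably be declared in prototype style,
in which case no trailing semicolon is needed.
The following plain C sketch only shows how such a macro expands;
the function body is a placeholder, since in a real scanner the body is
supplied by
.Nm .

```c
/* Prototype-style counterpart of the K&R definition shown above. */
#define YY_DECL float lexscan(float a, float b)

/* Callers in other source files need a matching declaration. */
float lexscan(float a, float b);

/*
 * Placeholder body for this sketch; it only demonstrates that the
 * macro expands to an ordinary function header.
 */
YY_DECL
{
    return a + b;
}
```

Whichever style is used, code that calls the scanner must declare lexscan
with the same return type and arguments.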
1218: .Pp
1.1 deraadt 1219: Whenever
1.16 jmc 1220: .Fn yylex
1.1 deraadt 1221: is called, it scans tokens from the global input file
1.16 jmc 1222: .Em yyin
1223: .Pq which defaults to stdin .
1224: It continues until it either reaches an end-of-file
1225: .Pq at which point it returns the value 0
1226: or one of its actions executes a
1227: .Em return
1.1 deraadt 1228: statement.
1.16 jmc 1229: .Pp
1.1 deraadt 1230: If the scanner reaches an end-of-file, subsequent calls are undefined
1231: unless either
1.16 jmc 1232: .Em yyin
1233: is pointed at a new input file
1234: .Pq in which case scanning continues from that file ,
1235: or
1236: .Fn yyrestart
1.1 deraadt 1237: is called.
1.16 jmc 1238: .Fn yyrestart
1.1 deraadt 1239: takes one argument, a
1.16 jmc 1240: .Fa FILE *
1241: pointer (which can be NULL, if
1242: .Dv YY_INPUT
1243: has been set up to scan from a source other than
1244: .Em yyin ) ,
1.1 deraadt 1245: and initializes
1.16 jmc 1246: .Em yyin
1247: for scanning from that file.
1248: Essentially there is no difference between just assigning
1249: .Em yyin
1.1 deraadt 1250: to a new input file and using
1.16 jmc 1251: .Fn yyrestart
1252: to do so; the latter is available for compatibility with previous versions of
1253: .Nm ,
1.1 deraadt 1254: and because it can be used to switch input files in the middle of scanning.
1.16 jmc 1255: It can also be used to throw away the current input buffer,
1256: by calling it with an argument of
1257: .Em yyin ;
1.1 deraadt 1258: but it is better to use
1.16 jmc 1259: .Dv YY_FLUSH_BUFFER
1260: .Pq see above .
1.1 deraadt 1261: Note that
1.16 jmc 1262: .Fn yyrestart
1263: does not reset the start condition to
1264: .Em INITIAL
1265: (see
1266: .Sx START CONDITIONS ,
1267: below).
1268: .Pp
1.1 deraadt 1269: If
1.16 jmc 1270: .Fn yylex
1.1 deraadt 1271: stops scanning due to executing a
1.16 jmc 1272: .Em return
1.1 deraadt 1273: statement in one of the actions, the scanner may then be called again and it
1274: will resume scanning where it left off.
1.16 jmc 1275: .Pp
1276: By default
1277: .Pq and for purposes of efficiency ,
1278: the scanner uses block-reads rather than simple
1279: .Xr getc 3
1.1 deraadt 1280: calls to read characters from
1.16 jmc 1281: .Em yyin .
1.1 deraadt 1282: The nature of how it gets its input can be controlled by defining the
1.16 jmc 1283: .Dv YY_INPUT
1.1 deraadt 1284: macro.
1.16 jmc 1285: .Dv YY_INPUT Ns 's
1286: calling sequence is
1287: .Qq YY_INPUT(buf,result,max_size) .
1288: Its action is to place up to
1289: .Dv max_size
1.1 deraadt 1290: characters in the character array
1.16 jmc 1291: .Em buf
1.1 deraadt 1292: and return in the integer variable
1.16 jmc 1293: .Em result
1294: either the number of characters read or the constant
1295: .Dv YY_NULL
1296: (0 on
1297: .Ux
1298: systems)
1299: to indicate
1300: .Dv EOF .
1301: The default
1302: .Dv YY_INPUT
1303: reads from the global file-pointer
1304: .Qq yyin .
1305: .Pp
1306: A sample definition of
1307: .Dv YY_INPUT
1308: .Pq in the definitions section of the input file :
1309: .Bd -unfilled -offset indent
1310: %{
1311: #define YY_INPUT(buf,result,max_size) \e
1312:     { \e
1313:         int c = getchar(); \e
1314:         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1315:     }
1316: %}
1317: .Ed
1318: .Pp
1.1 deraadt 1319: This definition will change the input processing to occur
1320: one character at a time.
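.Pp
The same contract (store up to max_size characters, report how many were
stored, report 0 at end of input) can be met by a function that reads from
memory.
The following is a sketch under stated assumptions:
mem_source and mem_read are hypothetical names,
and the value 0 is used directly where a scanner would use YY_NULL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source. */
struct mem_source {
    const char *data;   /* bytes to hand to the scanner */
    size_t len;         /* total number of bytes */
    size_t pos;         /* number of bytes already handed out */
};

/*
 * Copy up to max_size bytes into buf and return the number copied,
 * or 0 once the source is exhausted.
 */
int
mem_read(struct mem_source *src, char *buf, int max_size)
{
    size_t left = src->len - src->pos;
    size_t n;

    if (max_size <= 0 || left == 0)
        return 0;       /* end of input */
    n = left < (size_t)max_size ? left : (size_t)max_size;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}
```

A YY_INPUT definition could then assign the return value of such a function
to result; reading in blocks rather than one character at a time keeps the
number of calls low.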
1.16 jmc 1321: .Pp
1322: When the scanner receives an end-of-file indication from
1323: .Dv YY_INPUT ,
1.1 deraadt 1324: it then checks the
1.16 jmc 1325: .Fn yywrap
1326: function.
1327: If
1328: .Fn yywrap
1329: returns false
1330: .Pq zero ,
1331: then it is assumed that the function has gone ahead and set up
1332: .Em yyin
1333: to point to another input file, and scanning continues.
1334: If it returns true
1335: .Pq non-zero ,
1336: then the scanner terminates, returning 0 to its caller.
1337: Note that in either case, the start condition remains unchanged;
1338: it does not revert to
1339: .Em INITIAL .
1340: .Pp
1.1 deraadt 1341: If you do not supply your own version of
1.16 jmc 1342: .Fn yywrap ,
1.1 deraadt 1343: then you must either use
1.16 jmc 1344: .Dq %option noyywrap
1.1 deraadt 1345: (in which case the scanner behaves as though
1.16 jmc 1346: .Fn yywrap
1.1 deraadt 1347: returned 1), or you must link with
1.16 jmc 1348: .Fl lfl
1.1 deraadt 1349: to obtain the default version of the routine, which always returns 1.
1.16 jmc 1350: .Pp
1.1 deraadt 1351: Three routines are available for scanning from in-memory buffers rather
1352: than files:
1.16 jmc 1353: .Fn yy_scan_string ,
1354: .Fn yy_scan_bytes ,
1.1 deraadt 1355: and
1.16 jmc 1356: .Fn yy_scan_buffer .
1357: See the discussion of them below in the section
1358: .Sx MULTIPLE INPUT BUFFERS .
1359: .Pp
1.1 deraadt 1360: The scanner writes its
1.16 jmc 1361: .Em ECHO
1.1 deraadt 1362: output to the
1.16 jmc 1363: .Em yyout
1364: global
1365: .Pq default, stdout ,
1366: which may be redefined by the user simply by assigning it to some other
1367: .Va FILE
1.1 deraadt 1368: pointer.
1.16 jmc 1369: .Sh START CONDITIONS
1370: .Nm
1371: provides a mechanism for conditionally activating rules.
1372: Any rule whose pattern is prefixed with
1373: .Qq Aq sc
1374: will only be active when the scanner is in the start condition named
1375: .Qq sc .
1376: For example,
1377: .Bd -literal -offset indent
1378: <STRING>[^"]* { /* eat up the string body ... */
1379: ...
1380: }
1381: .Ed
1382: .Pp
1383: will be active only when the scanner is in the
1384: .Qq STRING
1385: start condition, and
1386: .Bd -literal -offset indent
1387: <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1388: ...
1389: }
1390: .Ed
1391: .Pp
1392: will be active only when the current start condition is either
1393: .Qq INITIAL ,
1394: .Qq STRING ,
1395: or
1396: .Qq QUOTE .
1397: .Pp
1398: Start conditions are declared in the definitions
1399: .Pq first
1400: section of the input using unindented lines beginning with either
1401: .Sq %s
1.1 deraadt 1402: or
1.16 jmc 1403: .Sq %x
1.1 deraadt 1404: followed by a list of names.
1405: The former declares
1.16 jmc 1406: .Em inclusive
1.1 deraadt 1407: start conditions, the latter
1.16 jmc 1408: .Em exclusive
1409: start conditions.
1410: A start condition is activated using the
1411: .Em BEGIN
1412: action.
1413: Until the next
1414: .Em BEGIN
1415: action is executed, rules with the given start condition will be active and
1.1 deraadt 1416: rules with other start conditions will be inactive.
1.16 jmc 1417: If the start condition is inclusive,
1.1 deraadt 1418: then rules with no start conditions at all will also be active.
1.16 jmc 1419: If it is exclusive,
1420: then only rules qualified with the start condition will be active.
1.1 deraadt 1421: A set of rules contingent on the same exclusive start condition
1422: describes a scanner which is independent of any of the other rules in the
1.16 jmc 1423: .Nm
1424: input.
1425: Because of this, exclusive start conditions make it easy to specify
1426: .Qq mini-scanners
1.1 deraadt 1427: which scan portions of the input that are syntactically different
1.16 jmc 1428: from the rest
1429: .Pq e.g., comments .
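.Pp
The activation rules just described can be restated as a small predicate.
This is a sketch of the rules as written here, not of how
.Nm
implements them;
the rule structure and rule_active function are hypothetical,
and the wildcard specifier is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one rule's start-condition prefix. */
struct rule {
    const int *conds;   /* start conditions qualifying the rule */
    size_t nconds;      /* 0 for a rule with no start conditions */
};

/*
 * Decide whether a rule is active, given the current start condition
 * and whether that condition was declared exclusive.
 */
bool
rule_active(const struct rule *r, int current, bool current_is_exclusive)
{
    size_t i;

    /* Unqualified rules are active unless the condition is exclusive. */
    if (r->nconds == 0)
        return !current_is_exclusive;

    /* Qualified rules are active only in the conditions they name. */
    for (i = 0; i < r->nconds; i++) {
        if (r->conds[i] == current)
            return true;
    }
    return false;
}
```

The initial start condition behaves as an inclusive one here,
so unqualified rules are active in it.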
1430: .Pp
1.1 deraadt 1431: If the distinction between inclusive and exclusive start conditions
1432: is still a little vague, here's a simple example illustrating the
1.16 jmc 1433: connection between the two.
1434: The set of rules:
1435: .Bd -literal -offset indent
1436: %s example
1437: %%
1438:
1439: <example>foo do_something();
1440:
1441: bar something_else();
1442: .Ed
1443: .Pp
1.1 deraadt 1444: is equivalent to
1.16 jmc 1445: .Bd -literal -offset indent
1446: %x example
1447: %%
1448:
1449: <example>foo do_something();
1450:
1451: <INITIAL,example>bar something_else();
1452: .Ed
1453: .Pp
1.1 deraadt 1454: Without the
1.16 jmc 1455: .Aq INITIAL,example
1.1 deraadt 1456: qualifier, the
1.16 jmc 1457: .Dq bar
1458: pattern in the second example wouldn't be active
1459: .Pq i.e., couldn't match
1.1 deraadt 1460: when in start condition
1.16 jmc 1461: .Dq example .
1.1 deraadt 1462: If we just used
1.16 jmc 1463: .Aq example
1.1 deraadt 1464: to qualify
1.16 jmc 1465: .Dq bar ,
1.1 deraadt 1466: though, then it would only be active in
1.16 jmc 1467: .Dq example
1.1 deraadt 1468: and not in
1.16 jmc 1469: .Em INITIAL ,
1470: while in the first example it's active in both,
1471: because in the first example the
1472: .Dq example
1473: start condition is an inclusive
1474: .Pq Sq %s
1.1 deraadt 1475: start condition.
1.16 jmc 1476: .Pp
1.1 deraadt 1477: Also note that the special start-condition specifier
1.16 jmc 1478: .Sq Aq *
1479: matches every start condition.
1480: Thus, the above example could also have been written:
1481: .Bd -literal -offset indent
1482: %x example
1483: %%
1484:
1485: <example>foo do_something();
1486:
1487: <*>bar something_else();
1488: .Ed
1489: .Pp
1.1 deraadt 1490: The default rule (to
1.16 jmc 1491: .Em ECHO
1492: any unmatched character) remains active in start conditions.
1493: It is equivalent to:
1494: .Bd -literal -offset indent
1495: <*>.|\en ECHO;
1496: .Ed
1497: .Pp
1498: .Dq BEGIN(0)
1.1 deraadt 1499: returns to the original state where only the rules with
1.16 jmc 1500: no start conditions are active.
1501: This state can also be referred to as the start-condition
1502: .Em INITIAL ,
1503: so
1504: .Dq BEGIN(INITIAL)
1.1 deraadt 1505: is equivalent to
1.16 jmc 1506: .Dq BEGIN(0) .
1.1 deraadt 1507: (The parentheses around the start condition name are not required but
1508: are considered good style.)
1.16 jmc 1509: .Pp
1510: .Em BEGIN
1.1 deraadt 1511: actions can also be given as indented code at the beginning
1.16 jmc 1512: of the rules section.
1513: For example, the following will cause the scanner to enter the
1514: .Qq SPECIAL
1515: start condition whenever
1516: .Fn yylex
1.1 deraadt 1517: is called and the global variable
1.16 jmc 1518: .Fa enter_special
1.1 deraadt 1519: is true:
1.16 jmc 1520: .Bd -literal -offset indent
1521:     int enter_special;
1.1 deraadt 1522:
1.16 jmc 1523: %x SPECIAL
1524: %%
1525:     if (enter_special)
1.1 deraadt 1526:         BEGIN(SPECIAL);
1527:
1.16 jmc 1528: <SPECIAL>blahblahblah
1529: \&...more rules follow...
1530: .Ed
1531: .Pp
1.1 deraadt 1532: To illustrate the uses of start conditions,
1533: here is a scanner which provides two different interpretations
1.16 jmc 1534: of a string like
1535: .Qq 123.456 .
1536: By default it will treat it as three tokens: the integer
1537: .Qq 123 ,
1538: a dot
1539: .Pq Sq .\& ,
1540: and the integer
1541: .Qq 456 .
1.1 deraadt 1542: But if the string is preceded earlier in the line by the string
1.16 jmc 1543: .Qq expect-floats
1544: it will treat it as a single token, the floating-point number 123.456:
1545: .Bd -literal -offset indent
1546: %{
1547: #include <stdlib.h>
1548: %}
1549: %s expect
1550:
1551: %%
1552: expect-floats BEGIN(expect);
1553:
1554: <expect>[0-9]+"."[0-9]+ {
1555:     printf("found a float, = %f\en",
1556:         atof(yytext));
1557: }
1558: <expect>\en {
1559:     /*
1560:      * That's the end of the line, so
1561:      * we need another "expect-floats"
1562:      * before we'll recognize any more
1563:      * numbers.
1564:      */
1565:     BEGIN(INITIAL);
1566: }
1567:
1568: [0-9]+ {
1569:     printf("found an integer, = %d\en",
1570:         atoi(yytext));
1571: }
1572:
1573: "." printf("found a dot\en");
1574: .Ed
1575: .Pp
1576: Here is a scanner which recognizes
1577: .Pq and discards
1578: C comments while maintaining a count of the current input line:
1579: .Bd -literal -offset indent
1580: %x comment
1581: %%
1582:     int line_num = 1;
1583:
1584: "/*" BEGIN(comment);
1585:
1586: <comment>[^*\en]* /* eat anything that's not a '*' */
1587: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1588: <comment>\en ++line_num;
1589: <comment>"*"+"/" BEGIN(INITIAL);
1590: .Ed
1591: .Pp
1.1 deraadt 1592: This scanner goes to a bit of trouble to match as much
1.16 jmc 1593: text as possible with each rule.
1594: In general, when attempting to write a high-speed scanner,
1595: try to match as much as possible in each rule, as it's a big win.
1596: .Pp
1.10 deraadt 1597: Note that start-condition names are really integer values and
1.16 jmc 1598: can be stored as such.
1599: Thus, the above could be extended in the following fashion:
1600: .Bd -literal -offset indent
1601: %x comment foo
1602: %%
1603:     int line_num = 1;
1604:     int comment_caller;
1605:
1606: "/*" {
1607:     comment_caller = INITIAL;
1608:     BEGIN(comment);
1609: }
1610:
1611: \&...
1612:
1613: <foo>"/*" {
1614:     comment_caller = foo;
1615:     BEGIN(comment);
1616: }
1617:
1618: <comment>[^*\en]* /* eat anything that's not a '*' */
1619: <comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1620: <comment>\en ++line_num;
1621: <comment>"*"+"/" BEGIN(comment_caller);
1622: .Ed
1623: .Pp
1624: Furthermore, the current start condition can be accessed by using
1.1 deraadt 1625: the integer-valued
1.16 jmc 1626: .Dv YY_START
1627: macro.
1628: For example, the above assignments to
1629: .Em comment_caller
1.1 deraadt 1630: could instead be written
1.16 jmc 1631: .Pp
1632: .Dl comment_caller = YY_START;
1633: .Pp
1.1 deraadt 1634: Flex provides
1.16 jmc 1635: .Dv YYSTATE
1.1 deraadt 1636: as an alias for
1.16 jmc 1637: .Dv YY_START
1.1 deraadt 1638: (since that is what's used by AT&T
1.16 jmc 1639: .Nm lex ) .
1640: .Pp
1641: Note that start conditions do not have their own name-space;
1642: %s's and %x's declare names in the same fashion as #define's.
1643: .Pp
1.1 deraadt 1644: Finally, here's an example of how to match C-style quoted strings using
1.16 jmc 1645: exclusive start conditions, including expanded escape sequences
1646: (but not including checking for a string that's too long):
1647: .Bd -literal -offset indent
1648: %x str
1649:
1650: %%
1651:     #define MAX_STR_CONST 1024
1652:     char string_buf[MAX_STR_CONST];
1653:     char *string_buf_ptr;
1654:
1655: \e" string_buf_ptr = string_buf; BEGIN(str);
1656:
1657: <str>\e" { /* saw closing quote - all done */
1658:     BEGIN(INITIAL);
1659:     *string_buf_ptr = '\e0';
1660:     /*
1661:      * return string constant token type and
1662:      * value to parser
1663:      */
1664: }
1665:
1666: <str>\en {
1667:     /* error - unterminated string constant */
1668:     /* generate error message */
1669: }
1670:
1671: <str>\e\e[0-7]{1,3} {
1672:     /* octal escape sequence */
1673:     unsigned int result;
1674:
1675:     (void) sscanf(yytext + 1, "%o", &result);
1676:
1677:     if (result > 0xff) {
1678:         /* error, constant is out-of-bounds */
1679:     } else
1680:         *string_buf_ptr++ = result;
1681: }
1682:
1683: <str>\e\e[0-9]+ {
1684:     /*
1685:      * generate error - bad escape sequence; something
1686:      * like '\e48' or '\e0777777'
1687:      */
1688: }
1689:
1690: <str>\e\en *string_buf_ptr++ = '\en';
1691: <str>\e\et *string_buf_ptr++ = '\et';
1692: <str>\e\er *string_buf_ptr++ = '\er';
1693: <str>\e\eb *string_buf_ptr++ = '\eb';
1694: <str>\e\ef *string_buf_ptr++ = '\ef';
1695:
1696: <str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1697:
1698: <str>[^\e\e\en\e"]+ {
1699:     char *yptr = yytext;
1700:
1701:     while (*yptr)
1702:         *string_buf_ptr++ = *yptr++;
1703: }
1704: .Ed
1705: .Pp
1706: Often, such as in some of the examples above,
1707: a whole bunch of rules are all preceded by the same start condition(s).
1708: .Nm
1.1 deraadt 1709: makes this a little easier and cleaner by introducing a notion of
1710: start condition
1.16 jmc 1711: .Em scope .
1.1 deraadt 1712: A start condition scope is begun with:
1.16 jmc 1713: .Pp
1714: .Dl <SCs>{
1715: .Pp
1.1 deraadt 1716: where
1.16 jmc 1717: .Dq SCs
1718: is a list of one or more start conditions.
1719: Inside the start condition scope, every rule automatically has the prefix
1720: .Aq SCs
1.1 deraadt 1721: applied to it, until a
1.16 jmc 1722: .Sq }
1.1 deraadt 1723: which matches the initial
1.16 jmc 1724: .Sq { .
1.1 deraadt 1725: So, for example,
1.16 jmc 1726: .Bd -literal -offset indent
1727: <ESC>{
1728: "\e\en" return '\en';
1729: "\e\er" return '\er';
1730: "\e\ef" return '\ef';
1731: "\e\e0" return '\e0';
1732: }
1733: .Ed
1734: .Pp
1.1 deraadt 1735: is equivalent to:
1.16 jmc 1736: .Bd -literal -offset indent
1737: <ESC>"\e\en" return '\en';
1738: <ESC>"\e\er" return '\er';
1739: <ESC>"\e\ef" return '\ef';
1740: <ESC>"\e\e0" return '\e0';
1741: .Ed
1742: .Pp
1.1 deraadt 1743: Start condition scopes may be nested.
1.16 jmc 1744: .Pp
1.1 deraadt 1745: Three routines are available for manipulating stacks of start conditions:
1.16 jmc 1746: .Bl -tag -width Ds
1747: .It void yy_push_state(int new_state)
1748: Pushes the current start condition onto the top of the start condition
1.1 deraadt 1749: stack and switches to
1.16 jmc 1750: .Fa new_state
1751: as though
1752: .Dq BEGIN new_state
1753: had been used
1754: .Pq recall that start condition names are also integers .
1755: .It void yy_pop_state()
1756: Pops the top of the stack and switches to it via
1757: .Em BEGIN .
1758: .It int yy_top_state()
1759: Returns the top of the stack without altering the stack's contents.
1760: .El
1761: .Pp
1.1 deraadt 1762: The start condition stack grows dynamically and so has no built-in
1.16 jmc 1763: size limitation.
1764: If memory is exhausted, program execution aborts.
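.Pp
The behaviour of the three routines can be sketched in plain C.
This is an illustration under stated assumptions, not the code generated by
.Nm :
the sc_stack structure and the sc_push, sc_pop, and sc_top functions are
hypothetical, and the current start condition is kept in the structure
rather than in the scanner.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical start condition stack plus the current condition. */
struct sc_stack {
    int *items;     /* saved start conditions */
    size_t depth;   /* number of saved conditions */
    size_t cap;     /* allocated capacity */
    int current;    /* the condition a BEGIN would have set */
};

/* Save the current condition and switch to new_state. */
void
sc_push(struct sc_stack *s, int new_state)
{
    if (s->depth == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 8;
        int *p = realloc(s->items, ncap * sizeof(*p));

        if (p == NULL)
            abort();    /* out of memory: give up, as described above */
        s->items = p;
        s->cap = ncap;
    }
    s->items[s->depth++] = s->current;
    s->current = new_state;
}

/* Switch back to the most recently saved condition. */
void
sc_pop(struct sc_stack *s)
{
    if (s->depth == 0)
        abort();        /* popping an empty stack is a caller error */
    s->current = s->items[--s->depth];
}

/* Return the most recently saved condition without removing it. */
int
sc_top(const struct sc_stack *s)
{
    if (s->depth == 0)
        abort();
    return s->items[s->depth - 1];
}
```

Note that the top of the stack holds the condition that was current before
the last push, not the condition now in effect.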
1765: .Pp
1766: To use start condition stacks, scanners must include a
1767: .Dq %option stack
1768: directive (see
1769: .Sx OPTIONS
1770: below).
1771: .Sh MULTIPLE INPUT BUFFERS
1772: Some scanners
1773: (such as those which support
1774: .Qq include
1775: files)
1776: require reading from several input streams.
1777: As
1778: .Nm
1.1 deraadt 1779: scanners do a large amount of buffering, one cannot control
1780: where the next input will be read from by simply writing a
1.16 jmc 1781: .Dv YY_INPUT
1.1 deraadt 1782: which is sensitive to the scanning context.
1.16 jmc 1783: .Dv YY_INPUT
1.1 deraadt 1784: is only called when the scanner reaches the end of its buffer, which
1.16 jmc 1785: may be a long time after scanning a statement such as an
1786: .Qq include
1.1 deraadt 1787: which requires switching the input source.
1.16 jmc 1788: .Pp
1.1 deraadt 1789: To negotiate these sorts of problems,
1.16 jmc 1790: .Nm
1.1 deraadt 1791: provides a mechanism for creating and switching between multiple
1.16 jmc 1792: input buffers.
1793: An input buffer is created by using:
1794: .Pp
1795: .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1796: .Pp
1.1 deraadt 1797: which takes a
1.16 jmc 1798: .Fa FILE
1799: pointer and a
1800: .Fa size
1801: and creates a buffer associated with the given file and large enough to hold
1802: .Fa size
1.1 deraadt 1803: characters (when in doubt, use
1.16 jmc 1804: .Dv YY_BUF_SIZE
1805: for the size).
1806: It returns a
1807: .Dv YY_BUFFER_STATE
1808: handle, which may then be passed to other routines
1809: .Pq see below .
1810: The
1811: .Dv YY_BUFFER_STATE
1.1 deraadt 1812: type is a pointer to an opaque
1.16 jmc 1813: .Dq struct yy_buffer_state
1814: structure, so
1815: .Dv YY_BUFFER_STATE
1816: variables may be safely initialized to
1817: .Dq ((YY_BUFFER_STATE) 0)
1818: if desired, and the opaque structure can also be referred to in order to
1819: correctly declare input buffers in source files other than that of scanners.
1820: Note that the
1821: .Fa FILE
1.1 deraadt 1822: pointer in the call to
1.16 jmc 1823: .Fn yy_create_buffer
1.1 deraadt 1824: is only used as the value of
1.16 jmc 1825: .Fa yyin
1.1 deraadt 1826: seen by
1.16 jmc 1827: .Dv YY_INPUT ;
1828: if
1829: .Dv YY_INPUT
1830: is redefined so that it no longer uses
1831: .Fa yyin ,
1832: then a NULL
1833: .Fa FILE
1834: pointer can safely be passed to
1835: .Fn yy_create_buffer .
1836: To select a particular buffer to scan:
1837: .Pp
1838: .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1839: .Pp
1840: It switches the scanner's input buffer so subsequent tokens will
1.1 deraadt 1841: come from
1.16 jmc 1842: .Fa new_buffer .
1.1 deraadt 1843: Note that
1.16 jmc 1844: .Fn yy_switch_to_buffer
1845: may be used by
1846: .Fn yywrap
1847: to set things up for continued scanning,
1848: instead of opening a new file and pointing
1849: .Fa yyin
1850: at it.
1851: Note also that switching input sources via either
1852: .Fn yy_switch_to_buffer
1853: or
1854: .Fn yywrap
1855: does not change the start condition.
1856: .Pp
1857: .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1858: .Pp
1859: is used to reclaim the storage associated with a buffer.
1860: .Pf ( Fa buffer
1.1 deraadt 1861: can be NULL, in which case the routine does nothing.)
1.16 jmc 1862: To clear the current contents of a buffer:
1863: .Pp
1864: .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1865: .Pp
1.1 deraadt 1866: This function discards the buffer's contents,
1.16 jmc 1867: so the next time the scanner attempts to match a token from the buffer,
1868: it will first fill the buffer anew using
1869: .Dv YY_INPUT .
1870: .Pp
1871: .Fn yy_new_buffer
1.1 deraadt 1872: is an alias for
1.16 jmc 1873: .Fn yy_create_buffer ,
1.1 deraadt 1874: provided for compatibility with the C++ use of
1.16 jmc 1875: .Em new
1.1 deraadt 1876: and
1.16 jmc 1877: .Em delete
1.1 deraadt 1878: for creating and destroying dynamic objects.
1.16 jmc 1879: .Pp
1.1 deraadt 1880: Finally, the
1.16 jmc 1881: .Dv YY_CURRENT_BUFFER
1.1 deraadt 1882: macro returns a
1.16 jmc 1883: .Dv YY_BUFFER_STATE
1.1 deraadt 1884: handle to the current buffer.
1.16 jmc 1885: .Pp
1.1 deraadt 1886: Here is an example of using these features for writing a scanner
1887: which expands include files (the
1.16 jmc 1888: .Aq Aq EOF
1.1 deraadt 1889: feature is discussed below):
1.16 jmc 1890: .Bd -literal -offset indent
1891: /*
1892: * the "incl" state is used for picking up the name
1893: * of an include file
1894: */
1895: %x incl
1896:
1897: %{
1898: #define MAX_INCLUDE_DEPTH 10
1899: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1900: int include_stack_ptr = 0;
1901: %}
1902:
1903: %%
1904: include BEGIN(incl);
1905:
1906: [a-z]+ ECHO;
1907: [^a-z\en]*\en? ECHO;
1908:
1909: <incl>[ \et]* /* eat the whitespace */
1910: <incl>[^ \et\en]+ { /* got the include file name */
1911:     if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1912:         errx(1, "Includes nested too deeply");
1913:
1914:     include_stack[include_stack_ptr++] =
1915:         YY_CURRENT_BUFFER;
1916:
1917:     yyin = fopen(yytext, "r");
1918:
1919:     if (yyin == NULL)
1920:         err(1, NULL);
1.1 deraadt 1921:
1.16 jmc 1922:     yy_switch_to_buffer(
1923:         yy_create_buffer(yyin, YY_BUF_SIZE));
1.1 deraadt 1924:
1.16 jmc 1925:     BEGIN(INITIAL);
1926: }
1.1 deraadt 1927:
1.16 jmc 1928: <<EOF>> {
1929:     if (--include_stack_ptr < 0)
1.1 deraadt 1930:         yyterminate();
1.16 jmc 1931:     else {
1932:         yy_delete_buffer(YY_CURRENT_BUFFER);
1.1 deraadt 1933:         yy_switch_to_buffer(
1.16 jmc 1934:             include_stack[include_stack_ptr]);
1935:     }
1936: }
1937: .Ed
1938: .Pp
1.1 deraadt 1939: Three routines are available for setting up input buffers for
1.16 jmc 1940: scanning in-memory strings instead of files.
1941: All of them create a new input buffer for scanning the string,
1942: and return a corresponding
1943: .Dv YY_BUFFER_STATE
1944: handle (which should be deleted afterwards using
1945: .Fn yy_delete_buffer ) .
1946: They also switch to the new buffer using
1947: .Fn yy_switch_to_buffer ,
1.1 deraadt 1948: so the next call to
1.16 jmc 1949: .Fn yylex
1.1 deraadt 1950: will start scanning the string.
1.16 jmc 1951: .Bl -tag -width Ds
1952: .It yy_scan_string(const char *str)
1953: Scans a NUL-terminated string.
1954: .It yy_scan_bytes(const char *bytes, int len)
1955: Scans
1956: .Fa len
1957: bytes
1958: .Pq including possibly NUL's
1.1 deraadt 1959: starting at location
1.16 jmc 1960: .Fa bytes .
1961: .El
1962: .Pp
1963: Note that both of these functions create and scan a copy
1964: of the string or bytes.
1965: (This may be desirable, since
1966: .Fn yylex
1967: modifies the contents of the buffer it is scanning.)
1968: The copy can be avoided by using:
1969: .Bl -tag -width Ds
1970: .It yy_scan_buffer(char *base, yy_size_t size)
1971: Which scans the buffer starting at
1972: .Fa base ,
1.1 deraadt 1973: consisting of
1.16 jmc 1974: .Fa size
1975: bytes, the last two bytes of which must be
1976: .Dv YY_END_OF_BUFFER_CHAR
1977: .Pq ASCII NUL .
1978: These last two bytes are not scanned; thus, scanning consists of
1979: base[0] through base[size-3], inclusive.
1980: .Pp
1981: If
1982: .Fa base
1983: is not set up in this manner
1984: (i.e., the final two
1985: .Dv YY_END_OF_BUFFER_CHAR
1.1 deraadt 1986: bytes are missing), then
1.16 jmc 1987: .Fn yy_scan_buffer
1.1 deraadt 1988: returns a NULL pointer instead of creating a new input buffer.
1.16 jmc 1989: .Pp
1.1 deraadt 1990: The type
1.16 jmc 1991: .Fa yy_size_t
1992: is an integral type to which an integer expression
1.1 deraadt 1993: reflecting the size of the buffer can be cast.
1.16 jmc 1994: .El
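.Pp
Preparing a buffer with the two required terminating bytes can be sketched
as follows.
The helper name make_scan_buffer is hypothetical,
and size_t is used here where a scanner would use yy_size_t;
the terminator is written as 0 on the assumption stated above that
YY_END_OF_BUFFER_CHAR is ASCII NUL.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy text into a newly allocated buffer followed by two NUL bytes.
 * On success the total size, including both terminators, is stored
 * in *size_out; the caller frees the buffer when done with it.
 */
char *
make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len = strlen(text);
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = 0;      /* first terminating byte */
    base[len + 1] = 0;  /* second terminating byte */
    *size_out = len + 2;
    return base;
}
```

The returned pointer and size would then be the two arguments passed to the
scanning routine.
Because that routine scans the memory in place, the buffer must remain
allocated until scanning is finished.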
1995: .Sh END-OF-FILE RULES
1996: The special rule
1997: .Qq Aq Aq EOF
1998: indicates actions which are to be taken when an end-of-file is encountered and
1999: .Fn yywrap
2000: returns non-zero
2001: .Pq i.e., indicates no further files to process .
2002: The action must finish by doing one of four things:
2003: .Bl -dash
2004: .It
2005: Assigning
2006: .Em yyin
2007: to a new input file
2008: (in previous versions of
2009: .Nm ,
2010: after doing the assignment, it was necessary to call the special action
2011: .Dv YY_NEW_FILE ;
2012: this is no longer necessary).
2013: .It
2014: Executing a
2015: .Em return
2016: statement.
2017: .It
2018: Executing the special
2019: .Fn yyterminate
2020: action.
2021: .It
2022: Switching to a new buffer using
2023: .Fn yy_switch_to_buffer
1.1 deraadt 2024: as shown in the example above.
1.16 jmc 2025: .El
2026: .Pp
2027: .Aq Aq EOF
2028: rules may not be used with other patterns;
2029: they may only be qualified with a list of start conditions.
2030: If an unqualified
2031: .Aq Aq EOF
2032: rule is given, it applies to all start conditions which do not already have
2033: .Aq Aq EOF
2034: actions.
2035: To specify an
2036: .Aq Aq EOF
2037: rule for only the initial start condition, use
2038: .Pp
2039: .Dl <INITIAL><<EOF>>
2040: .Pp
1.1 deraadt 2041: These rules are useful for catching things like unclosed comments.
2042: An example:
1.16 jmc 2043: .Bd -literal -offset indent
2044: %x quote
2045: %%
2046:
2047: \&...other rules for dealing with quotes...
2048:
2049: <quote><<EOF>> {
2050:     error("unterminated quote");
2051:     yyterminate();
2052: }
2053: <<EOF>> {
2054:     if (*++filelist)
2055:         yyin = fopen(*filelist, "r");
2056:     else
2057:         yyterminate();
2058: }
2059: .Ed
2060: .Sh MISCELLANEOUS MACROS
1.1 deraadt 2061: The macro
1.16 jmc 2062: .Dv YY_USER_ACTION
1.1 deraadt 2063: can be defined to provide an action
1.16 jmc 2064: which is always executed prior to the matched rule's action.
2065: For example,
1.1 deraadt 2066: it could be #define'd to call a routine to convert yytext to lower-case.
2067: When
1.16 jmc 2068: .Dv YY_USER_ACTION
1.1 deraadt 2069: is invoked, the variable
1.16 jmc 2070: .Fa yy_act
2071: gives the number of the matched rule
2072: .Pq rules are numbered starting with 1 .
2073: For example, to profile how often each rule is matched,
2074: the following would do the trick:
2075: .Pp
2076: .Dl #define YY_USER_ACTION ++ctr[yy_act]
2077: .Pp
1.1 deraadt 2078: where
1.16 jmc 2079: .Fa ctr
2080: is an array to hold the counts for the different rules.
2081: Note that the macro
2082: .Dv YY_NUM_RULES
2083: gives the total number of rules
2084: (including the default rule, even if
2085: .Fl s
2086: is used),
1.1 deraadt 2087: so a correct declaration for
1.16 jmc 2088: .Fa ctr
1.1 deraadt 2089: is:
1.16 jmc 2090: .Pp
2091: .Dl int ctr[YY_NUM_RULES];
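.Pp
Putting these pieces together, a minimal profiling scanner might look like
the following sketch.
The three rules and the
.Fn main
routine are illustrative only.
The array is given one extra element on the assumption that rule numbers
run from 1 up to
.Dv YY_NUM_RULES
inclusive, and the macro is written with a trailing semicolon
so that its expansion forms a complete statement
(harmless if the scanner supplies one of its own):
.Bd -literal -offset indent
%option noyywrap
%{
#include <stdio.h>

/* Assumption: one extra slot, since rule numbers start at 1. */
static int ctr[YY_NUM_RULES + 1];

/* Count each match before the rule's own action runs. */
#define YY_USER_ACTION ++ctr[yy_act];
%}
%%
[a-z]+	;	/* rule 1: words */
[0-9]+	;	/* rule 2: numbers */
\&.|\en	;	/* rule 3: anything else */
%%
int
main(void)
{
	int i;

	yylex();
	for (i = 1; i <= YY_NUM_RULES; i++)
		printf("rule %d: %d\en", i, ctr[i]);
	return 0;
}
.Ed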
2092: .Pp
1.1 deraadt 2093: The macro
1.16 jmc 2094: .Dv YY_USER_INIT
1.1 deraadt 2095: may be defined to provide an action which is always executed before
1.16 jmc 2096: the first scan
2097: .Pq and before the scanner's internal initializations are done .
1.1 deraadt 2098: For example, it could be used to call a routine to read
2099: in a data table or open a logging file.
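.Pp
As a sketch of the logging case, the first section of a scanner might
contain the following; the variable
.Fa logfp
and the file name
.Pa scanner.log
are hypothetical:
.Bd -literal -offset indent
%{
#include <stdio.h>

static FILE *logfp;	/* hypothetical log stream */

/* Executed once, before the first scan. */
#define YY_USER_INIT logfp = fopen("scanner.log", "w");
%}
.Ed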
1.16 jmc 2100: .Pp
1.1 deraadt 2101: The macro
1.16 jmc 2102: .Dv yy_set_interactive(is_interactive)
1.1 deraadt 2103: can be used to control whether the current buffer is considered
1.16 jmc 2104: .Em interactive .
1.1 deraadt 2105: An interactive buffer is processed more slowly,
2106: but must be used when the scanner's input source is indeed
2107: interactive to avoid problems due to waiting to fill buffers
2108: (see the discussion of the
1.16 jmc 2109: .Fl I
2110: flag below).
2111: A non-zero value in the macro invocation marks the buffer as interactive,
2112: a zero value as non-interactive.
2113: Note that use of this macro overrides
2114: .Dq %option always-interactive
2115: or
2116: .Dq %option never-interactive
2117: (see
2118: .Sx OPTIONS
2119: below).
2120: .Fn yy_set_interactive
1.1 deraadt 2121: must be invoked prior to beginning to scan the buffer that is
1.16 jmc 2122: .Pq or is not
2123: to be considered interactive.
2124: .Pp
1.1 deraadt 2125: The macro
1.16 jmc 2126: .Dv yy_set_bol(at_bol)
1.1 deraadt 2127: can be used to control whether the current buffer's scanning
2128: context for the next token match is done as though at the
1.16 jmc 2129: beginning of a line.
2130: A non-zero macro argument makes rules anchored with
2131: .Sq ^
2132: active, while a zero argument makes
2133: .Sq ^
2134: rules inactive.
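.Pp
For instance, a scanner could treat the text following some marker character
as though it began a new line.
In this sketch the
.Sq @
marker and both rules are purely illustrative:
.Bd -literal -offset indent
%%
^#.*	;		/* comment, matched only where '^' rules are active */
"@"	yy_set_bol(1);	/* let the next match use '^' rules */
.Ed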
2135: .Pp
1.1 deraadt 2136: The macro
1.16 jmc 2137: .Dv YY_AT_BOL
2138: returns true if the next token scanned from the current buffer will have
2139: .Sq ^
2140: rules active, false otherwise.
2141: .Pp
1.1 deraadt 2142: In the generated scanner, the actions are all gathered in one large
2143: switch statement and separated using
1.16 jmc 2144: .Dv YY_BREAK ,
2145: which may be redefined.
2146: By default, it is simply a
2147: .Qq break ,
2148: to separate each rule's action from the following rules.
1.1 deraadt 2149: Redefining
1.16 jmc 2150: .Dv YY_BREAK
1.1 deraadt 2151: allows, for example, C++ users to
1.16 jmc 2152: .Dq #define YY_BREAK
2153: to do nothing
2154: (while being very careful that every rule ends with a
2155: .Qq break
2156: or a
2157: .Qq return ! )
2158: to avoid suffering from unreachable statement warnings where, because a rule's
2159: action ends with
2160: .Dq return ,
2161: the
2162: .Dv YY_BREAK
1.1 deraadt 2163: is inaccessible.
1.16 jmc 2164: .Sh VALUES AVAILABLE TO THE USER
1.1 deraadt 2165: This section summarizes the various values available to the user
2166: in the rule actions.
1.16 jmc 2167: .Bl -tag -width Ds
2168: .It char *yytext
2169: Holds the text of the current token.
2170: It may be modified but not lengthened
2171: .Pq characters cannot be appended to the end .
2172: .Pp
1.1 deraadt 2173: If the special directive
1.16 jmc 2174: .Dq %array
1.1 deraadt 2175: appears in the first section of the scanner description, then
1.16 jmc 2176: .Fa yytext
1.1 deraadt 2177: is instead declared
1.16 jmc 2178: .Dq char yytext[YYLMAX] ,
1.1 deraadt 2179: where
1.16 jmc 2180: .Dv YYLMAX
2181: is a macro definition that can be redefined in the first section
2182: to change the default value
2183: .Pq generally 8KB .
2184: Using
2185: .Dq %array
1.1 deraadt 2186: results in somewhat slower scanners, but the value of
1.16 jmc 2187: .Fa yytext
1.1 deraadt 2188: becomes immune to calls to
1.16 jmc 2189: .Fn input
1.1 deraadt 2190: and
1.16 jmc 2191: .Fn unput ,
1.1 deraadt 2192: which potentially destroy its value when
1.16 jmc 2193: .Fa yytext
2194: is a character pointer.
2195: The opposite of
2196: .Dq %array
1.1 deraadt 2197: is
1.16 jmc 2198: .Dq %pointer ,
1.1 deraadt 2199: which is the default.
1.16 jmc 2200: .Pp
2201: .Dq %array
2202: cannot be used when generating C++ scanner classes
1.1 deraadt 2203: (the
1.16 jmc 2204: .Fl +
1.1 deraadt 2205: flag).
1.16 jmc 2206: .It int yyleng
2207: Holds the length of the current token.
2208: .It FILE *yyin
2209: Is the file which by default
2210: .Nm
2211: reads from.
2212: It may be redefined, but doing so only makes sense before
2213: scanning begins or after an
2214: .Dv EOF
2215: has been encountered.
2216: Changing it in the midst of scanning will have unexpected results since
2217: .Nm
1.1 deraadt 2218: buffers its input; use
1.16 jmc 2219: .Fn yyrestart
1.1 deraadt 2220: instead.
2221: Once scanning terminates because an end-of-file
1.16 jmc 2222: has been seen,
2223: .Fa yyin
2224: can be assigned as the new input file
2225: and the scanner can be called again to continue scanning.
2226: .It void yyrestart(FILE *new_file)
2227: May be called to point
2228: .Fa yyin
2229: at the new input file.
2230: The switch-over to the new file is immediate
2231: .Pq any previously buffered-up input is lost .
2232: Note that calling
2233: .Fn yyrestart
1.1 deraadt 2234: with
1.16 jmc 2235: .Fa yyin
1.1 deraadt 2236: as an argument thus throws away the current input buffer and continues
2237: scanning the same input file.
1.16 jmc 2238: .It FILE *yyout
2239: Is the file to which
2240: .Em ECHO
2241: actions are done.
2242: It can be reassigned by the user.
2243: .It YY_CURRENT_BUFFER
2244: Returns a
2245: .Dv YY_BUFFER_STATE
1.1 deraadt 2246: handle to the current buffer.
1.16 jmc 2247: .It YY_START
2248: Returns an integer value corresponding to the current start condition.
2249: This value can subsequently be used with
2250: .Em BEGIN
1.1 deraadt 2251: to return to that start condition.
1.16 jmc 2252: .El
2253: .Sh INTERFACING WITH YACC
1.1 deraadt 2254: One of the main uses of
1.16 jmc 2255: .Nm
1.1 deraadt 2256: is as a companion to the
1.16 jmc 2257: .Xr yacc 1
1.1 deraadt 2258: parser-generator.
1.16 jmc 2259: yacc parsers expect to call a routine named
2260: .Fn yylex
2261: to find the next input token.
2262: The routine is supposed to return the type of the next token
2263: as well as putting any associated value in the global
1.17 jmc 2264: .Fa yylval ,
2265: which is defined externally,
2266: and can be a union or any other complex data structure.
1.1 deraadt 2267: To use
1.16 jmc 2268: .Nm
2269: with yacc, one specifies the
2270: .Fl d
2271: option to yacc to instruct it to generate the file
2272: .Pa y.tab.h
1.1 deraadt 2273: containing definitions of all the
1.16 jmc 2274: .Dq %tokens
2275: appearing in the yacc input.
2276: This file is then included in the
2277: .Nm
2278: scanner.
2279: For example, if one of the tokens is
2280: .Qq TOK_NUMBER ,
1.1 deraadt 2281: part of the scanner might look like:
1.16 jmc 2282: .Bd -literal -offset indent
2283: %{
2284: #include "y.tab.h"
2285: %}
2286:
2287: %%
2288:
2289: [0-9]+ yylval = atoi(yytext); return TOK_NUMBER;
2290: .Ed
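.Pp
A typical build of such a scanner and parser might then proceed as in the
following sketch.
The file names
.Pa parse.y
and
.Pa scan.l
are hypothetical, and the grammar file is assumed to supply
.Fn main
and
.Fn yyerror :
.Bd -literal -offset indent
$ yacc -d parse.y			# writes y.tab.c and y.tab.h
$ flex scan.l				# writes lex.yy.c
$ cc -o parser y.tab.c lex.yy.c -lfl	# -lfl supplies yywrap()
.Ed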
2291: .Sh OPTIONS
2292: .Nm
1.1 deraadt 2293: has the following options:
1.16 jmc 2294: .Bl -tag -width Ds
2295: .It Fl 7
2296: Instructs
2297: .Nm
2298: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2299: characters in its input.
2300: The advantage of using
2301: .Fl 7
1.1 deraadt 2302: is that the scanner's tables can be up to half the size of those generated
2303: using the
1.16 jmc 2304: .Fl 8
2305: option
2306: .Pq see below .
2307: The disadvantage is that such scanners often hang
1.1 deraadt 2308: or crash if their input contains an 8-bit character.
1.16 jmc 2309: .Pp
2310: Note, however, that unless generating a scanner using the
2311: .Fl Cf
1.1 deraadt 2312: or
1.16 jmc 2313: .Fl CF
1.1 deraadt 2314: table compression options, use of
1.16 jmc 2315: .Fl 7
2316: will save only a small amount of table space,
2317: and make the scanner considerably less portable.
2318: .Nm flex Ns 's
2319: default behavior is to generate an 8-bit scanner unless
2320: .Fl Cf
2321: or
2322: .Fl CF
2323: is specified, in which case
2324: .Nm
2325: defaults to generating 7-bit scanners unless it was
2326: configured to generate 8-bit scanners
2327: (as will often be the case with non-USA sites).
2328: It is possible to tell whether
2329: .Nm
2330: generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2331: .Fl v
2332: output as described below.
2333: .Pp
2334: Note that if
2335: .Fl Cfe
2336: or
2337: .Fl CFe
2338: are used
2339: (the table compression options, but also using equivalence classes as
2340: discussed below),
2341: .Nm
2342: still defaults to generating an 8-bit scanner,
2343: since usually with these compression options full 8-bit tables
1.1 deraadt 2344: are not much more expensive than 7-bit tables.
1.16 jmc 2345: .It Fl 8
2346: Instructs
2347: .Nm
1.1 deraadt 2348: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1.16 jmc 2349: characters.
2350: This flag is only needed for scanners generated using
2351: .Fl Cf
1.1 deraadt 2352: or
1.16 jmc 2353: .Fl CF ,
2354: as otherwise
2355: .Nm
2356: defaults to generating an 8-bit scanner anyway.
2357: .Pp
1.1 deraadt 2358: See the discussion of
1.16 jmc 2359: .Fl 7
2360: above for
2361: .Nm flex Ns 's
2362: default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2363: .It Fl B
2364: Instructs
2365: .Nm
2366: to generate a
2367: .Em batch
2368: scanner, the opposite of
2369: .Em interactive
2370: scanners generated by
2371: .Fl I
2372: .Pq see below .
2373: In general,
2374: .Fl B
2375: is used when the scanner will never be used interactively,
2376: and you want to squeeze a little more performance out of it.
2377: If the aim is instead to squeeze out a lot more performance,
2378: use the
2379: .Fl Cf
2380: or
2381: .Fl CF
2382: options
2383: .Pq discussed below ,
2384: which turn on
2385: .Fl B
2386: automatically anyway.
2387: .It Fl b
2388: Generate backing-up information to
2389: .Pa lex.backup .
2390: This is a list of scanner states which require backing up
2391: and the input characters on which they do so.
2392: By adding rules one can remove backing-up states.
2393: If all backing-up states are eliminated and
2394: .Fl Cf
2395: or
2396: .Fl CF
2397: is used, the generated scanner will run faster (see the
2398: .Fl p
2399: flag).
2400: Only users who wish to squeeze every last cycle out of their
2401: scanners need worry about this option.
2402: (See the section on
2403: .Sx PERFORMANCE CONSIDERATIONS
2404: below.)
2405: .It Fl C Ns Op Cm aeFfmr
2406: Controls the degree of table compression and, more generally, trade-offs
1.1 deraadt 2407: between small scanners and fast scanners.
1.16 jmc 2408: .Bl -tag -width Ds
2409: .It Fl Ca
2410: Instructs
2411: .Nm
2412: to trade off larger tables in the generated scanner for faster performance
2413: because the elements of the tables are better aligned for memory access
2414: and computation.
2415: On some
2416: .Tn RISC
2417: architectures, fetching and manipulating longwords is more efficient
2418: than with smaller-sized units such as shortwords.
2419: This option can double the size of the tables used by the scanner.
2420: .It Fl Ce
2421: Directs
2422: .Nm
1.1 deraadt 2423: to construct
1.16 jmc 2424: .Em equivalence classes ,
2425: i.e., sets of characters which have identical lexical properties
2426: (for example, if the only appearance of digits in the
2427: .Nm
1.1 deraadt 2428: input is in the character class
1.16 jmc 2429: .Qq [0-9]
2430: then the digits
2431: .Sq 0 ,
2432: .Sq 1 ,
2433: .Sq ... ,
2434: .Sq 9
2435: will all be put in the same equivalence class).
2436: Equivalence classes usually give dramatic reductions in the final
2437: table/object file sizes
2438: .Pq typically a factor of 2\-5
2439: and are pretty cheap performance-wise
2440: .Pq one array look-up per character scanned .
2441: .It Fl CF
2442: Specifies that the alternate fast scanner representation
2443: (described below under the
2444: .Fl F
2445: option)
2446: should be used.
2447: This option cannot be used with
2448: .Fl + .
2449: .It Fl Cf
2450: Specifies that the
2451: .Em full
2452: scanner tables should be generated \-
2453: .Nm
2454: should not compress the tables by taking advantage of
2455: similar transition functions for different states.
2456: .It Fl \&Cm
2457: Directs
2458: .Nm
1.1 deraadt 2459: to construct
1.16 jmc 2460: .Em meta-equivalence classes ,
2461: which are sets of equivalence classes
2462: (or characters, if equivalence classes are not being used)
2463: that are commonly used together.
2464: Meta-equivalence classes are often a big win when using compressed tables,
2465: but they have a moderate performance impact
2466: (one or two
2467: .Qq if
2468: tests and one array look-up per character scanned).
2469: .It Fl Cr
2470: Causes the generated scanner to
2471: .Em bypass
2472: use of the standard I/O library
2473: .Pq stdio
2474: for input.
2475: Instead of calling
2476: .Xr fread 3
1.1 deraadt 2477: or
1.16 jmc 2478: .Xr getc 3 ,
1.1 deraadt 2479: the scanner will use the
1.16 jmc 2480: .Xr read 2
2481: system call,
2482: resulting in a performance gain which varies from system to system,
2483: but in general is probably negligible unless
2484: .Fl Cf
1.1 deraadt 2485: or
1.16 jmc 2486: .Fl CF
2487: is being used.
1.1 deraadt 2488: Using
1.16 jmc 2489: .Fl Cr
2490: can cause strange behavior if, for example, the program reads from
2491: .Fa yyin
2492: using stdio prior to calling the scanner
2493: (because the scanner will miss whatever text previous reads left
2494: in the stdio input buffer).
2495: .Pp
2496: .Fl Cr
2497: has no effect if
2498: .Dv YY_INPUT
2499: is defined
2500: (see
2501: .Sx THE GENERATED SCANNER
2502: above).
2503: .El
2504: .Pp
1.1 deraadt 2505: A lone
1.16 jmc 2506: .Fl C
1.1 deraadt 2507: specifies that the scanner tables should be compressed but neither
2508: equivalence classes nor meta-equivalence classes should be used.
1.16 jmc 2509: .Pp
1.1 deraadt 2510: The options
1.16 jmc 2511: .Fl Cf
1.1 deraadt 2512: or
1.16 jmc 2513: .Fl CF
1.1 deraadt 2514: and
1.16 jmc 2515: .Fl \&Cm
2516: do not make sense together \- there is no opportunity for meta-equivalence
2517: classes if the table is not being compressed.
2518: Otherwise the options may be freely mixed, and are cumulative.
2519: .Pp
1.1 deraadt 2520: The default setting is
1.16 jmc 2521: .Fl Cem
1.1 deraadt 2522: which specifies that
1.16 jmc 2523: .Nm
2524: should generate equivalence classes and meta-equivalence classes.
2525: This setting provides the highest degree of table compression.
2526: Faster-executing scanners can be obtained at the cost of
2527: larger tables, with the following generally being true:
2528: .Bd -unfilled -offset indent
2529: slowest & smallest
2530: -Cem
2531: -Cm
2532: -Ce
2533: -C
2534: -C{f,F}e
2535: -C{f,F}
2536: -C{f,F}a
2537: fastest & largest
2538: .Ed
2539: .Pp
1.1 deraadt 2540: Note that scanners with the smallest tables are usually generated and
1.16 jmc 2541: compiled the quickest,
2542: so during development the default,
2543: maximal compression, is usually best.
2544: .Pp
2545: .Fl Cfe
2546: is often a good compromise between speed and size for production scanners.
2547: .It Fl c
2548: A do-nothing, deprecated option included for
2549: .Tn POSIX
2550: compliance.
2551: .It Fl d
2552: Makes the generated scanner run in debug mode.
2553: Whenever a pattern is recognized and the global
2554: .Fa yy_flex_debug
2555: is non-zero
2556: .Pq which is the default ,
2557: the scanner will write to stderr a line of the form:
2558: .Pp
2559: .D1 --accepting rule at line 53 ("the matched text")
2560: .Pp
2561: The line number refers to the location of the rule in the file
2562: defining the scanner
2563: (i.e., the file that was fed to
2564: .Nm ) .
2565: Messages are also generated when the scanner backs up,
2566: accepts the default rule,
2567: reaches the end of its input buffer
2568: (or encounters a NUL;
2569: at this point, the two look the same as far as the scanner's concerned),
2570: or reaches an end-of-file.
2571: .It Fl F
2572: Specifies that the fast scanner table representation should be used
2573: .Pq and stdio bypassed .
2574: This representation is about as fast as the full table representation
2575: .Pq Fl f ,
2576: and for some sets of patterns will be considerably smaller
2577: .Pq and for others, larger .
2578: In general, if the pattern set contains both
2579: .Qq keywords
2580: and a catch-all,
2581: .Qq identifier
2582: rule, such as in the set:
2583: .Bd -unfilled -offset indent
2584: "case" return TOK_CASE;
2585: "switch" return TOK_SWITCH;
2586: \&...
2587: "default" return TOK_DEFAULT;
2588: [a-z]+ return TOK_ID;
2589: .Ed
2590: .Pp
2591: then it's better to use the full table representation.
2592: If only the
2593: .Qq identifier
2594: rule is present and a hash table or some such is used to detect the keywords,
2595: it's better to use
2596: .Fl F .
2597: .Pp
2598: This option is equivalent to
2599: .Fl CFr
2600: .Pq see above .
2601: It cannot be used with
2602: .Fl + .
2603: .It Fl f
2604: Specifies
2605: .Em fast scanner .
2606: No table compression is done and stdio is bypassed.
2607: The result is large but fast.
2608: This option is equivalent to
2609: .Fl Cfr
2610: .Pq see above .
2611: .It Fl h
2612: Generates a help summary of
2613: .Nm flex Ns 's
2614: options to stdout and then exits.
2615: .Fl ?\&
2616: and
2617: .Fl Fl help
2618: are synonyms for
2619: .Fl h .
2620: .It Fl I
2621: Instructs
2622: .Nm
2623: to generate an
2624: .Em interactive
2625: scanner.
2626: An interactive scanner is one that only looks ahead to decide
2627: what token has been matched if it absolutely must.
2628: It turns out that always looking one extra character ahead,
2629: even if the scanner has already seen enough text
2630: to disambiguate the current token, is a bit faster than
2631: only looking ahead when necessary.
2632: But scanners that always look ahead give dreadful interactive performance;
2633: for example, when a user types a newline,
2634: it is not recognized as a newline token until they enter
2635: .Em another
2636: token, which often means typing in another whole line.
2637: .Pp
2638: .Nm
2639: scanners default to
2640: .Em interactive
2641: unless
2642: .Fl Cf
2643: or
2644: .Fl CF
2645: table-compression options are specified
2646: .Pq see above .
2647: That's because if high performance is most important,
2648: one of these options should be used,
2649: so if they weren't,
2650: .Nm
2651: assumes it is preferable to trade off a bit of run-time performance for
2652: intuitive interactive behavior.
2653: Note also that
2654: .Fl I
2655: cannot be used in conjunction with
2656: .Fl Cf
2657: or
2658: .Fl CF .
2659: Thus, this option is not really needed; it is on by default for all those
2660: cases in which it is allowed.
2661: .Pp
2662: A scanner can be forced to not be interactive by using
2663: .Fl B
2664: .Pq see above .
2665: .It Fl i
2666: Instructs
2667: .Nm
2668: to generate a case-insensitive scanner.
2669: The case of letters given in the
2670: .Nm
2671: input patterns will be ignored,
2672: and tokens in the input will be matched regardless of case.
2673: The matched text given in
2674: .Fa yytext
2675: will have the preserved case
2676: .Pq i.e., it will not be folded .
2677: .It Fl L
2678: Instructs
2679: .Nm
2680: not to generate
2681: .Dq #line
2682: directives.
2683: Without this option,
2684: .Nm
2685: peppers the generated scanner with #line directives so error messages
2686: in the actions will be correctly located with respect to either the original
2687: .Nm
2688: input file
2689: (if the errors are due to code in the input file),
2690: or
2691: .Pa lex.yy.c
2692: (if the errors are
2693: .Nm flex Ns 's
2694: fault \- these sorts of errors should be reported to the email address
2695: given below).
2696: .It Fl l
2697: Turns on maximum compatibility with the original AT&T
2698: .Nm lex
2699: implementation.
2700: Note that this does not mean full compatibility.
2701: Use of this option costs a considerable amount of performance,
2702: and it cannot be used with the
2703: .Fl + , f , F , Cf ,
2704: or
2705: .Fl CF
2706: options.
2707: For details on the compatibilities it provides, see the section
2708: .Sx INCOMPATIBILITIES WITH LEX AND POSIX
2709: below.
2710: This option also results in the name
2711: .Dv YY_FLEX_LEX_COMPAT
2712: being #define'd in the generated scanner.
2713: .It Fl n
2714: Another do-nothing, deprecated option included only for
2715: .Tn POSIX
2716: compliance.
2717: .It Fl o Ns Ar output
2718: Directs
2719: .Nm
2720: to write the scanner to the file
2721: .Ar output
1.1 deraadt 2722: instead of
1.16 jmc 2723: .Pa lex.yy.c .
2724: If
2725: .Fl o
2726: is combined with the
2727: .Fl t
2728: option, then the scanner is written to stdout but its
2729: .Dq #line
2730: directives
2731: (see the
2732: .Fl L
2733: option above)
2734: refer to the file
2735: .Ar output .
2736: .It Fl P Ns Ar prefix
2737: Changes the default
2738: .Qq yy
1.1 deraadt 2739: prefix used by
1.16 jmc 2740: .Nm
1.6 aaron 2741: for all globally visible variable and function names to instead be
1.16 jmc 2742: .Ar prefix .
1.1 deraadt 2743: For example,
1.16 jmc 2744: .Fl P Ns Ar foo
1.1 deraadt 2745: changes the name of
1.16 jmc 2746: .Fa yytext
1.1 deraadt 2747: to
1.16 jmc 2748: .Fa footext .
1.1 deraadt 2749: It also changes the name of the default output file from
1.16 jmc 2750: .Pa lex.yy.c
1.1 deraadt 2751: to
1.16 jmc 2752: .Pa lex.foo.c .
1.1 deraadt 2753: Here are all of the names affected:
1.16 jmc 2754: .Bd -unfilled -offset indent
2755: yy_create_buffer
2756: yy_delete_buffer
2757: yy_flex_debug
2758: yy_init_buffer
2759: yy_flush_buffer
2760: yy_load_buffer_state
2761: yy_switch_to_buffer
2762: yyin
2763: yyleng
2764: yylex
2765: yylineno
2766: yyout
2767: yyrestart
2768: yytext
2769: yywrap
2770: .Ed
2771: .Pp
2772: (If using a C++ scanner, then only
2773: .Fa yywrap
1.1 deraadt 2774: and
1.16 jmc 2775: .Fa yyFlexLexer
1.1 deraadt 2776: are affected.)
1.16 jmc 2777: Within the scanner itself, it is still possible to refer to the global variables
1.1 deraadt 2778: and functions using either version of their name; but externally, they
2779: have the modified name.
1.16 jmc 2780: .Pp
2781: This option allows multiple
2782: .Nm
2783: programs to be easily linked together into the same executable.
2784: Note, though, that using this option also renames
2785: .Fn yywrap ,
2786: so now either an
2787: .Pq appropriately named
2788: version of the routine for the scanner must be supplied, or
2789: .Dq %option noyywrap
2790: must be used, as linking with
2791: .Fl lfl
2792: no longer provides one by default.
2793: .It Fl p
2794: Generates a performance report to stderr.
2795: The report consists of comments regarding features of the
2796: .Nm
2797: input file which will cause a serious loss of performance in the resulting
2798: scanner.
2799: If the flag is specified twice,
2800: comments regarding features that lead to minor performance losses
2801: will also be reported.
2802: .Pp
2803: Note that the use of
2804: .Em REJECT ,
2805: .Dq %option yylineno ,
2806: and variable trailing context
2807: (see the
2808: .Sx BUGS
2809: section below)
2810: entails a substantial performance penalty; use of
2811: .Fn yymore ,
2812: the
2813: .Sq ^
2814: operator, and the
2815: .Fl I
2816: flag entail minor performance penalties.
2817: .It Fl S Ns Ar skeleton
2818: Overrides the default skeleton file from which
2819: .Nm
2820: constructs its scanners.
2821: This option is needed only for
2822: .Nm
1.1 deraadt 2823: maintenance or development.
1.16 jmc 2824: .It Fl s
2825: Causes the default rule
2826: .Pq that unmatched scanner input is echoed to stdout
2827: to be suppressed.
2828: If the scanner encounters input that does not
2829: match any of its rules, it aborts with an error.
2830: This option is useful for finding holes in a scanner's rule set.
2831: .It Fl T
2832: Makes
2833: .Nm
2834: run in
2835: .Em trace
2836: mode.
2837: It will generate a lot of messages to stderr concerning
2838: the form of the input and the resultant non-deterministic and deterministic
2839: finite automata.
2840: This option is mostly for use in maintaining
2841: .Nm .
2842: .It Fl t
2843: Instructs
2844: .Nm
2845: to write the scanner it generates to standard output instead of
2846: .Pa lex.yy.c .
2847: .It Fl V
2848: Prints the version number to stdout and exits.
2849: .Fl Fl version
2850: is a synonym for
2851: .Fl V .
2852: .It Fl v
2853: Specifies that
2854: .Nm
2855: should write to stderr
2856: a summary of statistics regarding the scanner it generates.
2857: Most of the statistics are meaningless to the casual
2858: .Nm
2859: user, but the first line identifies the version of
2860: .Nm
2861: (same as reported by
2862: .Fl V ) ,
2863: and the next line the flags used when generating the scanner,
2864: including those that are on by default.
2865: .It Fl w
2866: Suppresses warning messages.
2867: .It Fl +
2868: Specifies that
2869: .Nm
2870: should generate a C++ scanner class.
2871: See the section on
2872: .Sx GENERATING C++ SCANNERS
2873: below for details.
2874: .El
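.Pp
As an illustration of the
.Fl P
option described above, a separate C file might drive a scanner generated with
.Dq flex -Pfoo scan.l
as sketched here.
The file name
.Pa scan.l
is hypothetical, and the scanner is assumed to use
.Dq %option noyywrap
so that no
.Fn foowrap
routine is needed:
.Bd -literal -offset indent
#include <stdio.h>

extern FILE *fooin;		/* yyin, renamed by -Pfoo */
extern int foolex(void);	/* yylex, renamed by -Pfoo */

int
main(int argc, char *argv[])
{
	/* With no argument, fooin stays unset and the scanner reads stdin. */
	if (argc > 1 && (fooin = fopen(argv[1], "r")) == NULL) {
		perror(argv[1]);
		return 1;
	}
	while (foolex() != 0)
		continue;
	return 0;
}
.Ed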
2875: .Pp
2876: .Nm
1.1 deraadt 2877: also provides a mechanism for controlling options within the
1.16 jmc 2878: scanner specification itself, rather than from the
2879: .Nm
2880: command-line.
1.1 deraadt 2881: This is done by including
1.16 jmc 2882: .Dq %option
1.1 deraadt 2883: directives in the first section of the scanner specification.
1.16 jmc 2884: Multiple options can be specified with a single
2885: .Dq %option
2886: directive, and multiple directives may appear in the first section of the
2887: .Nm
2888: input file.
2889: .Pp
2890: Most options are given simply as names, optionally preceded by the word
2891: .Qq no
2892: .Pq with no intervening whitespace
2893: to negate their meaning.
2894: A number are equivalent to
2895: .Nm
2896: flags or their negation:
2897: .Bd -unfilled -offset indent
2898: 7bit -7 option
2899: 8bit -8 option
2900: align -Ca option
2901: backup -b option
2902: batch -B option
2903: c++ -+ option
2904:
2905: caseful or
2906: case-sensitive opposite of -i (default)
2907:
2908: case-insensitive or
2909: caseless -i option
2910:
2911: debug -d option
2912: default opposite of -s option
2913: ecs -Ce option
2914: fast -F option
2915: full -f option
2916: interactive -I option
2917: lex-compat -l option
2918: meta-ecs -Cm option
2919: perf-report -p option
2920: read -Cr option
2921: stdout -t option
2922: verbose -v option
2923: warn opposite of -w option
2924: (use "%option nowarn" for -w)
2925:
2926: array equivalent to "%array"
2927: pointer equivalent to "%pointer" (default)
2928: .Ed
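.Pp
For example, the following directives, offered only as a sketch,
request an 8-bit scanner with the default rule suppressed
.Pq as with Fl s ,
line-number tracking, and no call to
.Fn yywrap :
.Bd -literal -offset indent
%option 8bit nodefault
%option yylineno noyywrap
.Ed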
2929: .Pp
2930: Some %option directives provide features not otherwise available:
2931: .Bl -tag -width Ds
2932: .It always-interactive
2933: Instructs
2934: .Nm
2935: to generate a scanner which always considers its input
2936: .Qq interactive .
2937: Normally, on each new input file the scanner calls
2938: .Fn isatty
2939: in an attempt to determine whether the scanner's input source is interactive
2940: and thus should be read a character at a time.
2941: When this option is used, however, no such call is made.
2942: .It main
2943: Directs
2944: .Nm
2945: to provide a default
2946: .Fn main
1.1 deraadt 2947: program for the scanner, which simply calls
1.16 jmc 2948: .Fn yylex .
1.1 deraadt 2949: This option implies
1.16 jmc 2950: .Dq noyywrap
2951: .Pq see below .
2952: .It never-interactive
2953: Instructs
2954: .Nm
2955: to generate a scanner which never considers its input
2956: .Qq interactive
2957: (again, no call made to
2958: .Fn isatty ) .
1.1 deraadt 2959: This is the opposite of
1.16 jmc 2960: .Dq always-interactive .
2961: .It stack
2962: Enables the use of start condition stacks
2963: (see
2964: .Sx START CONDITIONS
2965: above).
2966: .It stdinit
2967: If set (i.e.,
2968: .Dq %option stdinit ) ,
1.1 deraadt 2969: initializes
1.16 jmc 2970: .Fa yyin
1.1 deraadt 2971: and
1.16 jmc 2972: .Fa yyout
2973: to stdin and stdout, instead of the default of
2974: .Dq nil .
1.1 deraadt 2975: Some existing
1.16 jmc 2976: .Nm lex
2977: programs depend on this behavior, even though it is not compliant with ANSI C,
2978: which does not require stdin and stdout to be compile-time constant.
2979: .It yylineno
2980: Directs
2981: .Nm
1.1 deraadt 2982: to generate a scanner that maintains the number of the current line
2983: read from its input in the global variable
1.16 jmc 2984: .Fa yylineno .
1.1 deraadt 2985: This option is implied by
1.16 jmc 2986: .Dq %option lex-compat .
2987: .It yywrap
2988: If unset (i.e.,
2989: .Dq %option noyywrap ) ,
1.1 deraadt 2990: makes the scanner not call
1.16 jmc 2991: .Fn yywrap
2992: upon an end-of-file, but simply assume that there are no more files to scan
2993: (until the user points
2994: .Fa yyin
1.1 deraadt 2995: at a new file and calls
1.16 jmc 2996: .Fn yylex
1.1 deraadt 2997: again).
1.16 jmc 2998: .El
2999: .Pp
3000: .Nm
3001: scans rule actions to determine whether the
3002: .Em REJECT
3003: or
3004: .Fn yymore
3005: features are being used.
3006: The
3007: .Dq reject
1.1 deraadt 3008: and
1.16 jmc 3009: .Dq yymore
3010: options are available to override its decision as to whether to use the
1.1 deraadt 3011: options, either by setting them (e.g.,
1.16 jmc 3012: .Dq %option reject )
3013: to indicate the feature is indeed used,
3014: or unsetting them to indicate it actually is not used
1.1 deraadt 3015: (e.g.,
1.16 jmc 3016: .Dq %option noyymore ) .
3017: .Pp
3018: Three options take string-delimited values, offset with
3019: .Sq = :
3020: .Pp
3021: .D1 %option outfile="ABC"
3022: .Pp
1.1 deraadt 3023: is equivalent to
1.16 jmc 3024: .Fl o Ns Ar ABC ,
1.1 deraadt 3025: and
1.16 jmc 3026: .Pp
3027: .D1 %option prefix="XYZ"
3028: .Pp
1.1 deraadt 3029: is equivalent to
1.16 jmc 3030: .Fl P Ns Ar XYZ .
1.1 deraadt 3031: Finally,
1.16 jmc 3032: .Pp
3033: .D1 %option yyclass="foo"
3034: .Pp
3035: only applies when generating a C++ scanner
3036: .Pf ( Fl +
3037: option).
3038: It informs
3039: .Nm
3040: that
3041: .Dq foo
3042: has been derived as a subclass of yyFlexLexer, so
3043: .Nm
3044: will place actions in the member function
3045: .Dq foo::yylex()
1.1 deraadt 3046: instead of
1.16 jmc 3047: .Dq yyFlexLexer::yylex() .
1.1 deraadt 3048: It also generates a
1.16 jmc 3049: .Dq yyFlexLexer::yylex()
1.1 deraadt 3050: member function that emits a run-time error (by invoking
1.16 jmc 3051: .Dq yyFlexLexer::LexerError() )
1.1 deraadt 3052: if called.
1.16 jmc 3053: See
3054: .Sx GENERATING C++ SCANNERS ,
3055: below, for additional information.
3056: .Pp
3057: A number of options are available for
3058: .Xr lint 1
3059: purists who want to suppress the appearance of unneeded routines
3060: in the generated scanner.
3061: Each of the following, if unset
1.1 deraadt 3062: (e.g.,
1.16 jmc 3063: .Dq %option nounput ) ,
3064: results in the corresponding routine not appearing in the generated scanner:
3065: .Bd -unfilled -offset indent
3066: input, unput
3067: yy_push_state, yy_pop_state, yy_top_state
3068: yy_scan_buffer, yy_scan_bytes, yy_scan_string
3069: .Ed
3070: .Pp
1.1 deraadt 3071: (though
1.16 jmc 3072: .Fn yy_push_state
3073: and friends won't appear anyway unless
3074: .Dq %option stack
3075: is being used).
3076: .Sh PERFORMANCE CONSIDERATIONS
1.1 deraadt 3077: The main design goal of
1.16 jmc 3078: .Nm
3079: is that it generate high-performance scanners.
3080: It has been optimized for dealing well with large sets of rules.
3081: Aside from the effects on scanner speed of the table compression
3082: .Fl C
1.1 deraadt 3083: options outlined above,
1.16 jmc 3084: there are a number of options/actions which degrade performance.
3085: These are, from most expensive to least:
3086: .Bd -unfilled -offset indent
3087: REJECT
3088: %option yylineno
3089: arbitrary trailing context
3090:
3091: pattern sets that require backing up
3092: %array
3093: %option interactive
3094: %option always-interactive
3095:
3096: \&'^' beginning-of-line operator
3097: yymore()
3098: .Ed
3099: .Pp
3100: with the first three all being quite expensive
3101: and the last two being quite cheap.
3102: Note also that
3103: .Fn unput
3104: is implemented as a routine call that potentially does quite a bit of work,
3105: while
3106: .Fn yyless
3107: is a quite-cheap macro; so if just putting back some excess text,
3108: use
3109: .Fn yyless .
3110: .Pp
3111: .Em REJECT
1.1 deraadt 3112: should be avoided at all costs when performance is important.
3113: It is a particularly expensive option.
1.16 jmc 3114: .Pp
1.1 deraadt 3115: Getting rid of backing up is messy and often may be an enormous
1.16 jmc 3116: amount of work for a complicated scanner.
3117: In principle, one begins by using the
3118: .Fl b
1.1 deraadt 3119: flag to generate a
1.16 jmc 3120: .Pa lex.backup
3121: file.
3122: For example, on the input
3123: .Bd -literal -offset indent
3124: %%
3125: foo return TOK_KEYWORD;
3126: foobar return TOK_KEYWORD;
3127: .Ed
3128: .Pp
1.1 deraadt 3129: the file looks like:
1.16 jmc 3130: .Bd -literal -offset indent
3131: State #6 is non-accepting -
3132: associated rule line numbers:
3133: 2 3
3134: out-transitions: [ o ]
3135: jam-transitions: EOF [ \e001-n p-\e177 ]
3136:
3137: State #8 is non-accepting -
3138: associated rule line numbers:
3139: 3
3140: out-transitions: [ a ]
3141: jam-transitions: EOF [ \e001-` b-\e177 ]
3142:
3143: State #9 is non-accepting -
3144: associated rule line numbers:
3145: 3
3146: out-transitions: [ r ]
3147: jam-transitions: EOF [ \e001-q s-\e177 ]
3148:
3149: Compressed tables always back up.
3150: .Ed
3151: .Pp
1.1 deraadt 3152: The first few lines tell us that there's a scanner state in
1.16 jmc 3153: which it can make a transition on an
3154: .Sq o
3155: but not on any other character,
3156: and that in that state the currently scanned text does not match any rule.
3157: The state occurs when trying to match the rules found
1.1 deraadt 3158: at lines 2 and 3 in the input file.
1.16 jmc 3159: If the scanner is in that state and then reads something other than an
3160: .Sq o ,
3161: it will have to back up to find a rule which is matched.
3162: With a bit of head-scratching one can see that this must be the
3163: state it's in when it has seen
3164: .Sq fo .
3165: When this has happened, if anything other than another
3166: .Sq o
3167: is seen, the scanner will have to back up to simply match the
3168: .Sq f
3169: .Pq by the default rule .
3170: .Pp
3171: The comment regarding State #8 indicates there's a problem when
3172: .Qq foob
3173: has been scanned.
3174: Indeed, on any character other than an
3175: .Sq a ,
3176: the scanner will have to back up to accept
3177: .Qq foo .
3178: Similarly, the comment for State #9 concerns when
3179: .Qq fooba
3180: has been scanned and an
3181: .Sq r
3182: does not follow.
3183: .Pp
1.1 deraadt 3184: The final comment reminds us that there's no point going to
1.16 jmc 3185: all the trouble of removing backing up from the rules unless we're using
3186: .Fl Cf
1.1 deraadt 3187: or
1.16 jmc 3188: .Fl CF ,
1.1 deraadt 3189: since there's no performance gain doing so with compressed scanners.
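.Pp
The backing up described above can be illustrated outside of flex
with a small hand-written matcher.
The following C sketch is not the code that flex generates;
scan_once and struct scan_result are hypothetical names
introduced only for this illustration.
It assumes the two rules "foo" and "foobar" plus the default rule,
and reports how many characters were read beyond the token
that is finally accepted.

```c
#include <stddef.h>

/* Outcome of one match attempt at the start of the input. */
struct scan_result {
	size_t matched;		/* length of the token that is accepted */
	size_t backed_up;	/* characters read beyond that token */
};

/*
 * Hypothetical matcher for the rules "foo" and "foobar", plus a default
 * rule that accepts any single character.
 */
static struct scan_result
scan_once(const char *input)
{
	static const char longest[] = "foobar";
	const size_t longest_len = sizeof(longest) - 1;
	struct scan_result r = { 0, 0 };
	size_t pos, last_accept;

	if (input[0] == '\0')
		return r;	/* end of input, nothing to match */

	/* The default rule accepts the first character, whatever it is. */
	pos = 1;
	last_accept = 1;

	if (input[0] == 'f') {
		/* Keep reading while the text is still a prefix of "foobar". */
		while (pos < longest_len && input[pos] == longest[pos]) {
			pos++;
			if (pos == 3 || pos == longest_len)
				last_accept = pos;	/* "foo" or "foobar" */
		}
	}

	r.matched = last_accept;
	r.backed_up = pos - last_accept;
	return r;
}
```

With this sketch, scan_once("fox") accepts one character and backs up over one,
scan_once("foobaz") accepts "foo" and backs up over two,
and scan_once("foobar") accepts all six characters without backing up.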
1.16 jmc 3190: .Pp
3191: The way to remove the backing up is to add
3192: .Qq error
3193: rules:
3194: .Bd -literal -offset indent
3195: %%
3196: foo return TOK_KEYWORD;
3197: foobar return TOK_KEYWORD;
3198:
3199: fooba |
3200: foob |
3201: fo {
3202: /* false alarm, not really a keyword */
3203: return TOK_ID;
3204: }
3205: .Ed
3206: .Pp
3207: Eliminating backing up among a list of keywords can also be done using a
3208: .Qq catch-all
3209: rule:
3210: .Bd -literal -offset indent
3211: %%
3212: foo return TOK_KEYWORD;
3213: foobar return TOK_KEYWORD;
3214:
3215: [a-z]+ return TOK_ID;
3216: .Ed
3217: .Pp
1.1 deraadt 3218: This is usually the best solution when appropriate.
1.16 jmc 3219: .Pp
1.1 deraadt 3220: Backing up messages tend to cascade.
1.16 jmc 3221: With a complicated set of rules it's not uncommon to get hundreds of messages.
3222: If one can decipher them, though,
3223: it often only takes a dozen or so rules to eliminate the backing up
3224: (though it's easy to make a mistake and have an error rule accidentally match
3225: a valid token; a possible future
3226: .Nm
1.1 deraadt 3227: feature will be to automatically add rules to eliminate backing up).
1.16 jmc 3228: .Pp
3229: It's important to keep in mind that the benefits of eliminating
3230: backing up are gained only if
3231: .Em every
3232: instance of backing up is eliminated.
3233: Leaving just one gains nothing.
3234: .Pp
3235: .Em Variable
3236: trailing context
3237: (where both the leading and trailing parts do not have a fixed length)
3238: entails almost the same performance loss as
3239: .Em REJECT
3240: .Pq i.e., substantial .
3241: So when possible a rule like:
3242: .Bd -literal -offset indent
3243: %%
3244: mouse|rat/(cat|dog) run();
3245: .Ed
3246: .Pp
1.1 deraadt 3247: is better written:
1.16 jmc 3248: .Bd -literal -offset indent
3249: %%
3250: mouse/cat|dog run();
3251: rat/cat|dog run();
3252: .Ed
3253: .Pp
1.1 deraadt 3254: or as
1.16 jmc 3255: .Bd -literal -offset indent
3256: %%
3257: mouse|rat/cat run();
3258: mouse|rat/dog run();
3259: .Ed
3260: .Pp
3261: Note that here the special
3262: .Sq |\&
3263: action does not provide any savings, and can even make things worse (see
3264: .Sx BUGS
3265: below).
3266: .Pp
1.1 deraadt 3267: Another area where the user can increase a scanner's performance
1.16 jmc 3268: .Pq and one that's easier to implement
3269: arises from the fact that the longer the tokens matched,
3270: the faster the scanner will run.
1.1 deraadt 3271: This is because with long tokens the processing of most input
1.16 jmc 3272: characters takes place in the
3273: .Pq short
3274: inner scanning loop, and does not often have to go through the additional work
3275: of setting up the scanning environment (e.g.,
3276: .Fa yytext )
3277: for the action.
3278: Recall the scanner for C comments:
3279: .Bd -literal -offset indent
3280: %x comment
3281: %%
3282: int line_num = 1;
3283:
3284: "/*" BEGIN(comment);
3285:
3286: <comment>[^*\en]*
3287: <comment>"*"+[^*/\en]*
3288: <comment>\en ++line_num;
3289: <comment>"*"+"/" BEGIN(INITIAL);
3290: .Ed
3291: .Pp
1.1 deraadt 3292: This could be sped up by writing it as:
1.16 jmc 3293: .Bd -literal -offset indent
3294: %x comment
3295: %%
3296: int line_num = 1;
3297:
3298: "/*" BEGIN(comment);
3299:
3300: <comment>[^*\en]*
3301: <comment>[^*\en]*\en ++line_num;
3302: <comment>"*"+[^*/\en]*
3303: <comment>"*"+[^*/\en]*\en ++line_num;
3304: <comment>"*"+"/" BEGIN(INITIAL);
3305: .Ed
3306: .Pp
3307: Now instead of each newline requiring the processing of another action,
3308: recognizing the newlines is
3309: .Qq distributed
3310: over the other rules to keep the matched text as long as possible.
3311: Note that adding rules does
3312: .Em not
3313: slow down the scanner!
3314: The speed of the scanner is independent of the number of rules or
3315: (modulo the considerations given at the beginning of this section)
3316: how complicated the rules are with regard to operators such as
3317: .Sq *
3318: and
3319: .Sq |\& .
3320: .Pp
3321: A final example in speeding up a scanner:
3322: scan through a file containing identifiers and keywords, one per line
3323: and with no other extraneous characters, and recognize all the keywords.
3324: A natural first approach is:
3325: .Bd -literal -offset indent
3326: %%
3327: asm |
3328: auto |
3329: break |
3330: \&... etc ...
3331: volatile |
3332: while /* it's a keyword */
3333:
3334: \&.|\en /* it's not a keyword */
3335: .Ed
3336: .Pp
1.1 deraadt 3337: To eliminate the backing up, introduce a catch-all rule:
1.16 jmc 3338: .Bd -literal -offset indent
3339: %%
3340: asm |
3341: auto |
3342: break |
3343: \&... etc ...
3344: volatile |
3345: while /* it's a keyword */
3346:
3347: [a-z]+ |
3348: \&.|\en /* it's not a keyword */
3349: .Ed
3350: .Pp
1.1 deraadt 3351: Now, if it's guaranteed that there's exactly one word per line,
3352: then we can reduce the total number of matches by a half by
1.16 jmc 3353: merging in the recognition of newlines with that of the other tokens:
3354: .Bd -literal -offset indent
3355: %%
3356: asm\en |
3357: auto\en |
3358: break\en |
3359: \&... etc ...
3360: volatile\en |
3361: while\en /* it's a keyword */
3362:
3363: [a-z]+\en |
3364: \&.|\en /* it's not a keyword */
3365: .Ed
3366: .Pp
3367: One has to be careful here,
3368: as we have now reintroduced backing up into the scanner.
3369: In particular, while we know that there will never be any characters
3370: in the input stream other than letters or newlines,
3371: .Nm
1.1 deraadt 3372: can't figure this out, and it will plan for possibly needing to back up
1.16 jmc 3373: when it has scanned a token like
3374: .Qq auto
3375: and then the next character is something other than a newline or a letter.
3376: Previously it would then just match the
3377: .Qq auto
3378: rule and be done, but now it has no
3379: .Qq auto
3380: rule, only an
3381: .Qq auto\en
3382: rule.
3383: To eliminate the possibility of backing up,
1.1 deraadt 3384: we could either duplicate all rules but without final newlines, or,
3385: since we never expect to encounter such an input and therefore don't
1.16 jmc 3386: care how it's classified, we can introduce one more catch-all rule,
3387: this one without a trailing newline:
3388: .Bd -literal -offset indent
3389: %%
3390: asm\en |
3391: auto\en |
3392: break\en |
3393: \&... etc ...
3394: volatile\en |
3395: while\en /* it's a keyword */
3396:
3397: [a-z]+\en |
3398: [a-z]+ |
3399: \&.|\en /* it's not a keyword */
3400: .Ed
3401: .Pp
1.1 deraadt 3402: Compiled with
1.16 jmc 3403: .Fl Cf ,
1.1 deraadt 3404: this is about as fast as one can get a
1.16 jmc 3405: .Nm
1.1 deraadt 3406: scanner to go for this particular problem.
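.Pp
The claim that merging the newline into each token halves the number of
matches can be checked with a small counting sketch.
The C function below does not use flex; count_matches is a hypothetical
helper that imitates the longest-match behavior of the catch-all rules
from the examples above,
assuming ASCII input in which lowercase letters are contiguous.

```c
#include <stddef.h>

/*
 * Count how many rule matches a scanner would make on input.
 * With merge_newline == 0 the rules are [a-z]+ and .|\n;
 * with merge_newline != 0 they are [a-z]+\n, [a-z]+ and .|\n.
 */
static size_t
count_matches(const char *input, int merge_newline)
{
	size_t matches = 0;
	const char *p = input;

	while (*p != '\0') {
		if (*p >= 'a' && *p <= 'z') {
			/* Longest run of lowercase letters. */
			while (*p >= 'a' && *p <= 'z')
				p++;
			/* Fold the newline into the same match if allowed. */
			if (merge_newline && *p == '\n')
				p++;
		} else {
			p++;	/* single-character fallback rule */
		}
		matches++;
	}
	return matches;
}
```

For the input "auto\nbreak\nwhile\n" the sketch counts six matches with
separate newline handling and three when the newline is merged.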
1.16 jmc 3407: .Pp
1.1 deraadt 3408: A final note:
1.16 jmc 3409: .Nm
3410: is slow when matching NUL's,
3411: particularly when a token contains multiple NUL's.
3412: It's best to write rules which match short
1.1 deraadt 3413: amounts of text if it's anticipated that the text will often include NUL's.
1.16 jmc 3414: .Pp
1.1 deraadt 3415: Another final note regarding performance: as mentioned above in the section
1.16 jmc 3416: .Sx HOW THE INPUT IS MATCHED ,
3417: dynamically resizing
3418: .Fa yytext
1.1 deraadt 3419: to accommodate huge tokens is a slow process because it presently requires that
1.16 jmc 3420: the
3421: .Pq huge
3422: token be rescanned from the beginning.
3423: Thus if performance is vital, it is better to attempt to match
3424: .Qq large
3425: quantities of text but not
3426: .Qq huge
3427: quantities, where the cutoff between the two is at about 8K characters/token.
3428: .Sh GENERATING C++ SCANNERS
3429: .Nm
3430: provides two different ways to generate scanners for use with C++.
3431: The first way is to simply compile a scanner generated by
3432: .Nm
3433: using a C++ compiler instead of a C compiler.
3434: This should not generate any compilation errors
3435: (please report any found to the email address given in the
3436: .Sx AUTHORS
3437: section below).
3438: C++ code can then be used in rule actions instead of C code.
3439: Note that the default input source for scanners remains
3440: .Fa yyin ,
1.1 deraadt 3441: and default echoing is still done to
1.16 jmc 3442: .Fa yyout .
1.1 deraadt 3443: Both of these remain
1.16 jmc 3444: .Fa FILE *
3445: variables and not C++ streams.
3446: .Pp
3447: .Nm
3448: can also be used to generate a C++ scanner class, using the
3449: .Fl +
1.1 deraadt 3450: option (or, equivalently,
1.16 jmc 3451: .Dq %option c++ ) ,
3452: which is automatically specified if the name of the flex executable ends in a
3453: .Sq + ,
3454: such as
3455: .Nm flex++ .
3456: When using this option,
3457: .Nm
3458: defaults to generating the scanner to the file
3459: .Pa lex.yy.cc
1.1 deraadt 3460: instead of
1.16 jmc 3461: .Pa lex.yy.c .
1.1 deraadt 3462: The generated scanner includes the header file
1.16 jmc 3463: .Aq Pa g++/FlexLexer.h ,
1.1 deraadt 3464: which defines the interface to two C++ classes.
1.16 jmc 3465: .Pp
1.1 deraadt 3466: The first class,
1.16 jmc 3467: .Em FlexLexer ,
3468: provides an abstract base class defining the general scanner class interface.
3469: It provides the following member functions:
3470: .Bl -tag -width Ds
3471: .It const char* YYText()
3472: Returns the text of the most recently matched token, the equivalent of
3473: .Fa yytext .
3474: .It int YYLeng()
3475: Returns the length of the most recently matched token, the equivalent of
3476: .Fa yyleng .
3477: .It int lineno() const
3478: Returns the current input line number
1.1 deraadt 3479: (see
1.16 jmc 3480: .Dq %option yylineno ) ,
3481: or 1 if
3482: .Dq %option yylineno
1.1 deraadt 3483: was not used.
1.16 jmc 3484: .It void set_debug(int flag)
3485: Sets the debugging flag for the scanner, equivalent to assigning to
3486: .Fa yy_flex_debug
3487: (see the
3488: .Sx OPTIONS
3489: section above).
3490: Note that the scanner must be built using
3491: .Dq %option debug
1.1 deraadt 3492: to include debugging information in it.
1.16 jmc 3493: .It int debug() const
3494: Returns the current setting of the debugging flag.
3495: .El
3496: .Pp
1.1 deraadt 3497: Also provided are member functions equivalent to
1.16 jmc 3498: .Fn yy_switch_to_buffer ,
3499: .Fn yy_create_buffer
1.1 deraadt 3500: (though the first argument is an
1.18 ! espie 3501: .Fa std::istream*
1.1 deraadt 3502: object pointer and not a
1.16 jmc 3503: .Fa FILE* ) ,
3504: .Fn yy_flush_buffer ,
3505: .Fn yy_delete_buffer ,
1.1 deraadt 3506: and
1.16 jmc 3507: .Fn yyrestart
1.10 deraadt 3508: (again, the first argument is an
1.18 ! espie 3509: .Fa std::istream*
1.1 deraadt 3510: object pointer).
1.16 jmc 3511: .Pp
1.1 deraadt 3512: The second class defined in
1.16 jmc 3513: .Aq Pa g++/FlexLexer.h
1.1 deraadt 3514: is
1.16 jmc 3515: .Fa yyFlexLexer ,
1.1 deraadt 3516: which is derived from
1.16 jmc 3517: .Fa FlexLexer .
1.1 deraadt 3518: It defines the following additional member functions:
1.16 jmc 3519: .Bl -tag -width Ds
1.18 ! espie 3520: .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
1.16 jmc 3521: Constructs a
3522: .Fa yyFlexLexer
3523: object using the given streams for input and output.
3524: If not specified, the streams default to
3525: .Fa cin
1.1 deraadt 3526: and
1.16 jmc 3527: .Fa cout ,
1.1 deraadt 3528: respectively.
1.16 jmc 3529: .It virtual int yylex()
3530: Performs the same role as
3531: .Fn yylex
1.1 deraadt 3532: does for ordinary flex scanners: it scans the input stream, consuming
1.16 jmc 3533: tokens, until a rule's action returns a value.
3534: If subclass
3535: .Sq S
3536: is derived from
3537: .Fa yyFlexLexer ,
3538: in order to access the member functions and variables of
3539: .Sq S
1.1 deraadt 3540: inside
1.16 jmc 3541: .Fn yylex ,
3542: use
3543: .Dq %option yyclass="S"
1.1 deraadt 3544: to inform
1.16 jmc 3545: .Nm
3546: that the
3547: .Sq S
3548: subclass will be used instead of
3549: .Fa yyFlexLexer .
1.1 deraadt 3550: In this case, rather than generating
1.16 jmc 3551: .Dq yyFlexLexer::yylex() ,
3552: .Nm
1.1 deraadt 3553: generates
1.16 jmc 3554: .Dq S::yylex()
1.1 deraadt 3555: (and also generates a dummy
1.16 jmc 3556: .Dq yyFlexLexer::yylex()
1.1 deraadt 3557: that calls
1.16 jmc 3558: .Dq yyFlexLexer::LexerError()
1.1 deraadt 3559: if called).
1.18 ! espie 3560: .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
1.16 jmc 3561: Reassigns
3562: .Fa yyin
1.1 deraadt 3563: to
1.16 jmc 3564: .Fa new_in
3565: .Pq if non-nil
1.1 deraadt 3566: and
1.16 jmc 3567: .Fa yyout
1.1 deraadt 3568: to
1.16 jmc 3569: .Fa new_out
3570: .Pq ditto ,
3571: deleting the previous input buffer if
3572: .Fa yyin
1.1 deraadt 3573: is reassigned.
1.18 ! espie 3574: .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
1.16 jmc 3575: First switches the input streams via
3576: .Dq switch_streams(new_in, new_out)
1.1 deraadt 3577: and then returns the value of
1.16 jmc 3578: .Fn yylex .
3579: .El
3580: .Pp
1.1 deraadt 3581: In addition,
1.16 jmc 3582: .Fa yyFlexLexer
3583: defines the following protected virtual functions which can be redefined
1.1 deraadt 3584: in derived classes to tailor the scanner:
1.16 jmc 3585: .Bl -tag -width Ds
3586: .It virtual int LexerInput(char* buf, int max_size)
3587: Reads up to
3588: .Fa max_size
1.1 deraadt 3589: characters into
1.16 jmc 3590: .Fa buf
3591: and returns the number of characters read.
3592: To indicate end-of-input, return 0 characters.
3593: Note that
3594: .Qq interactive
3595: scanners (see the
3596: .Fl B
1.1 deraadt 3597: and
1.16 jmc 3598: .Fl I
1.1 deraadt 3599: flags) define the macro
1.16 jmc 3600: .Dv YY_INTERACTIVE .
3601: If
3602: .Fn LexerInput
3603: has been redefined, and it's necessary to take different actions depending on
3604: whether or not the scanner might be scanning an interactive input source,
3605: it's possible to test for the presence of this name via
3606: .Dq #ifdef .
3607: .It virtual void LexerOutput(const char* buf, int size)
3608: Writes out
3609: .Fa size
1.1 deraadt 3610: characters from the buffer
1.16 jmc 3611: .Fa buf ,
3612: which, while NUL-terminated, may also contain
3613: .Qq internal
3614: NUL's if the scanner's rules can match text with NUL's in them.
3615: .It virtual void LexerError(const char* msg)
3616: Reports a fatal error message.
3617: The default version of this function writes the message to the stream
3618: .Fa cerr
1.1 deraadt 3619: and exits.
1.16 jmc 3620: .El
3621: .Pp
1.1 deraadt 3622: Note that a
1.16 jmc 3623: .Fa yyFlexLexer
3624: object contains its entire scanning state.
3625: Thus such objects can be used to create reentrant scanners.
3626: Multiple instances of the same
3627: .Fa yyFlexLexer
3628: class can be instantiated, and multiple C++ scanner classes can be combined
1.1 deraadt 3629: in the same program using the
1.16 jmc 3630: .Fl P
1.1 deraadt 3631: option discussed above.
1.16 jmc 3632: .Pp
1.1 deraadt 3633: Finally, note that the
1.16 jmc 3634: .Dq %array
3635: feature is not available to C++ scanner classes;
3636: .Dq %pointer
3637: must be used
3638: .Pq the default .
3639: .Pp
1.1 deraadt 3640: Here is an example of a simple C++ scanner:
1.16 jmc 3641: .Bd -literal -offset indent
3642: // An example of using the flex C++ scanner class.
1.1 deraadt 3643:
1.16 jmc 3644: %{
3645: #include <errno.h>
3646: int mylineno = 0;
3647: %}
1.1 deraadt 3648:
1.16 jmc 3649: string \e"[^\en"]+\e"
1.1 deraadt 3650:
1.16 jmc 3651: ws [ \et]+
1.1 deraadt 3652:
1.16 jmc 3653: alpha [A-Za-z]
3654: dig [0-9]
3655: name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
3656: num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
3657: num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
3658: number {num1}|{num2}
1.1 deraadt 3659:
1.16 jmc 3660: %%
1.1 deraadt 3661:
1.16 jmc 3662: {ws} /* skip blanks and tabs */
1.1 deraadt 3663:
1.16 jmc 3664: "/*" {
3665: int c;
1.1 deraadt 3666:
1.16 jmc 3667: while ((c = yyinput()) != 0) {
3668: if(c == '\en')
1.1 deraadt 3669: ++mylineno;
1.16 jmc 3670: else if(c == '*') {
3671: if ((c = yyinput()) == '/')
1.1 deraadt 3672: break;
3673: else
3674: unput(c);
3675: }
1.16 jmc 3676: }
3677: }
1.1 deraadt 3678:
1.16 jmc 3679: {number} std::cout << "number " << YYText() << '\en';
1.1 deraadt 3680:
1.16 jmc 3681: \en mylineno++;
1.1 deraadt 3682:
1.16 jmc 3683: {name} std::cout << "name " << YYText() << '\en';
1.1 deraadt 3684:
1.16 jmc 3685: {string} std::cout << "string " << YYText() << '\en';
3686:
3687: %%
3688:
3689: int main(int /* argc */, char** /* argv */)
3690: {
3691: FlexLexer* lexer = new yyFlexLexer;
3692: while(lexer->yylex() != 0)
3693: ;
3694: return 0;
3695: }
3696: .Ed
3697: .Pp
3698: To create multiple
3699: .Pq different
3700: lexer classes, use the
3701: .Fl P
3702: flag
3703: (or the
3704: .Dq prefix=
3705: option)
3706: to rename each
3707: .Fa yyFlexLexer
1.1 deraadt 3708: to some other
1.16 jmc 3709: .Fa xxFlexLexer .
3710: .Aq Pa g++/FlexLexer.h
3711: can then be included in other sources once per lexer class, first renaming
3712: .Fa yyFlexLexer
1.1 deraadt 3713: as follows:
1.16 jmc 3714: .Bd -literal -offset indent
3715: #undef yyFlexLexer
3716: #define yyFlexLexer xxFlexLexer
3717: #include <g++/FlexLexer.h>
3718:
3719: #undef yyFlexLexer
3720: #define yyFlexLexer zzFlexLexer
3721: #include <g++/FlexLexer.h>
3722: .Ed
3723: .Pp
3724: This is the arrangement to use if, for example,
3725: .Dq %option prefix="xx"
3726: is used for one scanner and
3727: .Dq %option prefix="zz"
3728: is used for the other.
3729: .Pp
3730: .Sy IMPORTANT :
3731: the present form of the scanning class is experimental
1.7 aaron 3732: and may change considerably between major releases.
1.16 jmc 3733: .Sh INCOMPATIBILITIES WITH LEX AND POSIX
3734: .Nm
1.1 deraadt 3735: is a rewrite of the AT&T Unix
1.16 jmc 3736: .Nm lex
3737: tool
3738: (the two implementations do not share any code, though),
3739: with some extensions and incompatibilities, both of which are of concern
3740: to those who wish to write scanners acceptable to either implementation.
3741: .Nm
3742: is fully compliant with the
3743: .Tn POSIX
3744: .Nm lex
1.1 deraadt 3745: specification, except that when using
1.16 jmc 3746: .Dq %pointer
3747: .Pq the default ,
3748: a call to
3749: .Fn unput
1.1 deraadt 3750: destroys the contents of
1.16 jmc 3751: .Fa yytext ,
3752: which is counter to the
3753: .Tn POSIX
3754: specification.
3755: .Pp
3756: In this section we discuss all of the known areas of incompatibility between
3757: .Nm ,
3758: AT&T
3759: .Nm lex ,
3760: and the
3761: .Tn POSIX
3762: specification.
3763: .Pp
3764: .Nm flex Ns 's
3765: .Fl l
1.1 deraadt 3766: option turns on maximum compatibility with the original AT&T
1.16 jmc 3767: .Nm lex
1.1 deraadt 3768: implementation, at the cost of a major loss in the generated scanner's
1.16 jmc 3769: performance.
3770: We note below which incompatibilities can be overcome using the
3771: .Fl l
1.1 deraadt 3772: option.
1.16 jmc 3773: .Pp
3774: .Nm
1.1 deraadt 3775: is fully compatible with
1.16 jmc 3776: .Nm lex
1.1 deraadt 3777: with the following exceptions:
1.16 jmc 3778: .Bl -dash
3779: .It
1.1 deraadt 3780: The undocumented
1.16 jmc 3781: .Nm lex
1.1 deraadt 3782: scanner internal variable
1.16 jmc 3783: .Fa yylineno
1.1 deraadt 3784: is not supported unless
1.16 jmc 3785: .Fl l
1.1 deraadt 3786: or
1.16 jmc 3787: .Dq %option yylineno
1.1 deraadt 3788: is used.
1.16 jmc 3789: .Pp
3790: .Fa yylineno
1.1 deraadt 3791: should be maintained on a per-buffer basis, rather than a per-scanner
1.16 jmc 3792: .Pq single global variable
3793: basis.
3794: .Pp
3795: .Fa yylineno
3796: is not part of the
3797: .Tn POSIX
3798: specification.
3799: .It
1.1 deraadt 3800: The
1.16 jmc 3801: .Fn input
1.1 deraadt 3802: routine is not redefinable, though it may be called to read characters
1.16 jmc 3803: following whatever has been matched by a rule.
3804: If
3805: .Fn input
3806: encounters an end-of-file, the normal
3807: .Fn yywrap
3808: processing is done.
3809: A
3810: .Dq real
3811: end-of-file is returned by
3812: .Fn input
1.1 deraadt 3813: as
1.16 jmc 3814: .Dv EOF .
3815: .Pp
1.1 deraadt 3816: Input is instead controlled by defining the
1.16 jmc 3817: .Dv YY_INPUT
1.1 deraadt 3818: macro.
1.16 jmc 3819: .Pp
1.1 deraadt 3820: The
1.16 jmc 3821: .Nm
1.1 deraadt 3822: restriction that
1.16 jmc 3823: .Fn input
3824: cannot be redefined is in accordance with the
3825: .Tn POSIX
3826: specification, which simply does not specify any way of controlling the
1.1 deraadt 3827: scanner's input other than by making an initial assignment to
1.16 jmc 3828: .Fa yyin .
3829: .It
1.1 deraadt 3830: The
1.16 jmc 3831: .Fn unput
3832: routine is not redefinable.
3833: This restriction is in accordance with
3834: .Tn POSIX .
3835: .It
3836: .Nm
1.1 deraadt 3837: scanners are not as reentrant as
1.16 jmc 3838: .Nm lex
3839: scanners.
3840: In particular, if a scanner is interactive and
3841: an interrupt handler long-jumps out of the scanner,
3842: and the scanner is subsequently called again,
3843: the following error message may be displayed:
3844: .Pp
3845: .D1 fatal flex scanner internal error--end of buffer missed
3846: .Pp
1.1 deraadt 3847: To reenter the scanner, first use
1.16 jmc 3848: .Pp
3849: .Dl yyrestart(yyin);
3850: .Pp
3851: Note that this call will throw away any buffered input;
3852: usually this isn't a problem with an interactive scanner.
3853: .Pp
3854: Also note that flex C++ scanner classes are reentrant,
3855: so if using C++ is an option, they should be used instead.
3856: See
3857: .Sx GENERATING C++ SCANNERS
3858: above for details.
3859: .It
3860: .Fn output
1.1 deraadt 3861: is not supported.
3862: Output from the
1.16 jmc 3863: .Em ECHO
1.1 deraadt 3864: macro is done to the file-pointer
1.16 jmc 3865: .Fa yyout
3866: .Pq default stdout .
3867: .Pp
3868: .Fn output
3869: is not part of the
3870: .Tn POSIX
3871: specification.
3872: .It
3873: .Nm lex
3874: does not support exclusive start conditions
3875: .Pq %x ,
3876: though they are in the
3877: .Tn POSIX
3878: specification.
3879: .It
1.1 deraadt 3880: When definitions are expanded,
1.16 jmc 3881: .Nm
1.1 deraadt 3882: encloses them in parentheses.
1.16 jmc 3883: With
3884: .Nm lex ,
3885: the following:
3886: .Bd -literal -offset indent
3887: NAME [A-Z][A-Z0-9]*
3888: %%
3889: foo{NAME}? printf("Found it\en");
3890: %%
3891: .Ed
3892: .Pp
3893: will not match the string
3894: .Qq foo
3895: because when the macro is expanded the rule is equivalent to
3896: .Qq foo[A-Z][A-Z0-9]*?
3897: and the precedence is such that the
3898: .Sq ?\&
3899: is associated with
3900: .Qq [A-Z0-9]* .
3901: With
3902: .Nm ,
1.1 deraadt 3903: the rule will be expanded to
1.16 jmc 3904: .Qq foo([A-Z][A-Z0-9]*)?
3905: and so the string
3906: .Qq foo
3907: will match.
3908: .Pp
1.1 deraadt 3909: Note that if the definition begins with
1.16 jmc 3910: .Sq ^
1.1 deraadt 3911: or ends with
1.16 jmc 3912: .Sq $
3913: then it is not expanded with parentheses, to allow these operators to appear in
3914: definitions without losing their special meanings.
3915: But the
3916: .Sq Aq s ,
3917: .Sq / ,
1.1 deraadt 3918: and
1.16 jmc 3919: .Aq Aq EOF
1.1 deraadt 3920: operators cannot be used in a
1.16 jmc 3921: .Nm
1.1 deraadt 3922: definition.
1.16 jmc 3923: .Pp
1.1 deraadt 3924: Using
1.16 jmc 3925: .Fl l
1.1 deraadt 3926: results in the
1.16 jmc 3927: .Nm lex
1.1 deraadt 3928: behavior of no parentheses around the definition.
1.16 jmc 3929: .Pp
3930: The
3931: .Tn POSIX
3932: specification is that the definition be enclosed in parentheses.
3933: .It
1.1 deraadt 3934: Some implementations of
1.16 jmc 3935: .Nm lex
3936: allow a rule's action to begin on a separate line,
3937: if the rule's pattern has trailing whitespace:
3938: .Bd -literal -offset indent
3939: %%
3940: foo|bar<space here>
3941: { foobar_action(); }
3942: .Ed
3943: .Pp
3944: .Nm
1.1 deraadt 3945: does not support this feature.
1.16 jmc 3946: .It
1.1 deraadt 3947: The
1.16 jmc 3948: .Nm lex
3949: .Sq %r
3950: .Pq generate a Ratfor scanner
3951: option is not supported.
3952: It is not part of the
3953: .Tn POSIX
3954: specification.
3955: .It
1.1 deraadt 3956: After a call to
1.16 jmc 3957: .Fn unput ,
3958: .Fa yytext
3959: is undefined until the next token is matched,
3960: unless the scanner was built using
3961: .Dq %array .
1.1 deraadt 3962: This is not the case with
1.16 jmc 3963: .Nm lex
3964: or the
3965: .Tn POSIX
3966: specification.
3967: The
3968: .Fl l
1.1 deraadt 3969: option does away with this incompatibility.
1.16 jmc 3970: .It
1.1 deraadt 3971: The precedence of the
1.16 jmc 3972: .Sq {}
3973: .Pq numeric range
3974: operator is different.
3975: .Nm lex
3976: interprets
3977: .Qq abc{1,3}
3978: as match one, two, or three occurrences of
3979: .Sq abc ,
3980: whereas
3981: .Nm
3982: interprets it as match
3983: .Sq ab
3984: followed by one, two, or three occurrences of
3985: .Sq c .
3986: The latter is in agreement with the
3987: .Tn POSIX
3988: specification.
3989: .It
1.1 deraadt 3990: The precedence of the
1.16 jmc 3991: .Sq ^
1.1 deraadt 3992: operator is different.
1.16 jmc 3993: .Nm lex
3994: interprets
3995: .Qq ^foo|bar
3996: as match either
3997: .Sq foo
3998: at the beginning of a line, or
3999: .Sq bar
4000: anywhere, whereas
4001: .Nm
4002: interprets it as match either
4003: .Sq foo
4004: or
4005: .Sq bar
4006: if they come at the beginning of a line.
4007: The latter is in agreement with the
4008: .Tn POSIX
4009: specification.
4010: .It
1.1 deraadt 4011: The special table-size declarations such as
1.16 jmc 4012: .Sq %a
1.1 deraadt 4013: supported by
1.16 jmc 4014: .Nm lex
1.1 deraadt 4015: are not required by
1.16 jmc 4016: .Nm
1.1 deraadt 4017: scanners;
1.16 jmc 4018: .Nm
1.1 deraadt 4019: ignores them.
1.16 jmc 4020: .It
1.1 deraadt 4021: The name
1.16 jmc 4022: .Dv FLEX_SCANNER
1.1 deraadt 4023: is #define'd so scanners may be written for use with either
1.16 jmc 4024: .Nm
1.1 deraadt 4025: or
1.16 jmc 4026: .Nm lex .
1.1 deraadt 4027: Scanners also include
1.16 jmc 4028: .Dv YY_FLEX_MAJOR_VERSION
1.1 deraadt 4029: and
1.16 jmc 4030: .Dv YY_FLEX_MINOR_VERSION
1.1 deraadt 4031: indicating which version of
1.16 jmc 4032: .Nm
1.1 deraadt 4033: generated the scanner
1.16 jmc 4034: (for example, for the 2.5 release, these defines would be 2 and 5,
1.1 deraadt 4035: respectively).
1.16 jmc 4036: .El
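.Pp
Two of the precedence differences listed above, the expansion of definitions
and the binding of the {} operator, can be tried out with POSIX extended
regular expressions.
The following C sketch uses regcomp(3) and regexec(3) as a stand-in for
flex patterns, which is an assumption made for illustration only;
whole_match is a hypothetical helper, and the parentheses in the sample
patterns spell out each interpretation explicitly.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if pattern matches the whole of text, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile.
 */
static int
whole_match(const char *pattern, const char *text)
{
	char anchored[256];
	regex_t re;
	int n, rc;

	/* Anchor the pattern so only a match of the entire text counts. */
	n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return -1;
	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With the parenthesized expansion, whole_match("foo([A-Z][A-Z0-9]*)?", "foo")
returns 1, while the unparenthesized reading,
written out as "foo[A-Z]([A-Z0-9]*)?", returns 0 for the same text.
Likewise whole_match("abc{1,3}", "abccc") returns 1 and
whole_match("abc{1,3}", "abcabc") returns 0,
whereas "(abc){1,3}" accepts "abcabc" and rejects "abccc".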
4037: .Pp
1.1 deraadt 4038: The following
1.16 jmc 4039: .Nm
1.1 deraadt 4040: features are not included in
1.16 jmc 4041: .Nm lex
4042: or the
4043: .Tn POSIX
4044: specification:
4045: .Bd -unfilled -offset indent
4046: C++ scanners
4047: %option
4048: start condition scopes
4049: start condition stacks
4050: interactive/non-interactive scanners
4051: yy_scan_string() and friends
4052: yyterminate()
4053: yy_set_interactive()
4054: yy_set_bol()
4055: YY_AT_BOL()
4056: <<EOF>>
4057: <*>
4058: YY_DECL
4059: YY_START
4060: YY_USER_ACTION
4061: YY_USER_INIT
4062: #line directives
4063: %{}'s around actions
4064: multiple actions on a line
4065: .Ed
4066: .Pp
4067: plus almost all of the
4068: .Nm
4069: flags.
1.1 deraadt 4070: The last feature in the list refers to the fact that with
1.16 jmc 4071: .Nm
4072: multiple actions can be placed on the same line,
4073: separated with semi-colons, while with
4074: .Nm lex ,
1.1 deraadt 4075: the following
1.16 jmc 4076: .Pp
4077: .Dl foo handle_foo(); ++num_foos_seen;
4078: .Pp
4079: is
4080: .Pq rather surprisingly
4081: truncated to
4082: .Pp
4083: .Dl foo handle_foo();
4084: .Pp
4085: .Nm
4086: does not truncate the action.
4087: Actions that are not enclosed in braces
4088: are simply terminated at the end of the line.
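.Pp
Code that relies on any of the flex-only features listed above can be
guarded with the FLEX_SCANNER name mentioned earlier in this section.
The following C sketch shows one way to do so;
scanner_generator and scanner_major_version are hypothetical helpers,
and in a real scanner such code would normally sit in the definitions
or user code section of the input file.

```c
/*
 * Name the scanner generator, judging only by the macros that flex
 * is documented to define in the scanners it generates.
 */
static const char *
scanner_generator(void)
{
#ifdef FLEX_SCANNER
	return "flex";
#else
	return "not flex";
#endif
}

/* Major version of the generating flex, or -1 when it is not known. */
static int
scanner_major_version(void)
{
#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION)
	return YY_FLEX_MAJOR_VERSION;
#else
	return -1;
#endif
}
```

Compiled outside of a flex-generated scanner, neither macro is defined,
so the first function returns "not flex" and the second returns -1.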
4089: .Sh FILES
4090: .Bl -tag -width "<g++/FlexLexer.h>"
4091: .It flex.skl
4092: Skeleton scanner.
4093: This file is only used when building flex, not when
4094: .Nm
4095: executes.
4096: .It lex.backup
4097: Backing-up information for the
4098: .Fl b
4099: flag (called
4100: .Pa lex.bck
4101: on some systems).
4102: .It lex.yy.c
4103: Generated scanner
4104: (called
4105: .Pa lexyy.c
4106: on some systems).
4107: .It lex.yy.cc
4108: Generated C++ scanner class, when using
4109: .Fl + .
4110: .It Aq g++/FlexLexer.h
4111: Header file defining the C++ scanner base class,
4112: .Fa FlexLexer ,
4113: and its derived class,
4114: .Fa yyFlexLexer .
4115: .It /usr/lib/libl.*
4116: .Nm
4117: libraries.
4118: The
4119: .Pa /usr/lib/libfl.*\&
4120: libraries are links to these.
4121: Scanners must be linked using either
4122: .Fl \&ll
4123: or
4124: .Fl lfl .
4125: .El
4126: .Sh DIAGNOSTICS
4127: .Bl -diag
4128: .It warning, rule cannot be matched
4129: Indicates that the given rule cannot be matched because it follows other rules
4130: that will always match the same text as it.
4131: For example, in the following
4132: .Dq foo
4133: cannot be matched because it comes after an identifier
4134: .Qq catch-all
4135: rule:
4136: .Bd -literal -offset indent
4137: [a-z]+ got_identifier();
4138: foo got_foo();
4139: .Ed
4140: .Pp
1.1 deraadt 4141: Using
1.16 jmc 4142: .Em REJECT
1.1 deraadt 4143: in a scanner suppresses this warning.
1.16 jmc 4144: .It "warning, \-s option given but default rule can be matched"
4145: Means that it is possible
4146: .Pq perhaps only in a particular start condition
4147: that the default rule
4148: .Pq match any single character
4149: is the only one that will match a particular input.
4150: Since
4151: .Fl s
1.1 deraadt 4152: was given, presumably this is not intended.
1.16 jmc 4153: .It reject_used_but_not_detected undefined
4154: .It yymore_used_but_not_detected undefined
4155: These errors can occur at compile time.
4156: They indicate that the scanner uses
4157: .Em REJECT
1.1 deraadt 4158: or
1.16 jmc 4159: .Fn yymore
1.1 deraadt 4160: but that
1.16 jmc 4161: .Nm
1.1 deraadt 4162: failed to notice the fact, meaning that
1.16 jmc 4163: .Nm
1.1 deraadt 4164: scanned the first two sections looking for occurrences of these actions
1.16 jmc 4165: and failed to find any, but somehow they snuck in
4166: .Pq via an #include file, for example .
4167: Use
4168: .Dq %option reject
4169: or
4170: .Dq %option yymore
4171: to indicate to
4172: .Nm
4173: that these features are really needed.
4174: .It flex scanner jammed
4175: A scanner compiled with
4176: .Fl s
4177: has encountered an input string which wasn't matched by any of its rules.
4178: This error can also occur due to internal problems.
4179: .It token too large, exceeds YYLMAX
4180: The scanner uses
4181: .Dq %array
1.1 deraadt 4182: and one of its rules matched a string longer than the
1.16 jmc 4183: .Dv YYLMAX
4184: constant
4185: .Pq 8K bytes by default .
4186: The value can be increased by #define'ing
4187: .Dv YYLMAX
4188: in the definitions section of
4189: .Nm
1.1 deraadt 4190: input.
1.16 jmc 4191: .It "scanner requires \-8 flag to use the character 'x'"
4192: The scanner specification includes recognizing the 8-bit character
4193: .Sq x
4194: and the
4195: .Fl 8
4196: flag was not specified, and the scanner defaulted to 7-bit because the
4197: .Fl Cf
4198: or
4199: .Fl CF
4200: table compression options were used.
4201: See the discussion of the
4202: .Fl 7
1.1 deraadt 4203: flag for details.
1.16 jmc 4204: .It flex scanner push-back overflow
4205: unput() was used to push back so much text that the scanner's buffer
4206: could not hold both the pushed-back text and the current token in
4207: .Fa yytext .
4208: Ideally the scanner should dynamically resize the buffer in this case,
4209: but at present it does not.
4210: .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4211: The scanner was working on matching an extremely large token and needed
4212: to expand the input buffer.
4213: This doesn't work with scanners that use
4214: .Em REJECT .
4215: .It "fatal flex scanner internal error--end of buffer missed"
1.1 deraadt 4216: This can occur in a scanner which is reentered after a long-jump
1.16 jmc 4217: has jumped out
4218: .Pq or over
4219: the scanner's activation frame.
4220: Before reentering the scanner, use:
4221: .Pp
4222: .Dl yyrestart(yyin);
4223: .Pp
1.1 deraadt 4224: or, as noted above, switch to using the C++ scanner class.
1.16 jmc 4225: .It "too many start conditions in <> construct!"
4226: More start conditions than exist were listed in a <> construct
4227: (so at least one of them must have been listed twice).
4228: .El
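.Pp
The
.Dq rule cannot be matched
warning above follows from how a rule is chosen: the longest match wins,
and ties go to the rule listed first.
The following is a toy model of that selection for the two rules in the
example, assuming ASCII input.
It is not the code flex generates, and all function names are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Length of the [a-z]+ match at the start of s, or 0 for no match. */
static size_t
match_identifier(const char *s)
{
	size_t n = 0;

	while (s[n] >= 'a' && s[n] <= 'z')
		n++;
	return n;
}

/* Length of the literal "foo" match at the start of s, or 0. */
static size_t
match_foo(const char *s)
{
	return strncmp(s, "foo", 3) == 0 ? 3 : 0;
}

/*
 * Toy rule selection: longest match wins, ties go to the rule
 * listed first.  Returns 0 for the [a-z]+ rule, 1 for the foo rule,
 * or -1 if neither rule matches.
 */
static int
select_rule(const char *s)
{
	size_t len[2];
	int best = -1;
	int i;

	len[0] = match_identifier(s);
	len[1] = match_foo(s);
	for (i = 0; i < 2; i++)
		if (len[i] > 0 && (best == -1 || len[i] > len[best]))
			best = i;
	return best;
}
```

Whenever the foo rule matches three characters, the identifier rule
matches at least as many, so select_rule() never returns 1.
Listing the foo rule first would let it win the tie on the input
.Dq foo .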
4229: .Sh SEE ALSO
4230: .Xr awk 1 ,
4231: .Xr lex 1 ,
4232: .Xr sed 1 ,
4233: .Xr yacc 1
4234: .Pp
4235: .Rs
4236: .%A John Levine
4237: .%A Tony Mason
4238: .%A Doug Brown
4239: .%B Lex & Yacc
4240: .%I O'Reilly and Associates
4241: .%N 2nd edition
4242: .Re
4243: .Rs
4244: .%A M. E. Lesk
4245: .%A E. Schmidt
4246: .%B LEX \- Lexical Analyzer Generator
4247: .Re
4248: .Rs
4249: .%A Alfred Aho
4250: .%A Ravi Sethi
4251: .%A Jeffrey Ullman
4252: .%B Compilers: Principles, Techniques and Tools
4253: .%I Addison-Wesley
4254: .%D 1986
4255: .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4256: .Re
4257: .Sh AUTHORS
1.1 deraadt 4258: Vern Paxson, with the help of many ideas and much inspiration from
1.16 jmc 4259: Van Jacobson.
4260: Original version by Jef Poskanzer.
4261: The fast table representation is a partial implementation of a design done by
4262: Van Jacobson.
4263: The implementation was done by Kevin Gong and Vern Paxson.
4264: .Pp
1.1 deraadt 4265: Thanks to the many
1.16 jmc 4266: .Nm
1.1 deraadt 4267: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4268: Casey Leedom,
4269: Robert Abramovitz,
4270: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4271: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4272: Karl Berry, Peter A. Bigot, Simon Blanchard,
4273: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4274: Brian Clapper, J.T. Conklin,
4275: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
1.11 deraadt 4276: Daniels, Chris G. Demetriou, Theo de Raadt,
1.1 deraadt 4277: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4278: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4279: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4280: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4281: Jan Hajic, Charles Hemphill, NORO Hideo,
4282: Jarkko Hietaniemi, Scott Hofmann,
4283: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4284: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4285: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4286: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4287: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4288: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4289: David Loffredo, Mike Long,
4290: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4291: Bengt Martensson, Chris Metcalf,
4292: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4293: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4294: Richard Ohnemus, Karsten Pahnke,
1.16 jmc 4295: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4296: Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
1.1 deraadt 4297: Frederic Raimbault, Pat Rankin, Rick Richardson,
4298: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4299: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4300: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4301: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4302: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4303: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
1.16 jmc 4304: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4305: Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4306: and those whose names have slipped my marginal mail-archiving skills
4307: but whose contributions are appreciated all the
1.1 deraadt 4308: same.
1.16 jmc 4309: .Pp
1.1 deraadt 4310: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4311: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4312: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4313: distribution headaches.
1.16 jmc 4314: .Pp
4315: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4316: to Benson Margulies and Fred Burke for C++ support;
4317: to Kent Williams and Tom Epperly for C++ class support;
4318: to Ove Ewerlid for support of NUL's;
4319: and to Eric Hughes for support of multiple buffers.
4320: .Pp
1.1 deraadt 4321: This work was primarily done when I was with the Real Time Systems Group
1.16 jmc 4322: at the Lawrence Berkeley Laboratory in Berkeley, CA.
4323: Many thanks to all there for the support I received.
4324: .Pp
4325: Send comments to
4326: .Aq vern@ee.lbl.gov .
4327: .Sh BUGS
4328: Some trailing context patterns cannot be properly matched and generate
4329: warning messages
4330: .Pq "dangerous trailing context" .
4331: These are patterns where the ending of the first part of the rule
4332: matches the beginning of the second part, such as
4333: .Qq zx*/xy* ,
4334: where the
4335: .Sq x*
4336: matches the
4337: .Sq x
4338: at the beginning of the trailing context.
4339: (Note that the POSIX draft states that the text matched by such patterns
4340: is undefined.)
4341: .Pp
4342: For some trailing context rules, parts which are actually fixed-length are
4343: not recognized as such, leading to the above-mentioned performance loss.
4344: In particular, parts using
4345: .Sq |\&
4346: or
4347: .Sq {n}
4348: (such as
4349: .Qq foo{3} )
4350: are always considered variable-length.
4351: .Pp
4352: Combining trailing context with the special
4353: .Sq |\&
4354: action can result in fixed trailing context being turned into
4355: the more expensive variable trailing context.
4356: For example, this conversion occurs in the following:
4357: .Bd -literal -offset indent
4358: %%
4359: abc |
4360: xyz/def
4361: .Ed
4362: .Pp
4363: Use of
4364: .Fn unput
4365: invalidates yytext and yyleng, unless the
4366: .Dq %array
4367: directive
4368: or the
4369: .Fl l
4370: option has been used.
4371: .Pp
4372: Pattern-matching of NUL's is substantially slower than matching other
4373: characters.
4374: .Pp
4375: Dynamic resizing of the input buffer is slow, as it entails rescanning
4376: all the text matched so far by the current
4377: .Pq generally huge
4378: token.
4379: .Pp
4380: Due to both buffering of input and read-ahead,
4381: it is not possible to intermix calls to
4382: .Aq Pa stdio.h
4383: routines, such as
4384: .Fn getchar ,
4385: with
4386: .Nm
4387: rules and expect it to work.
4388: Call
4389: .Fn input
4390: instead.
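.Pp
The read-ahead problem can be illustrated with a toy buffered reader.
This is a sketch under stated assumptions, not flex's actual buffer
code: toy_scanner, toy_input and demo_readahead are hypothetical names,
and the 64-byte buffer size is arbitrary.
The reader fills its buffer with one large read, so a later stdio call
on the same stream no longer sees the characters that were read ahead.

```c
#include <stdio.h>

/*
 * Toy stand-in for a scanner's input buffer: it fills the buffer
 * with one large read, then hands out one character at a time.
 */
struct toy_scanner {
	char	buf[64];
	size_t	len;
	size_t	pos;
};

/* Return the next buffered character, refilling from fp as needed. */
static int
toy_input(struct toy_scanner *sc, FILE *fp)
{
	if (sc->pos == sc->len) {
		sc->len = fread(sc->buf, 1, sizeof(sc->buf), fp);
		sc->pos = 0;
		if (sc->len == 0)
			return EOF;
	}
	return (unsigned char)sc->buf[sc->pos++];
}

/*
 * Write "abc" to a temporary file, take one character through the
 * toy scanner, then call getc() on the same stream.  The results are
 * stored in *from_scanner and *from_stdio; returns 0 on success.
 */
static int
demo_readahead(int *from_scanner, int *from_stdio)
{
	struct toy_scanner sc = { {0}, 0, 0 };
	FILE *fp = tmpfile();

	if (fp == NULL)
		return -1;
	fputs("abc", fp);
	rewind(fp);
	*from_scanner = toy_input(&sc, fp);	/* 'a'; "bc" is now in sc.buf */
	*from_stdio = getc(fp);			/* EOF, not 'b' */
	fclose(fp);
	return 0;
}
```

Reading every character through the same buffered routine, as
toy_input() does here, keeps the input in order.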
4391: .Pp
4392: The total table entries listed by the
4393: .Fl v
4394: flag exclude the number of table entries needed to determine
4395: what rule has been matched.
4396: The number of entries is equal to the number of DFA states
4397: if the scanner does not use
4398: .Em REJECT ,
4399: and somewhat greater than the number of states if it does.
4400: .Pp
4401: .Em REJECT
4402: cannot be used with the
4403: .Fl f
4404: or
4405: .Fl F
4406: options.
4407: .Pp
4408: The
4409: .Nm
4410: internal algorithms need documentation.