Annotation of src/usr.bin/lex/flex.1, Revision 1.6
1.6 ! aaron 1: .\" $OpenBSD: flex.1,v 1.5 1998/08/17 03:20:23 deraadt Exp $
1.2 deraadt 2: .\"
1.1 deraadt 3: .TH FLEX 1 "April 1995" "Version 2.5"
4: .SH NAME
5: flex \- fast lexical analyzer generator
6: .SH SYNOPSIS
7: .B flex
8: .B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
9: .B [\-\-help \-\-version]
10: .I [filename ...]
11: .SH OVERVIEW
12: This manual describes
13: .I flex,
14: a tool for generating programs that perform pattern-matching on text. The
15: manual includes both tutorial and reference sections:
16: .nf
17:
18: Description
19: a brief overview of the tool
20:
21: Some Simple Examples
22:
23: Format Of The Input File
24:
25: Patterns
26: the extended regular expressions used by flex
27:
28: How The Input Is Matched
29: the rules for determining what has been matched
30:
31: Actions
32: how to specify what to do when a pattern is matched
33:
34: The Generated Scanner
35: details regarding the scanner that flex produces;
36: how to control the input source
37:
38: Start Conditions
39: introducing context into your scanners, and
40: managing "mini-scanners"
41:
42: Multiple Input Buffers
43: how to manipulate multiple input sources; how to
44: scan from strings instead of files
45:
46: End-of-file Rules
47: special rules for matching the end of the input
48:
49: Miscellaneous Macros
50: a summary of macros available to the actions
51:
52: Values Available To The User
53: a summary of values available to the actions
54:
55: Interfacing With Yacc
56: connecting flex scanners together with yacc parsers
57:
58: Options
59: flex command-line options, and the "%option"
60: directive
61:
62: Performance Considerations
63: how to make your scanner go as fast as possible
64:
65: Generating C++ Scanners
66: the (experimental) facility for generating C++
67: scanner classes
68:
69: Incompatibilities With Lex And POSIX
70: how flex differs from AT&T lex and the POSIX lex
71: standard
72:
73: Diagnostics
74: those error messages produced by flex (or scanners
75: it generates) whose meanings might not be apparent
76:
77: Files
78: files used by flex
79:
80: Deficiencies / Bugs
81: known problems with flex
82:
83: See Also
84: other documentation, related tools
85:
86: Author
87: includes contact information
88:
89: .fi
90: .SH DESCRIPTION
91: .I flex
92: is a tool for generating
93: .I scanners:
94: programs which recognize lexical patterns in text.
95: .I flex
96: reads
97: the given input files, or its standard input if no file names are given,
98: for a description of a scanner to generate. The description is in
99: the form of pairs
100: of regular expressions and C code, called
101: .I rules. flex
102: generates as output a C source file,
103: .B lex.yy.c,
104: which defines a routine
105: .B yylex().
106: This file is compiled and linked with the
107: .B \-lfl
108: library to produce an executable. When the executable is run,
109: it analyzes its input for occurrences
110: of the regular expressions. Whenever it finds one, it executes
111: the corresponding C code.
112: .SH SOME SIMPLE EXAMPLES
113: .PP
114: First some simple examples to get the flavor of how one uses
115: .I flex.
116: The following
117: .I flex
118: input specifies a scanner which, whenever it encounters the string
119: "username", replaces it with the user's login name:
120: .nf
121:
122: %%
123: username printf( "%s", getlogin() );
124:
125: .fi
126: By default, any text not matched by a
127: .I flex
128: scanner
129: is copied to the output, so the net effect of this scanner is
130: to copy its input file to its output with each occurrence
131: of "username" expanded.
132: In this input, there is just one rule. "username" is the
133: .I pattern
134: and the "printf" is the
135: .I action.
136: The "%%" marks the beginning of the rules.
137: .PP
138: Here's another simple example:
139: .nf
140:
141: int num_lines = 0, num_chars = 0;
142:
143: %%
144: \\n ++num_lines; ++num_chars;
145: . ++num_chars;
146:
147: %%
148: main()
149: {
150: yylex();
151: printf( "# of lines = %d, # of chars = %d\\n",
152: num_lines, num_chars );
153: }
154:
155: .fi
156: This scanner counts the number of characters and the number
157: of lines in its input (it produces no output other than the
158: final report on the counts). The first line
159: declares two globals, "num_lines" and "num_chars", which are accessible
160: both inside
161: .B yylex()
162: and in the
163: .B main()
164: routine declared after the second "%%". There are two rules, one
165: which matches a newline ("\\n") and increments both the line count and
166: the character count, and one which matches any character other than
167: a newline (indicated by the "." regular expression).
168: .PP
169: A somewhat more complicated example:
170: .nf
171:
172: /* scanner for a toy Pascal-like language */
173:
174: %{
175: /* need this for the calls to atoi() and atof() below */
176: #include <stdlib.h>
177: %}
178:
179: DIGIT [0-9]
180: ID [a-z][a-z0-9]*
181:
182: %%
183:
184: {DIGIT}+ {
185: printf( "An integer: %s (%d)\\n", yytext,
186: atoi( yytext ) );
187: }
188:
189: {DIGIT}+"."{DIGIT}* {
190: printf( "A float: %s (%g)\\n", yytext,
191: atof( yytext ) );
192: }
193:
194: if|then|begin|end|procedure|function {
195: printf( "A keyword: %s\\n", yytext );
196: }
197:
198: {ID} printf( "An identifier: %s\\n", yytext );
199:
200: "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
201:
202: "{"[^}\\n]*"}" /* eat up one-line comments */
203:
204: [ \\t\\n]+ /* eat up whitespace */
205:
206: . printf( "Unrecognized character: %s\\n", yytext );
207:
208: %%
209:
210: main( argc, argv )
211: int argc;
212: char **argv;
213: {
214: ++argv, --argc; /* skip over program name */
215: if ( argc > 0 )
216: yyin = fopen( argv[0], "r" );
217: else
218: yyin = stdin;
219:
220: yylex();
221: }
222:
223: .fi
224: This is the beginning of a simple scanner for a language like
225: Pascal. It identifies different types of
226: .I tokens
227: and reports on what it has seen.
228: .PP
229: The details of this example will be explained in the following
230: sections.
231: .SH FORMAT OF THE INPUT FILE
232: The
233: .I flex
234: input file consists of three sections, separated by a line with just
235: .B %%
236: in it:
237: .nf
238:
239: definitions
240: %%
241: rules
242: %%
243: user code
244:
245: .fi
246: The
247: .I definitions
248: section contains declarations of simple
249: .I name
250: definitions to simplify the scanner specification, and declarations of
251: .I start conditions,
252: which are explained in a later section.
253: .PP
254: Name definitions have the form:
255: .nf
256:
257: name definition
258:
259: .fi
260: The "name" is a word beginning with a letter or an underscore ('_')
261: followed by zero or more letters, digits, '_', or '-' (dash).
262: The definition is taken to begin at the first non-white-space character
263: following the name and continuing to the end of the line.
264: The definition can subsequently be referred to using "{name}", which
265: will expand to "(definition)". For example,
266: .nf
267:
268: DIGIT [0-9]
269: ID [a-z][a-z0-9]*
270:
271: .fi
272: defines "DIGIT" to be a regular expression which matches a
273: single digit, and
274: "ID" to be a regular expression which matches a letter
275: followed by zero-or-more letters-or-digits.
276: A subsequent reference to
277: .nf
278:
279: {DIGIT}+"."{DIGIT}*
280:
281: .fi
282: is identical to
283: .nf
284:
285: ([0-9])+"."([0-9])*
286:
287: .fi
288: and matches one-or-more digits followed by a '.' followed
289: by zero-or-more digits.
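The textual substitution described above can be sketched in plain C. This is only an illustration of the "{name}" to "(definition)" rewrite, not flex's own implementation: the struct name_def type and the expand_names() function are hypothetical names, and the sketch assumes that braces inside quoted strings or character classes do not need special handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One name definition, e.g. DIGIT -> [0-9] (hypothetical type). */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Copy pattern into out, replacing each "{name}" that matches an entry in
 * defs with "(definition)". Other text, including "{2,5}"-style repeat
 * counts, is copied unchanged. Returns 0 on success, -1 if out is too small.
 */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t used = 0;

    assert(pattern != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const struct name_def *hit = NULL;

        if (close != NULL) {
            /* Length of the text between '{' and '}'. */
            size_t len = (size_t)(close - pattern - 1);
            for (size_t i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    hit = &defs[i];
                    break;
                }
            }
        }

        if (hit != NULL) {
            size_t deflen = strlen(hit->definition);
            if (used + deflen + 2 >= outsize)   /* "(" + definition + ")" */
                return -1;
            out[used++] = '(';
            memcpy(out + used, hit->definition, deflen);
            used += deflen;
            out[used++] = ')';
            pattern = close + 1;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

Called with the DIGIT definition and the pattern {DIGIT}+"."{DIGIT}*, the function produces the parenthesized form shown above. The parentheses matter: they keep a following operator such as '+' applied to the whole definition rather than to its last element.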
290: .PP
291: The
292: .I rules
293: section of the
294: .I flex
295: input contains a series of rules of the form:
296: .nf
297:
298: pattern action
299:
300: .fi
301: where the pattern must be unindented and the action must begin
302: on the same line.
303: .PP
304: See below for a further description of patterns and actions.
305: .PP
306: Finally, the user code section is simply copied to
307: .B lex.yy.c
308: verbatim.
309: It is used for companion routines which call or are called
310: by the scanner. The presence of this section is optional;
311: if it is missing, the second
312: .B %%
313: in the input file may be skipped, too.
314: .PP
315: In the definitions and rules sections, any
316: .I indented
317: text or text enclosed in
318: .B %{
319: and
320: .B %}
321: is copied verbatim to the output (with the %{}'s removed).
322: The %{}'s must appear unindented on lines by themselves.
323: .PP
324: In the rules section,
325: any indented or %{} text appearing before the
326: first rule may be used to declare variables
327: which are local to the scanning routine and (after the declarations)
328: code which is to be executed whenever the scanning routine is entered.
329: Other indented or %{} text in the rule section is still copied to the output,
330: but its meaning is not well-defined and it may well cause compile-time
331: errors (this feature is present for
332: .I POSIX
333: compliance; see below for other such features).
334: .PP
335: In the definitions section (but not in the rules section),
336: an unindented comment (i.e., a line
337: beginning with "/*") is also copied verbatim to the output up
338: to the next "*/".
339: .SH PATTERNS
340: The patterns in the input are written using an extended set of regular
341: expressions. These are:
342: .nf
343:
344: x match the character 'x'
345: . any character (byte) except newline
346: [xyz] a "character class"; in this case, the pattern
347: matches either an 'x', a 'y', or a 'z'
348: [abj-oZ] a "character class" with a range in it; matches
349: an 'a', a 'b', any letter from 'j' through 'o',
350: or a 'Z'
351: [^A-Z] a "negated character class", i.e., any character
352: but those in the class. In this case, any
353: character EXCEPT an uppercase letter.
354: [^A-Z\\n] any character EXCEPT an uppercase letter or
355: a newline
356: r* zero or more r's, where r is any regular expression
357: r+ one or more r's
358: r? zero or one r's (that is, "an optional r")
359: r{2,5} anywhere from two to five r's
360: r{2,} two or more r's
361: r{4} exactly 4 r's
362: {name} the expansion of the "name" definition
363: (see above)
364: "[xyz]\\"foo"
365: the literal string: [xyz]"foo
366: \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
367: then the ANSI-C interpretation of \\x.
368: Otherwise, a literal 'X' (used to escape
369: operators such as '*')
370: \\0 a NUL character (ASCII code 0)
371: \\123 the character with octal value 123
372: \\x2a the character with hexadecimal value 2a
373: (r) match an r; parentheses are used to override
374: precedence (see below)
375:
376:
377: rs the regular expression r followed by the
378: regular expression s; called "concatenation"
379:
380:
381: r|s either an r or an s
382:
383:
384: r/s an r but only if it is followed by an s. The
385: text matched by s is included when determining
386: whether this rule is the "longest match",
387: but is then returned to the input before
388: the action is executed. So the action only
389: sees the text matched by r. This type
390: of pattern is called "trailing context".
391: (There are some combinations of r/s that flex
392: cannot match correctly; see notes in the
393: Deficiencies / Bugs section below regarding
394: "dangerous trailing context".)
395: ^r an r, but only at the beginning of a line (i.e.,
396: when just starting to scan, or right after a
397: newline has been scanned).
398: r$ an r, but only at the end of a line (i.e., just
399: before a newline). Equivalent to "r/\\n".
400:
401: Note that flex's notion of "newline" is exactly
402: whatever the C compiler used to compile flex
403: interprets '\\n' as; in particular, on some DOS
404: systems you must either filter out \\r's in the
405: input yourself, or explicitly use r/\\r\\n for "r$".
406:
407:
408: <s>r an r, but only in start condition s (see
409: below for discussion of start conditions)
410: <s1,s2,s3>r
411: same, but in any of start conditions s1,
412: s2, or s3
413: <*>r an r in any start condition, even an exclusive one.
414:
415:
416: <<EOF>> an end-of-file
417: <s1,s2><<EOF>>
418: an end-of-file when in start condition s1 or s2
419:
420: .fi
421: Note that inside of a character class, all regular expression operators
422: lose their special meaning except escape ('\\') and the character class
423: operators, '-', ']', and, at the beginning of the class, '^'.
424: .PP
425: The regular expressions listed above are grouped according to
426: precedence, from highest precedence at the top to lowest at the bottom.
427: Those grouped together have equal precedence. For example,
428: .nf
429:
430: foo|bar*
431:
432: .fi
433: is the same as
434: .nf
435:
436: (foo)|(ba(r*))
437:
438: .fi
439: since the '*' operator has higher precedence than concatenation,
440: and concatenation higher than alternation ('|'). This pattern
441: therefore matches
442: .I either
443: the string "foo"
444: .I or
445: the string "ba" followed by zero-or-more r's.
446: To match "foo" or zero-or-more "bar"'s, use:
447: .nf
448:
449: foo|(bar)*
450:
451: .fi
452: and to match zero-or-more "foo"'s-or-"bar"'s:
453: .nf
454:
455: (foo|bar)*
456:
457: .fi
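The three groupings above can be checked outside flex. The sketch below uses the POSIX regcomp() and regexec() interface as a stand-in, on the assumption that a POSIX system is available and that POSIX extended regular expressions rank '*', concatenation and '|' in the same relative order; it is not flex's own matcher. The matches_whole() helper is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches the extended regular expression
 * pattern, 0 if it does not, and -1 if the pattern cannot be compiled.
 * The pattern is wrapped as ^(pattern)$ so that only full matches count.
 */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, "foo|bar*" accepts "barrr" but rejects "barbar", while "foo|(bar)*" accepts "barbar" and "(foo|bar)*" accepts "foobarfoo".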
458: .PP
459: In addition to characters and ranges of characters, character classes
460: can also contain character class
461: .I expressions.
462: These are expressions enclosed inside
463: .B [:
464: and
465: .B :]
466: delimiters (which themselves must appear between the '[' and ']' of the
467: character class; other elements may occur inside the character class, too).
468: The valid expressions are:
469: .nf
470:
471: [:alnum:] [:alpha:] [:blank:]
472: [:cntrl:] [:digit:] [:graph:]
473: [:lower:] [:print:] [:punct:]
474: [:space:] [:upper:] [:xdigit:]
475:
476: .fi
477: These expressions all designate a set of characters equivalent to
478: the corresponding standard C
479: .B isXXX
480: function. For example,
481: .B [:alnum:]
482: designates those characters for which
483: .B isalnum()
484: returns true - i.e., any alphabetic or numeric character.
485: Some systems don't provide
486: .B isblank(),
487: so flex defines
488: .B [:blank:]
489: as a blank or a tab.
490: .PP
491: For example, the following character classes are all equivalent:
492: .nf
493:
494: [[:alnum:]]
1.4 deraadt 495: [[:alpha:][:digit:]]
1.1 deraadt 496: [[:alpha:]0-9]
497: [a-zA-Z0-9]
498:
499: .fi
500: If your scanner is case-insensitive (the
501: .B \-i
502: flag), then
503: .B [:upper:]
504: and
505: .B [:lower:]
506: are equivalent to
507: .B [:alpha:].
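The equivalence of the character classes listed above follows from the isXXX functions they are defined by, and can be checked with a short C routine. This is a sketch that assumes the default "C" locale and an ASCII character set; the function names count_alnum_mismatches() and is_blank_or_tab() are hypothetical.

```c
#include <ctype.h>

/*
 * Count the byte values for which the three spellings of the class
 * disagree. Returns 0 when they all describe the same set.
 */
int count_alnum_mismatches(void)
{
    int mismatches = 0;

    for (int c = 0; c <= 255; c++) {
        /* [[:alnum:]] */
        int by_alnum = isalnum(c) != 0;
        /* [[:alpha:][:digit:]] */
        int by_parts = isalpha(c) != 0 || isdigit(c) != 0;
        /* [a-zA-Z0-9] */
        int by_range = (c >= 'a' && c <= 'z') ||
                       (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9');

        if (by_alnum != by_parts || by_parts != by_range)
            mismatches++;
    }
    return mismatches;
}

/* The fallback meaning of [:blank:] given above: a blank or a tab. */
int is_blank_or_tab(int c)
{
    return c == ' ' || c == '\t';
}
```

In other locales isalpha() may accept additional characters, in which case the explicit range [a-zA-Z0-9] would no longer describe the same set.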
508: .PP
509: Some notes on patterns:
510: .IP -
511: A negated character class such as the example "[^A-Z]"
512: above
513: .I will match a newline
514: unless "\\n" (or an equivalent escape sequence) is one of the
515: characters explicitly present in the negated character class
516: (e.g., "[^A-Z\\n]"). This is unlike how many other regular
517: expression tools treat negated character classes, but unfortunately
518: the inconsistency is historically entrenched.
519: Matching newlines means that a pattern like [^"]* can match the entire
520: input unless there's another quote in the input.
521: .IP -
522: A rule can have at most one instance of trailing context (the '/' operator
523: or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
524: can only occur at the beginning of a pattern and, like '/' and '$',
525: cannot be grouped inside parentheses. A '^' which does not occur at
526: the beginning of a rule or a '$' which does not occur at the end of
527: a rule loses its special properties and is treated as a normal character.
528: .IP
529: The following are illegal:
530: .nf
531:
532: foo/bar$
533: <sc1>foo<sc2>bar
534:
535: .fi
536: Note that the first of these can be written "foo/bar\\n".
537: .IP
538: The following will result in '$' or '^' being treated as a normal character:
539: .nf
540:
541: foo|(bar$)
542: foo|^bar
543:
544: .fi
545: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
546: could be used (the special '|' action is explained below):
547: .nf
548:
549: foo |
550: bar$ /* action goes here */
551:
552: .fi
553: A similar trick will work for matching a foo or a
554: bar-at-the-beginning-of-a-line.
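The first note above, that a negated character class matches a newline unless the newline is excluded explicitly, can be illustrated with the standard strcspn() function. This is a sketch of how far each class would extend a match, not of flex's matcher; the two function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Length of the prefix that [^"]* would match: it stops only at a quote. */
size_t span_not_quote(const char *s)
{
    assert(s != NULL);
    return strcspn(s, "\"");
}

/* Length of the prefix that [^"\n]* would match: quote or newline stops it. */
size_t span_not_quote_or_newline(const char *s)
{
    assert(s != NULL);
    return strcspn(s, "\"\n");
}
```

On text with no quote at all, the first function returns the length of the entire string, which corresponds to the pattern swallowing the whole input.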
555: .SH HOW THE INPUT IS MATCHED
556: When the generated scanner is run, it analyzes its input looking
557: for strings which match any of its patterns. If it finds more than
558: one match, it takes the one matching the most text (for trailing
559: context rules, this includes the length of the trailing part, even
560: though it will then be returned to the input). If it finds two
561: or more matches of the same length, the
562: rule listed first in the
563: .I flex
564: input file is chosen.
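The selection rule above (longest match first, then earliest rule) can be sketched as a small C function. This illustrates the rule only and is not how the generated scanner is implemented, since flex compiles all patterns into a single automaton. The name pick_rule() is hypothetical, and the sketch assumes that a length of 0 stands for "this rule does not match".

```c
#include <assert.h>
#include <stddef.h>

/*
 * match_len[i] is the number of characters rule i matches at the current
 * input position (0 meaning "no match"); rules are indexed in the order
 * they appear in the input file. Returns the index of the chosen rule,
 * or -1 if no rule matches.
 */
int pick_rule(const size_t *match_len, int nrules)
{
    int best = -1;

    assert(match_len != NULL || nrules == 0);

    for (int i = 0; i < nrules; i++) {
        if (match_len[i] == 0)
            continue;
        /* Strictly longer wins; on a tie the earlier rule is kept. */
        if (best < 0 || match_len[i] > match_len[best])
            best = i;
    }
    return best;
}
```

For the toy Pascal scanner shown earlier, the keyword rule and the {ID} rule both match two characters of "if", so the keyword rule wins because it is listed first. On "iffy" the {ID} rule matches four characters and wins on length.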
565: .PP
566: Once the match is determined, the text corresponding to the match
567: (called the
568: .I token)
569: is made available in the global character pointer
570: .B yytext,
571: and its length in the global integer
572: .B yyleng.
573: The
574: .I action
575: corresponding to the matched pattern is then executed (a more
576: detailed description of actions follows), and then the remaining
577: input is scanned for another match.
578: .PP
579: If no match is found, then the
580: .I default rule
581: is executed: the next character in the input is considered matched and
582: copied to the standard output. Thus, the simplest legal
583: .I flex
584: input is:
585: .nf
586:
587: %%
588:
589: .fi
590: which generates a scanner that simply copies its input (one character
591: at a time) to its output.
592: .PP
593: Note that
594: .B yytext
595: can be defined in two different ways: either as a character
596: .I pointer
597: or as a character
598: .I array.
599: You can control which definition
600: .I flex
601: uses by including one of the special directives
602: .B %pointer
603: or
604: .B %array
605: in the first (definitions) section of your flex input. The default is
606: .B %pointer,
607: unless you use the
608: .B -l
609: lex compatibility option, in which case
610: .B yytext
611: will be an array.
612: The advantage of using
613: .B %pointer
614: is substantially faster scanning and no buffer overflow when matching
615: very large tokens (unless you run out of dynamic memory). The disadvantage
616: is that you are restricted in how your actions can modify
617: .B yytext
618: (see the next section), and calls to the
619: .B unput()
620: function destroy the present contents of
621: .B yytext,
622: which can be a considerable porting headache when moving between different
623: .I lex
624: versions.
625: .PP
626: The advantage of
627: .B %array
628: is that you can then modify
629: .B yytext
630: to your heart's content, and calls to
631: .B unput()
632: do not destroy
633: .B yytext
634: (see below). Furthermore, existing
635: .I lex
636: programs sometimes access
637: .B yytext
638: externally using declarations of the form:
639: .nf
640: extern char yytext[];
641: .fi
642: This definition is erroneous when used with
643: .B %pointer,
644: but correct for
645: .B %array.
646: .PP
647: .B %array
648: defines
649: .B yytext
650: to be an array of
651: .B YYLMAX
652: characters, which defaults to a fairly large value. You can change
653: the size by simply #define'ing
654: .B YYLMAX
655: to a different value in the first section of your
656: .I flex
657: input. As mentioned above, with
658: .B %pointer
659: yytext grows dynamically to accommodate large tokens. While this means your
660: .B %pointer
661: scanner can accommodate very large tokens (such as matching entire blocks
662: of comments), bear in mind that each time the scanner must resize
663: .B yytext
664: it also must rescan the entire token from the beginning, so matching such
665: tokens can prove slow.
666: .B yytext
667: presently does
668: .I not
669: dynamically grow if a call to
670: .B unput()
671: results in too much text being pushed back; instead, a run-time error results.
672: .PP
673: Also note that you cannot use
674: .B %array
675: with C++ scanner classes
676: (the
677: .B c++
678: option; see below).
679: .SH ACTIONS
680: Each pattern in a rule has a corresponding action, which can be any
681: arbitrary C statement. The pattern ends at the first non-escaped
682: whitespace character; the remainder of the line is its action. If the
683: action is empty, then when the pattern is matched the input token
684: is simply discarded. For example, here is the specification for a program
685: which deletes all occurrences of "zap me" from its input:
686: .nf
687:
688: %%
689: "zap me"
690:
691: .fi
692: (It will copy all other characters in the input to the output since
693: they will be matched by the default rule.)
694: .PP
695: Here is a program which compresses multiple blanks and tabs down to
696: a single blank, and throws away whitespace found at the end of a line:
697: .nf
698:
699: %%
700: [ \\t]+ putchar( ' ' );
701: [ \\t]+$ /* ignore this token */
702:
703: .fi
704: .PP
705: If the action contains a '{', then the action spans till the balancing '}'
706: is found, and the action may cross multiple lines.
707: .I flex
708: knows about C strings and comments and won't be fooled by braces found
709: within them, but also allows actions to begin with
710: .B %{
711: and will consider the action to be all the text up to the next
712: .B %}
713: (regardless of ordinary braces inside the action).
714: .PP
715: An action consisting solely of a vertical bar ('|') means "same as
716: the action for the next rule." See below for an illustration.
717: .PP
718: Actions can include arbitrary C code, including
719: .B return
720: statements to return a value to whatever routine called
721: .B yylex().
722: Each time
723: .B yylex()
724: is called it continues processing tokens from where it last left
725: off until it either reaches
726: the end of the file or executes a return.
727: .PP
728: Actions are free to modify
729: .B yytext
730: except for lengthening it (adding
731: characters to its end--these will overwrite later characters in the
732: input stream). This however does not apply when using
733: .B %array
734: (see above); in that case,
735: .B yytext
736: may be freely modified in any way.
737: .PP
738: Actions are free to modify
739: .B yyleng
740: except they should not do so if the action also includes use of
741: .B yymore()
742: (see below).
743: .PP
744: There are a number of special directives which can be included within
745: an action:
746: .IP -
747: .B ECHO
748: copies yytext to the scanner's output.
749: .IP -
750: .B BEGIN
751: followed by the name of a start condition places the scanner in the
752: corresponding start condition (see below).
753: .IP -
754: .B REJECT
755: directs the scanner to proceed on to the "second best" rule which matched the
756: input (or a prefix of the input). The rule is chosen as described
757: above in "How the Input is Matched", and
758: .B yytext
759: and
760: .B yyleng
761: set up appropriately.
762: It may either be one which matched as much text
763: as the originally chosen rule but came later in the
764: .I flex
765: input file, or one which matched less text.
766: For example, the following will both count the
767: words in the input and call the routine special() whenever "frob" is seen:
768: .nf
769:
770: int word_count = 0;
771: %%
772:
773: frob special(); REJECT;
774: [^ \\t\\n]+ ++word_count;
775:
776: .fi
777: Without the
778: .B REJECT,
779: any "frob"'s in the input would not be counted as words, since the
780: scanner normally executes only one action per token.
781: Multiple
782: .B REJECT's
783: are allowed, each one finding the next best choice to the currently
784: active rule. For example, when the following scanner scans the token
785: "abcd", it will write "abcdabcaba" to the output:
786: .nf
787:
788: %%
789: a |
790: ab |
791: abc |
792: abcd ECHO; REJECT;
793: .|\\n /* eat up any unmatched character */
794:
795: .fi
796: (The first three rules share the fourth's action since they use
797: the special '|' action.)
798: .B REJECT
799: is a particularly expensive feature in terms of scanner performance;
800: if it is used in
801: .I any
802: of the scanner's actions it will slow down
803: .I all
804: of the scanner's matching. Furthermore,
805: .B REJECT
806: cannot be used with the
807: .I -Cf
808: or
809: .I -CF
810: options (see below).
811: .IP
812: Note also that unlike the other special actions,
813: .B REJECT
814: is a
815: .I branch;
816: code immediately following it in the action will
817: .I not
818: be executed.
819: .IP -
820: .B yymore()
821: tells the scanner that the next time it matches a rule, the corresponding
822: token should be
823: .I appended
824: onto the current value of
825: .B yytext
826: rather than replacing it. For example, given the input "mega-kludge"
827: the following will write "mega-mega-kludge" to the output:
828: .nf
829:
830: %%
831: mega- ECHO; yymore();
832: kludge ECHO;
833:
834: .fi
835: First "mega-" is matched and echoed to the output. Then "kludge"
836: is matched, but the previous "mega-" is still hanging around at the
837: beginning of
838: .B yytext
839: so the
840: .B ECHO
841: for the "kludge" rule will actually write "mega-kludge".
842: .PP
843: Two notes regarding use of
844: .B yymore().
845: First,
846: .B yymore()
847: depends on the value of
848: .I yyleng
849: correctly reflecting the size of the current token, so you must not
850: modify
851: .I yyleng
852: if you are using
853: .B yymore().
854: Second, the presence of
855: .B yymore()
856: in the scanner's action entails a minor performance penalty in the
857: scanner's matching speed.
858: .IP -
859: .B yyless(n)
860: returns all but the first
861: .I n
862: characters of the current token back to the input stream, where they
863: will be rescanned when the scanner looks for the next match.
864: .B yytext
865: and
866: .B yyleng
867: are adjusted appropriately (e.g.,
868: .B yyleng
869: will now be equal to
870: .I n
871: ). For example, on the input "foobar" the following will write out
872: "foobarbar":
873: .nf
874:
875: %%
876: foobar ECHO; yyless(3);
877: [a-z]+ ECHO;
878:
879: .fi
880: An argument of 0 to
881: .B yyless
882: will cause the entire current input string to be scanned again. Unless you've
883: changed how the scanner will subsequently process its input (using
884: .B BEGIN,
885: for example), this will result in an endless loop.
886: .PP
887: Note that
888: .B yyless
889: is a macro and can only be used in the flex input file, not from
890: other source files.
891: .IP -
892: .B unput(c)
893: puts the character
894: .I c
895: back onto the input stream. It will be the next character scanned.
896: The following action will take the current token and cause it
897: to be rescanned enclosed in parentheses.
898: .nf
899:
900: {
901: int i;
902: /* Copy yytext because unput() trashes yytext */
903: char *yycopy = strdup( yytext );
904: unput( ')' );
905: for ( i = yyleng - 1; i >= 0; --i )
906: unput( yycopy[i] );
907: unput( '(' );
908: free( yycopy );
909: }
910:
911: .fi
912: Note that since each
913: .B unput()
914: puts the given character back at the
915: .I beginning
916: of the input stream, pushing back strings must be done back-to-front.
917: .PP
918: An important potential problem when using
919: .B unput()
920: is that if you are using
921: .B %pointer
922: (the default), a call to
923: .B unput()
924: .I destroys
925: the contents of
926: .I yytext,
927: starting with its rightmost character and devouring one character to
928: the left with each call. If you need the value of yytext preserved
929: after a call to
930: .B unput()
931: (as in the above example),
932: you must either first copy it elsewhere, or build your scanner using
933: .B %array
934: instead (see How The Input Is Matched).
935: .PP
936: Finally, note that you cannot put back
937: .B EOF
938: to attempt to mark the input stream with an end-of-file.
939: .IP -
940: .B input()
941: reads the next character from the input stream. For example,
942: the following is one way to eat up C comments:
943: .nf
944:
945: %%
946: "/*" {
947: register int c;
948:
949: for ( ; ; )
950: {
951: while ( (c = input()) != '*' &&
952: c != EOF )
953: ; /* eat up text of comment */
954:
955: if ( c == '*' )
956: {
957: while ( (c = input()) == '*' )
958: ;
959: if ( c == '/' )
960: break; /* found the end */
961: }
962:
963: if ( c == EOF )
964: {
965: error( "EOF in comment" );
966: break;
967: }
968: }
969: }
970:
971: .fi
972: (Note that if the scanner is compiled using
973: .B C++,
974: then
975: .B input()
976: is instead referred to as
977: .B yyinput(),
978: in order to avoid a name clash with the
979: .B C++
980: stream by the name of
981: .I input.)
982: .IP -
983: .B YY_FLUSH_BUFFER
984: flushes the scanner's internal buffer
985: so that the next time the scanner attempts to match a token, it will
986: first refill the buffer using
987: .B YY_INPUT
988: (see The Generated Scanner, below). This action is a special case
989: of the more general
990: .B yy_flush_buffer()
991: function, described below in the section Multiple Input Buffers.
992: .IP -
993: .B yyterminate()
994: can be used in lieu of a return statement in an action. It terminates
995: the scanner and returns a 0 to the scanner's caller, indicating "all done".
996: By default,
997: .B yyterminate()
998: is also called when an end-of-file is encountered. It is a macro and
999: may be redefined.
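The back-to-front order required when pushing back a string with unput(), described in the list above, follows from the pushed-back text behaving like a stack. The sketch below models that with a small array in plain C. The struct pushback type and the pb_unput(), pb_input() and pb_unput_parenthesized() functions are hypothetical stand-ins and do not reflect how the scanner's buffer is actually organized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* A stand-in for the scanner's pending input: a last-in, first-out stack. */
struct pushback {
    char data[PUSHBACK_MAX];
    size_t count;
};

/* Like unput(c): c becomes the next character to be read. */
int pb_unput(struct pushback *pb, char c)
{
    assert(pb != NULL);
    if (pb->count >= PUSHBACK_MAX)
        return -1;              /* too much text pushed back */
    pb->data[pb->count++] = c;
    return 0;
}

/* Like input(): returns the next pending character, or -1 if none. */
int pb_input(struct pushback *pb)
{
    assert(pb != NULL);
    if (pb->count == 0)
        return -1;
    return (unsigned char)pb->data[--pb->count];
}

/* Push back "(token)" so that it is read in left-to-right order. */
int pb_unput_parenthesized(struct pushback *pb, const char *token)
{
    size_t i = strlen(token);

    if (pb_unput(pb, ')') != 0)     /* last character goes in first */
        return -1;
    while (i > 0) {
        if (pb_unput(pb, token[--i]) != 0)
            return -1;
    }
    return pb_unput(pb, '(');       /* first character goes in last */
}
```

Reading the stack back with pb_input() after pb_unput_parenthesized(&pb, "abc") yields the characters of "(abc)" in order.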
1000: .SH THE GENERATED SCANNER
1001: The output of
1002: .I flex
1003: is the file
1004: .B lex.yy.c,
1005: which contains the scanning routine
1006: .B yylex(),
1007: a number of tables used by it for matching tokens, and a number
1008: of auxiliary routines and macros. By default,
1009: .B yylex()
1010: is declared as follows:
1011: .nf
1012:
1013: int yylex()
1014: {
1015: ... various definitions and the actions in here ...
1016: }
1017:
1018: .fi
1019: (If your environment supports function prototypes, then it will
1020: be "int yylex( void )".) This definition may be changed by defining
1021: the "YY_DECL" macro. For example, you could use:
1022: .nf
1023:
1024: #define YY_DECL float lexscan( a, b ) float a, b;
1025:
1026: .fi
1027: to give the scanning routine the name
1028: .I lexscan,
1029: returning a float, and taking two floats as arguments. Note that
1030: if you give arguments to the scanning routine using a
1031: K&R-style/non-prototyped function declaration, you must terminate
1032: the definition with a semi-colon (;).
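Where the compiler supports function prototypes, the same declaration can be written in prototype style. The sketch below assumes that the generated file places YY_DECL directly in front of the body of the scanning routine, so that this form needs no trailing semicolon. The function body shown is only a stand-in so that the fragment compiles on its own; a real body is generated by flex.

```c
/*
 * Prototype-style counterpart of the K&R definition: the parameter
 * types are given inside the parentheses.
 */
#define YY_DECL float lexscan(float a, float b)

/* Stand-in body; in lex.yy.c the scanner's own code follows YY_DECL. */
YY_DECL
{
    return a + b;
}
```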
1033: .PP
1034: Whenever
1035: .B yylex()
1036: is called, it scans tokens from the global input file
1037: .I yyin
1038: (which defaults to stdin). It continues until it either reaches
1039: an end-of-file (at which point it returns the value 0) or
1040: one of its actions executes a
1041: .I return
1042: statement.
1043: .PP
1044: If the scanner reaches an end-of-file, the behavior of subsequent calls is undefined
1045: unless either
1046: .I yyin
1047: is pointed at a new input file (in which case scanning continues from
1048: that file), or
1049: .B yyrestart()
1050: is called.
1051: .B yyrestart()
1052: takes one argument, a
1053: .B FILE *
1054: pointer (which can be NULL, if you've set up
1055: .B YY_INPUT
1056: to scan from a source other than
1057: .I yyin),
1058: and initializes
1059: .I yyin
1060: for scanning from that file. Essentially there is no difference between
1061: just assigning
1062: .I yyin
1063: to a new input file or using
1064: .B yyrestart()
1065: to do so; the latter is available for compatibility with previous versions
1066: of
1067: .I flex,
1068: and because it can be used to switch input files in the middle of scanning.
1069: It can also be used to throw away the current input buffer, by calling
1070: it with an argument of
1071: .I yyin;
1072: but it is better to use
1073: .B YY_FLUSH_BUFFER
1074: (see above).
1075: Note that
1076: .B yyrestart()
1077: does
1078: .I not
1079: reset the start condition to
1080: .B INITIAL
1081: (see Start Conditions, below).
1082: .PP
1083: If
1084: .B yylex()
1085: stops scanning due to executing a
1086: .I return
1087: statement in one of the actions, the scanner may then be called again and it
1088: will resume scanning where it left off.
1089: .PP
1090: By default (and for purposes of efficiency), the scanner uses
1091: block-reads rather than simple
1092: .I getc()
1093: calls to read characters from
1094: .I yyin.
1095: The nature of how it gets its input can be controlled by defining the
1096: .B YY_INPUT
1097: macro.
1098: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
1099: action is to place up to
1100: .I max_size
1101: characters in the character array
1102: .I buf
1103: and return in the integer variable
1104: .I result
1105: either the
1106: number of characters read or the constant YY_NULL (0 on Unix systems)
1107: to indicate EOF. The default YY_INPUT reads from the
1108: global file-pointer "yyin".
1109: .PP
1110: A sample definition of YY_INPUT (in the definitions
1111: section of the input file):
1112: .nf
1113:
1114: %{
1115: #define YY_INPUT(buf,result,max_size) \\
1116: { \\
1117: int c = getchar(); \\
1118: result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
1119: }
1120: %}
1121:
1122: .fi
1123: This definition will change the input processing to occur
1124: one character at a time.
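.PP
The same calling sequence can deliver more than one character per call.
The following is a sketch only, written as stand-alone C so it can be tried
without a scanner. It assumes a hypothetical in-memory source named "src",
defines YY_NULL itself (a generated scanner already provides it), and adds
a helper, read_chunk(), purely to exercise the macro:

```c
#include <stddef.h>
#include <string.h>

#define YY_NULL 0                   /* stand-in; flex defines this itself */

static const char *src = "hello";   /* hypothetical input source */
static size_t src_pos = 0;          /* how much of src has been handed out */

/* Copy up to max_size characters into buf and report the count,
 * or YY_NULL once the source is exhausted. */
#define YY_INPUT(buf,result,max_size) \
    { \
    size_t left = strlen( src ) - src_pos; \
    size_t n = left < (size_t) (max_size) ? left : (size_t) (max_size); \
    memcpy( (buf), src + src_pos, n ); \
    src_pos += n; \
    (result) = n ? (int) n : YY_NULL; \
    }

/* Helper that invokes the macro the way the scanner would. */
int read_chunk( char *buf, int max_size )
{
    int result;
    YY_INPUT( buf, result, max_size );
    return result;
}
```

Returning as many characters as are available, rather than one, keeps the
number of YY_INPUT invocations low.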
1125: .PP
1126: When the scanner receives an end-of-file indication from YY_INPUT,
1127: it then checks the
1128: .B yywrap()
1129: function. If
1130: .B yywrap()
1131: returns false (zero), then it is assumed that the
1132: function has gone ahead and set up
1133: .I yyin
1134: to point to another input file, and scanning continues. If it returns
1135: true (non-zero), then the scanner terminates, returning 0 to its
1136: caller. Note that in either case, the start condition remains unchanged;
1137: it does
1138: .I not
1139: revert to
1140: .B INITIAL.
1141: .PP
1142: If you do not supply your own version of
1143: .B yywrap(),
1144: then you must either use
1145: .B %option noyywrap
1146: (in which case the scanner behaves as though
1147: .B yywrap()
1148: returned 1), or you must link with
1149: .B \-lfl
1150: to obtain the default version of the routine, which always returns 1.
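.PP
A user-supplied yywrap() that moves on to the next input file might look
roughly like the sketch below. It is stand-alone C under stated
assumptions: "yyin" is defined here only so the fragment compiles by
itself (a generated scanner defines it), and "next_file" and
set_file_list() are hypothetical names for a NULL-terminated list of file
names and the routine that installs it:

```c
#include <stdio.h>

/* Stand-ins so the sketch compiles alone. */
FILE *yyin;
static char **next_file;

void set_file_list( char **names )
{
    next_file = names;
}

/* Return 0 after pointing yyin at the next readable file,
 * or 1 when the list is exhausted. */
int yywrap( void )
{
    while ( next_file && *next_file )
    {
        FILE *f = fopen( *next_file++, "r" );

        if ( f )
        {
            if ( yyin && yyin != stdin )
                fclose( yyin );    /* done with the previous file */
            yyin = f;
            return 0;
        }
    }

    return 1;
}
```

Files that cannot be opened are skipped silently here; a real scanner
would probably want to report them.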
1151: .PP
1152: Three routines are available for scanning from in-memory buffers rather
1153: than files:
1154: .B yy_scan_string(), yy_scan_bytes(),
1155: and
1156: .B yy_scan_buffer().
1157: See the discussion of them below in the section Multiple Input Buffers.
1158: .PP
1159: The scanner writes its
1160: .B ECHO
1161: output to the
1162: .I yyout
1163: global (default, stdout), which may be redefined by the user simply
1164: by assigning it to some other
1165: .B FILE
1166: pointer.
1167: .SH START CONDITIONS
1168: .I flex
1169: provides a mechanism for conditionally activating rules. Any rule
1170: whose pattern is prefixed with "<sc>" will only be active when
1171: the scanner is in the start condition named "sc". For example,
1172: .nf
1173:
1174: <STRING>[^"]* { /* eat up the string body ... */
1175: ...
1176: }
1177:
1178: .fi
1179: will be active only when the scanner is in the "STRING" start
1180: condition, and
1181: .nf
1182:
1183: <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
1184: ...
1185: }
1186:
1187: .fi
1188: will be active only when the current start condition is
1189: either "INITIAL", "STRING", or "QUOTE".
1190: .PP
1191: Start conditions
1192: are declared in the definitions (first) section of the input
1193: using unindented lines beginning with either
1194: .B %s
1195: or
1196: .B %x
1197: followed by a list of names.
1198: The former declares
1199: .I inclusive
1200: start conditions, the latter
1201: .I exclusive
1202: start conditions. A start condition is activated using the
1203: .B BEGIN
1204: action. Until the next
1205: .B BEGIN
1206: action is executed, rules with the given start
1207: condition will be active and
1208: rules with other start conditions will be inactive.
1209: If the start condition is
1210: .I inclusive,
1211: then rules with no start conditions at all will also be active.
1212: If it is
1213: .I exclusive,
1214: then
1215: .I only
1216: rules qualified with the start condition will be active.
1217: A set of rules contingent on the same exclusive start condition
1218: describes a scanner which is independent of any of the other rules in the
1219: .I flex
1220: input. Because of this,
1221: exclusive start conditions make it easy to specify "mini-scanners"
1222: which scan portions of the input that are syntactically different
1223: from the rest (e.g., comments).
1224: .PP
1225: If the distinction between inclusive and exclusive start conditions
1226: is still a little vague, here's a simple example illustrating the
1227: connection between the two. The set of rules:
1228: .nf
1229:
1230: %s example
1231: %%
1232:
1233: <example>foo do_something();
1234:
1235: bar something_else();
1236:
1237: .fi
1238: is equivalent to
1239: .nf
1240:
1241: %x example
1242: %%
1243:
1244: <example>foo do_something();
1245:
1246: <INITIAL,example>bar something_else();
1247:
1248: .fi
1249: Without the
1250: .B <INITIAL,example>
1251: qualifier, the
1252: .I bar
1253: pattern in the second example wouldn't be active (i.e., couldn't match)
1254: when in start condition
1255: .B example.
1256: If we just used
1257: .B <example>
1258: to qualify
1259: .I bar,
1260: though, then it would only be active in
1261: .B example
1262: and not in
1263: .B INITIAL,
1264: while in the first example it's active in both, because in the first
1265: example the
1266: .B example
1267: start condition is an
1268: .I inclusive
1269: .B (%s)
1270: start condition.
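.PP
The activation rule can be restated as a small predicate. This is a
hypothetical model in plain C, not a description of how flex is
implemented: "rule_scs" is assumed to be a bit mask of the start
conditions named in a rule's prefix (0 for a rule with none), and the
names rule_is_active() and EXAMPLE are invented for the illustration.

```c
/* Start conditions are small integers; INITIAL is 0. */
enum { INITIAL = 0, EXAMPLE = 1 };

/* Return non-zero if a rule may match in start condition "current". */
int rule_is_active( unsigned rule_scs, int current, int current_is_exclusive )
{
    if ( rule_scs == 0 )
        /* unqualified rules are active in INITIAL and in
         * inclusive start conditions only */
        return ! current_is_exclusive;

    return ( rule_scs & ( 1u << current ) ) != 0;
}
```

With "example" declared inclusive, the unqualified
.I bar
rule is active there; with it declared exclusive, the rule is inactive
unless its prefix names the condition.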
1271: .PP
1272: Also note that the special start-condition specifier
1273: .B <*>
1274: matches every start condition. Thus, the above example could also
1275: have been written:
1276: .nf
1277:
1278: %x example
1279: %%
1280:
1281: <example>foo do_something();
1282:
1283: <*>bar something_else();
1284:
1285: .fi
1286: .PP
1287: The default rule (to
1288: .B ECHO
1289: any unmatched character) remains active in start conditions. It
1290: is equivalent to:
1291: .nf
1292:
1293: <*>.|\\n ECHO;
1294:
1295: .fi
1296: .PP
1297: .B BEGIN(0)
1298: returns to the original state where only the rules with
1299: no start conditions are active. This state can also be
1300: referred to as the start-condition "INITIAL", so
1301: .B BEGIN(INITIAL)
1302: is equivalent to
1303: .B BEGIN(0).
1304: (The parentheses around the start condition name are not required but
1305: are considered good style.)
1306: .PP
1307: .B BEGIN
1308: actions can also be given as indented code at the beginning
1309: of the rules section. For example, the following will cause
1310: the scanner to enter the "SPECIAL" start condition whenever
1311: .B yylex()
1312: is called and the global variable
1313: .I enter_special
1314: is true:
1315: .nf
1316:
1317: int enter_special;
1318:
1319: %x SPECIAL
1320: %%
1321: if ( enter_special )
1322: BEGIN(SPECIAL);
1323:
1324: <SPECIAL>blahblahblah
1325: ...more rules follow...
1326:
1327: .fi
1328: .PP
1329: To illustrate the uses of start conditions,
1330: here is a scanner which provides two different interpretations
1331: of a string like "123.456". By default it will treat it as
1332: three tokens, the integer "123", a dot ('.'), and the integer "456".
1333: But if the string is preceded earlier in the line by the string
1334: "expect-floats"
1335: it will treat it as a single token, the floating-point number
1336: 123.456:
1337: .nf
1338:
1339: %{
1340: #include <stdlib.h>
1341: %}
1342: %s expect
1343:
1344: %%
1345: expect-floats BEGIN(expect);
1346:
1347: <expect>[0-9]+"."[0-9]+ {
1348: printf( "found a float, = %f\\n",
1349: atof( yytext ) );
1350: }
1351: <expect>\\n {
1352: /* that's the end of the line, so
1353: * we need another "expect-floats"
1354: * before we'll recognize any more
1355: * numbers
1356: */
1357: BEGIN(INITIAL);
1358: }
1359:
1360: [0-9]+ {
1361: printf( "found an integer, = %d\\n",
1362: atoi( yytext ) );
1363: }
1364:
1365: "." printf( "found a dot\\n" );
1366:
1367: .fi
1368: Here is a scanner which recognizes (and discards) C comments while
1369: maintaining a count of the current input line.
1370: .nf
1371:
1372: %x comment
1373: %%
1374: int line_num = 1;
1375:
1376: "/*" BEGIN(comment);
1377:
1378: <comment>[^*\\n]* /* eat anything that's not a '*' */
1379: <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1380: <comment>\\n ++line_num;
1381: <comment>"*"+"/" BEGIN(INITIAL);
1382:
1383: .fi
1384: This scanner goes to a bit of trouble to match as much
1385: text as possible with each rule. In general, when attempting to write
1386: a high-speed scanner, try to match as much as possible in each rule, as
1387: it's a big win.
1388: .PP
1389: Note that start condition names are really integer values and
1390: can be stored as such. Thus, the above could be extended in the
1391: following fashion:
1392: .nf
1393:
1394: %x comment foo
1395: %%
1396: int line_num = 1;
1397: int comment_caller;
1398:
1399: "/*" {
1400: comment_caller = INITIAL;
1401: BEGIN(comment);
1402: }
1403:
1404: ...
1405:
1406: <foo>"/*" {
1407: comment_caller = foo;
1408: BEGIN(comment);
1409: }
1410:
1411: <comment>[^*\\n]* /* eat anything that's not a '*' */
1412: <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1413: <comment>\\n ++line_num;
1414: <comment>"*"+"/" BEGIN(comment_caller);
1415:
1416: .fi
1417: Furthermore, you can access the current start condition using
1418: the integer-valued
1419: .B YY_START
1420: macro. For example, the above assignments to
1421: .I comment_caller
1422: could instead be written
1423: .nf
1424:
1425: comment_caller = YY_START;
1426:
1427: .fi
1428: Flex provides
1429: .B YYSTATE
1430: as an alias for
1431: .B YY_START
1432: (since that is what's used by AT&T
1433: .I lex).
1434: .PP
1435: Note that start conditions do not have their own name-space; %s's and %x's
1436: declare names in the same fashion as #define's.
1437: .PP
1438: Finally, here's an example of how to match C-style quoted strings using
1439: exclusive start conditions, including expanded escape sequences (but
1440: not including checking for a string that's too long):
1441: .nf
1442:
1443: %x str
1444:
1445: %%
1446: char string_buf[MAX_STR_CONST];
1447: char *string_buf_ptr;
1448:
1449:
1450: \\" string_buf_ptr = string_buf; BEGIN(str);
1451:
1452: <str>\\" { /* saw closing quote - all done */
1453: BEGIN(INITIAL);
1454: *string_buf_ptr = '\\0';
1455: /* return string constant token type and
1456: * value to parser
1457: */
1458: }
1459:
1460: <str>\\n {
1461: /* error - unterminated string constant */
1462: /* generate error message */
1463: }
1464:
1465: <str>\\\\[0-7]{1,3} {
1466: /* octal escape sequence */
1467: unsigned int result;
1468:
1469: (void) sscanf( yytext + 1, "%o", &result );
1470:
1471: if ( result > 0xff )
1472: { /* error, constant is out-of-bounds */ }
1473:
1474: *string_buf_ptr++ = result;
1475: }
1476:
1477: <str>\\\\[0-9]+ {
1478: /* generate error - bad escape sequence; something
1479: * like '\\48' or '\\0777777'
1480: */
1481: }
1482:
1483: <str>\\\\n *string_buf_ptr++ = '\\n';
1484: <str>\\\\t *string_buf_ptr++ = '\\t';
1485: <str>\\\\r *string_buf_ptr++ = '\\r';
1486: <str>\\\\b *string_buf_ptr++ = '\\b';
1487: <str>\\\\f *string_buf_ptr++ = '\\f';
1488:
1489: <str>\\\\(.|\\n) *string_buf_ptr++ = yytext[1];
1490:
1491: <str>[^\\\\\\n\\"]+ {
1492: char *yptr = yytext;
1493:
1494: while ( *yptr )
1495: *string_buf_ptr++ = *yptr++;
1496: }
1497:
1498: .fi
1499: .PP
1500: Often, such as in some of the examples above, you wind up writing a
1501: whole bunch of rules all preceded by the same start condition(s). Flex
1502: makes this a little easier and cleaner by introducing a notion of
1503: start condition
1504: .I scope.
1505: A start condition scope is begun with:
1506: .nf
1507:
1508: <SCs>{
1509:
1510: .fi
1511: where
1512: .I SCs
1513: is a list of one or more start conditions. Inside the start condition
1514: scope, every rule automatically has the prefix
1515: .I <SCs>
1516: applied to it, until a
1517: .I '}'
1518: which matches the initial
1519: .I '{'.
1520: So, for example,
1521: .nf
1522:
1523: <ESC>{
1524: "\\\\n" return '\\n';
1525: "\\\\r" return '\\r';
1526: "\\\\f" return '\\f';
1527: "\\\\0" return '\\0';
1528: }
1529:
1530: .fi
1531: is equivalent to:
1532: .nf
1533:
1534: <ESC>"\\\\n" return '\\n';
1535: <ESC>"\\\\r" return '\\r';
1536: <ESC>"\\\\f" return '\\f';
1537: <ESC>"\\\\0" return '\\0';
1538:
1539: .fi
1540: Start condition scopes may be nested.
1541: .PP
1542: Three routines are available for manipulating stacks of start conditions:
1543: .TP
1544: .B void yy_push_state(int new_state)
1545: pushes the current start condition onto the top of the start condition
1546: stack and switches to
1547: .I new_state
1548: as though you had used
1549: .B BEGIN new_state
1550: (recall that start condition names are also integers).
1551: .TP
1552: .B void yy_pop_state()
1553: pops the top of the stack and switches to it via
1554: .B BEGIN.
1555: .TP
1556: .B int yy_top_state()
1557: returns the top of the stack without altering the stack's contents.
1558: .PP
1559: The start condition stack grows dynamically and so has no built-in
1560: size limitation. If memory is exhausted, program execution aborts.
1561: .PP
1562: To use start condition stacks, your scanner must include a
1563: .B %option stack
1564: directive (see Options below).
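.PP
The behaviour of the three routines can be modelled in a few lines of
plain C. The sketch below is an assumption-laden illustration, not the
code flex generates: the names sc_push(), sc_pop(), sc_top() and
sc_current() are hypothetical, and state 0 plays the role of INITIAL.

```c
#include <assert.h>
#include <stdlib.h>

static int *sc_stack;            /* saved start conditions */
static size_t sc_depth, sc_cap;  /* entries in use, entries allocated */
static int sc_current_state;     /* 0 plays the role of INITIAL */

/* Save the current state, then switch to new_state. */
void sc_push( int new_state )
{
    if ( sc_depth == sc_cap )
    {
        /* grow on demand, so there is no fixed depth limit */
        size_t new_cap = sc_cap ? sc_cap * 2 : 8;
        int *p = realloc( sc_stack, new_cap * sizeof *p );

        if ( ! p )
            abort();             /* memory exhausted */
        sc_stack = p;
        sc_cap = new_cap;
    }

    sc_stack[sc_depth++] = sc_current_state;
    sc_current_state = new_state;
}

/* Switch back to the most recently saved state. */
void sc_pop( void )
{
    assert( sc_depth > 0 );
    sc_current_state = sc_stack[--sc_depth];
}

/* Peek at the saved state without changing anything. */
int sc_top( void )
{
    assert( sc_depth > 0 );
    return sc_stack[sc_depth - 1];
}

int sc_current( void )
{
    return sc_current_state;
}
```

Note that the value on top of the stack is the state that was current
before the push, not the state the scanner is in now.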
1565: .SH MULTIPLE INPUT BUFFERS
1566: Some scanners (such as those which support "include" files)
1567: require reading from several input streams. As
1568: .I flex
1569: scanners do a large amount of buffering, one cannot control
1570: where the next input will be read from by simply writing a
1571: .B YY_INPUT
1572: which is sensitive to the scanning context.
1573: .B YY_INPUT
1574: is only called when the scanner reaches the end of its buffer, which
1575: may be a long time after scanning a statement such as an "include"
1576: which requires switching the input source.
1577: .PP
1578: To negotiate these sorts of problems,
1579: .I flex
1580: provides a mechanism for creating and switching between multiple
1581: input buffers. An input buffer is created by using:
1582: .nf
1583:
1584: YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1585:
1586: .fi
1587: which takes a
1588: .I FILE
1589: pointer and a size and creates a buffer associated with the given
1590: file and large enough to hold
1591: .I size
1592: characters (when in doubt, use
1593: .B YY_BUF_SIZE
1594: for the size). It returns a
1595: .B YY_BUFFER_STATE
1596: handle, which may then be passed to other routines (see below). The
1597: .B YY_BUFFER_STATE
1598: type is a pointer to an opaque
1599: .B struct yy_buffer_state
1600: structure, so you may safely initialize YY_BUFFER_STATE variables to
1601: .B ((YY_BUFFER_STATE) 0)
1602: if you wish, and also refer to the opaque structure in order to
1603: correctly declare input buffers in source files other than that
1604: of your scanner. Note that the
1605: .I FILE
1606: pointer in the call to
1607: .B yy_create_buffer
1608: is only used as the value of
1609: .I yyin
1610: seen by
1611: .B YY_INPUT;
1612: if you redefine
1613: .B YY_INPUT
1614: so it no longer uses
1615: .I yyin,
1616: then you can safely pass a nil
1617: .I FILE
1618: pointer to
1619: .B yy_create_buffer.
1620: You select a particular buffer to scan from using:
1621: .nf
1622:
1623: void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1624:
1625: .fi
1626: This switches the scanner's input buffer so subsequent tokens will
1627: come from
1628: .I new_buffer.
1629: Note that
1630: .B yy_switch_to_buffer()
1631: may be used by yywrap() to set things up for continued scanning, instead
1632: of opening a new file and pointing
1633: .I yyin
1634: at it. Note also that switching input sources via either
1635: .B yy_switch_to_buffer()
1636: or
1637: .B yywrap()
1638: does
1639: .I not
1640: change the start condition.
1641: .nf
1642:
1643: void yy_delete_buffer( YY_BUFFER_STATE buffer )
1644:
1645: .fi
1646: is used to reclaim the storage associated with a buffer. (
1647: .B buffer
1648: can be nil, in which case the routine does nothing.)
1649: You can also clear the current contents of a buffer using:
1650: .nf
1651:
1652: void yy_flush_buffer( YY_BUFFER_STATE buffer )
1653:
1654: .fi
1655: This function discards the buffer's contents,
1656: so the next time the scanner attempts to match a token from the
1657: buffer, it will first fill the buffer anew using
1658: .B YY_INPUT.
1659: .PP
1660: .B yy_new_buffer()
1661: is an alias for
1662: .B yy_create_buffer(),
1663: provided for compatibility with the C++ use of
1664: .I new
1665: and
1666: .I delete
1667: for creating and destroying dynamic objects.
1668: .PP
1669: Finally, the
1670: .B YY_CURRENT_BUFFER
1671: macro returns a
1672: .B YY_BUFFER_STATE
1673: handle to the current buffer.
1674: .PP
1675: Here is an example of using these features for writing a scanner
1676: which expands include files (the
1677: .B <<EOF>>
1678: feature is discussed below):
1679: .nf
1680:
1681: /* the "incl" state is used for picking up the name
1682: * of an include file
1683: */
1684: %x incl
1685:
1686: %{
1687: #define MAX_INCLUDE_DEPTH 10
1688: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1689: int include_stack_ptr = 0;
1690: %}
1691:
1692: %%
1693: include BEGIN(incl);
1694:
1695: [a-z]+ ECHO;
1696: [^a-z\\n]*\\n? ECHO;
1697:
1698: <incl>[ \\t]* /* eat the whitespace */
1699: <incl>[^ \\t\\n]+ { /* got the include file name */
1700: if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1701: {
1702: fprintf( stderr, "Includes nested too deeply\\n" );
1703: exit( 1 );
1704: }
1705:
1706: include_stack[include_stack_ptr++] =
1707: YY_CURRENT_BUFFER;
1708:
1709: yyin = fopen( yytext, "r" );
1710:
1711: if ( ! yyin )
1712: error( ... );
1713:
1714: yy_switch_to_buffer(
1715: yy_create_buffer( yyin, YY_BUF_SIZE ) );
1716:
1717: BEGIN(INITIAL);
1718: }
1719:
1720: <<EOF>> {
1721: if ( --include_stack_ptr < 0 )
1722: {
1723: yyterminate();
1724: }
1725:
1726: else
1727: {
1728: yy_delete_buffer( YY_CURRENT_BUFFER );
1729: yy_switch_to_buffer(
1730: include_stack[include_stack_ptr] );
1731: }
1732: }
1733:
1734: .fi
1735: Three routines are available for setting up input buffers for
1736: scanning in-memory strings instead of files. All of them create
1737: a new input buffer for scanning the string, and return a corresponding
1738: .B YY_BUFFER_STATE
1739: handle (which you should delete with
1740: .B yy_delete_buffer()
1741: when done with it). They also switch to the new buffer using
1742: .B yy_switch_to_buffer(),
1743: so the next call to
1744: .B yylex()
1745: will start scanning the string.
1746: .TP
1747: .B yy_scan_string(const char *str)
1748: scans a NUL-terminated string.
1749: .TP
1750: .B yy_scan_bytes(const char *bytes, int len)
1751: scans
1752: .I len
1753: bytes (including possibly NUL's)
1754: starting at location
1755: .I bytes.
1756: .PP
1757: Note that both of these functions create and scan a
1758: .I copy
1759: of the string or bytes. (This may be desirable, since
1760: .B yylex()
1761: modifies the contents of the buffer it is scanning.) You can avoid the
1762: copy by using:
1763: .TP
1764: .B yy_scan_buffer(char *base, yy_size_t size)
1765: which scans in place the buffer starting at
1766: .I base,
1767: consisting of
1768: .I size
1769: bytes, the last two bytes of which
1770: .I must
1771: be
1772: .B YY_END_OF_BUFFER_CHAR
1773: (ASCII NUL).
1774: These last two bytes are not scanned; thus, scanning
1775: consists of
1776: .B base[0]
1777: through
1778: .B base[size-3],
1779: inclusive.
1780: .IP
1781: If you fail to set up
1782: .I base
1783: in this manner (i.e., forget the final two
1784: .B YY_END_OF_BUFFER_CHAR
1785: bytes), then
1786: .B yy_scan_buffer()
1787: returns a nil pointer instead of creating a new input buffer.
1788: .IP
1789: The type
1790: .B yy_size_t
1791: is an integral type to which you can cast an integer expression
1792: reflecting the size of the buffer.
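.PP
Preparing a buffer in the required layout might look like the following
sketch. It is plain C and does not call flex; make_scan_buffer() is a
hypothetical helper name, and the NUL bytes are written as 0 on the
assumption stated above that YY_END_OF_BUFFER_CHAR is ASCII NUL.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of text into fresh storage followed by the two
 * terminating NUL bytes. On success, *size_out receives the total
 * size (len + 2) and the caller owns the returned storage. */
char *make_scan_buffer( const char *text, size_t len, size_t *size_out )
{
    char *base = malloc( len + 2 );

    if ( ! base )
        return NULL;

    memcpy( base, text, len );
    base[len] = 0;          /* first end-of-buffer byte */
    base[len + 1] = 0;      /* second end-of-buffer byte */
    *size_out = len + 2;

    return base;
}
```

The returned pointer and size would then be passed as the
.I base
and
.I size
arguments, and the storage freed only after the buffer has been deleted.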
1793: .SH END-OF-FILE RULES
1794: The special rule "<<EOF>>" indicates
1795: actions which are to be taken when an end-of-file is
1796: encountered and yywrap() returns non-zero (i.e., indicates
1797: no further files to process). The action must finish
1798: by doing one of four things:
1799: .IP -
1800: assigning
1801: .I yyin
1802: to a new input file (in previous versions of flex, after doing the
1803: assignment you had to call the special action
1804: .B YY_NEW_FILE;
1805: this is no longer necessary);
1806: .IP -
1807: executing a
1808: .I return
1809: statement;
1810: .IP -
1811: executing the special
1812: .B yyterminate()
1813: action;
1814: .IP -
1815: or, switching to a new buffer using
1816: .B yy_switch_to_buffer()
1817: as shown in the example above.
1818: .PP
1819: <<EOF>> rules may not be used with other
1820: patterns; they may only be qualified with a list of start
1821: conditions. If an unqualified <<EOF>> rule is given, it
1822: applies to
1823: .I all
1824: start conditions which do not already have <<EOF>> actions. To
1825: specify an <<EOF>> rule for only the initial start condition, use
1826: .nf
1827:
1828: <INITIAL><<EOF>>
1829:
1830: .fi
1831: .PP
1832: These rules are useful for catching things like unclosed comments.
1833: An example:
1834: .nf
1835:
1836: %x quote
1837: %%
1838:
1839: ...other rules for dealing with quotes...
1840:
1841: <quote><<EOF>> {
1842: error( "unterminated quote" );
1843: yyterminate();
1844: }
1845: <<EOF>> {
1846: if ( *++filelist )
1847: yyin = fopen( *filelist, "r" );
1848: else
1849: yyterminate();
1850: }
1851:
1852: .fi
1853: .SH MISCELLANEOUS MACROS
1854: The macro
1855: .B YY_USER_ACTION
1856: can be defined to provide an action
1857: which is always executed prior to the matched rule's action. For example,
1858: it could be #define'd to call a routine to convert yytext to lower-case.
1859: When
1860: .B YY_USER_ACTION
1861: is invoked, the variable
1862: .I yy_act
1863: gives the number of the matched rule (rules are numbered starting with 1).
1864: Suppose you want to profile how often each of your rules is matched. The
1865: following would do the trick:
1866: .nf
1867:
1868: #define YY_USER_ACTION ++ctr[yy_act]
1869:
1870: .fi
1871: where
1872: .I ctr
1873: is an array to hold the counts for the different rules. Note that
1874: the macro
1875: .B YY_NUM_RULES
1876: gives the total number of rules (including the default rule, even if
1877: you use
1878: .B \-s),
1879: so a correct declaration for
1880: .I ctr
1881: is:
1882: .nf
1883:
1884: int ctr[YY_NUM_RULES];
1885:
1886: .fi
1887: .PP
1888: The macro
1889: .B YY_USER_INIT
1890: may be defined to provide an action which is always executed before
1891: the first scan (and before the scanner's internal initializations are done).
1892: For example, it could be used to call a routine to read
1893: in a data table or open a logging file.
1894: .PP
1895: The macro
1896: .B yy_set_interactive(is_interactive)
1897: can be used to control whether the current buffer is considered
1898: .I interactive.
1899: An interactive buffer is processed more slowly,
1900: but must be used when the scanner's input source is indeed
1901: interactive to avoid problems due to waiting to fill buffers
1902: (see the discussion of the
1903: .B \-I
1904: flag below). A non-zero value
1905: in the macro invocation marks the buffer as interactive, a zero
1906: value as non-interactive. Note that use of this macro overrides
1907: .B %option always-interactive
1908: or
1909: .B %option never-interactive
1910: (see Options below).
1911: .B yy_set_interactive()
1912: must be invoked prior to beginning to scan the buffer that is
1913: (or is not) to be considered interactive.
1914: .PP
1915: The macro
1916: .B yy_set_bol(at_bol)
1917: can be used to control whether the current buffer's scanning
1918: context for the next token match is done as though at the
1919: beginning of a line. A non-zero macro argument makes rules anchored with
1920: \&'^' active, while a zero argument makes '^' rules inactive.
1921: .PP
1922: The macro
1923: .B YY_AT_BOL()
1924: returns true if the next token scanned from the current buffer
1925: will have '^' rules active, false otherwise.
1926: .PP
1927: In the generated scanner, the actions are all gathered in one large
1928: switch statement and separated using
1929: .B YY_BREAK,
1930: which may be redefined. By default, it is simply a "break", to separate
1931: each rule's action from the following rule's.
1932: Redefining
1933: .B YY_BREAK
1934: allows, for example, C++ users to
1935: #define YY_BREAK to do nothing (while being very careful that every
1936: rule ends with a "break" or a "return"!) to avoid suffering from
1937: unreachable statement warnings where, because a rule's action ends with
1938: "return", the
1939: .B YY_BREAK
1940: is inaccessible.
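.PP
The situation can be reproduced in miniature. The following is a
hypothetical dispatcher in plain C, shaped like the action switch but not
copied from a generated scanner; run_action() and the values it returns
are invented for the illustration.

```c
#define YY_BREAK break;

/* Stand-in for the scanner's switch over the matched rule number. */
int run_action( int yy_act )
{
    int value = -1;

    switch ( yy_act )
    {
    case 1:
        value = 10;
        YY_BREAK        /* needed: keeps rule 1 out of rule 2's action */

    case 2:
        return 20;
        YY_BREAK        /* never reached; some compilers warn here */
    }

    return value;
}
```

Defining YY_BREAK as nothing would silence the warning for rule 2, but
would also let rule 1 fall through into rule 2 unless its action ended
with its own "break".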
1941: .SH VALUES AVAILABLE TO THE USER
1942: This section summarizes the various values available to the user
1943: in the rule actions.
1944: .IP -
1945: .B char *yytext
1946: holds the text of the current token. It may be modified but not lengthened
1947: (you cannot append characters to the end).
1948: .IP
1949: If the special directive
1950: .B %array
1951: appears in the first section of the scanner description, then
1952: .B yytext
1953: is instead declared
1954: .B char yytext[YYLMAX],
1955: where
1956: .B YYLMAX
1957: is a macro definition that you can redefine in the first section
1958: if you don't like the default value (generally 8KB). Using
1959: .B %array
1960: results in somewhat slower scanners, but the value of
1961: .B yytext
1962: becomes immune to calls to
1963: .I input()
1964: and
1965: .I unput(),
1966: which potentially destroy its value when
1967: .B yytext
1968: is a character pointer. The opposite of
1969: .B %array
1970: is
1971: .B %pointer,
1972: which is the default.
1973: .IP
1974: You cannot use
1975: .B %array
1976: when generating C++ scanner classes
1977: (the
1978: .B \-+
1979: flag).
1980: .IP -
1981: .B int yyleng
1982: holds the length of the current token.
1983: .IP -
1984: .B FILE *yyin
1985: is the file which by default
1986: .I flex
1987: reads from. It may be redefined but doing so only makes sense before
1988: scanning begins or after an EOF has been encountered. Changing it in
1989: the midst of scanning will have unexpected results since
1990: .I flex
1991: buffers its input; use
1992: .B yyrestart()
1993: instead.
1994: Once scanning terminates because an end-of-file
1995: has been seen, you can point
1996: .I yyin
1997: at the new input file and then call the scanner again to continue scanning.
1998: .IP -
1999: .B void yyrestart( FILE *new_file )
2000: may be called to point
2001: .I yyin
2002: at the new input file. The switch-over to the new file is immediate
2003: (any previously buffered-up input is lost). Note that calling
2004: .B yyrestart()
2005: with
2006: .I yyin
2007: as an argument thus throws away the current input buffer and continues
2008: scanning the same input file.
2009: .IP -
2010: .B FILE *yyout
2011: is the file to which
2012: .B ECHO
2013: actions are done. It can be reassigned by the user.
2014: .IP -
2015: .B YY_CURRENT_BUFFER
2016: returns a
2017: .B YY_BUFFER_STATE
2018: handle to the current buffer.
2019: .IP -
2020: .B YY_START
2021: returns an integer value corresponding to the current start
2022: condition. You can subsequently use this value with
2023: .B BEGIN
2024: to return to that start condition.
2025: .SH INTERFACING WITH YACC
2026: One of the main uses of
2027: .I flex
2028: is as a companion to the
2029: .I yacc
2030: parser-generator.
2031: .I yacc
2032: parsers expect to call a routine named
2033: .B yylex()
2034: to find the next input token. The routine is supposed to
2035: return the type of the next token as well as putting any associated
2036: value in the global
2037: .B yylval.
2038: To use
2039: .I flex
2040: with
2041: .I yacc,
2042: one specifies the
2043: .B \-d
2044: option to
2045: .I yacc
2046: to instruct it to generate the file
2047: .B y.tab.h
2048: containing definitions of all the
2049: .B %tokens
2050: appearing in the
2051: .I yacc
2052: input. This file is then included in the
2053: .I flex
2054: scanner. For example, if one of the tokens is "TOK_NUMBER",
2055: part of the scanner might look like:
2056: .nf
2057:
2058: %{
2059: #include "y.tab.h"
2060: %}
2061:
2062: %%
2063:
2064: [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
2065:
2066: .fi
2067: .SH OPTIONS
2068: .I flex
2069: has the following options:
2070: .TP
2071: .B \-b
2072: Generate backing-up information to
2073: .I lex.backup.
2074: This is a list of scanner states which require backing up
2075: and the input characters on which they do so. By adding rules one
2076: can remove backing-up states. If
2077: .I all
2078: backing-up states are eliminated and
2079: .B \-Cf
2080: or
2081: .B \-CF
2082: is used, the generated scanner will run faster (see the
2083: .B \-p
2084: flag). Only users who wish to squeeze every last cycle out of their
2085: scanners need worry about this option. (See the section on Performance
2086: Considerations below.)
2087: .TP
2088: .B \-c
2089: is a do-nothing, deprecated option included for POSIX compliance.
2090: .TP
2091: .B \-d
2092: makes the generated scanner run in
2093: .I debug
2094: mode. Whenever a pattern is recognized and the global
2095: .B yy_flex_debug
2096: is non-zero (which is the default),
2097: the scanner will write to
2098: .I stderr
2099: a line of the form:
2100: .nf
2101:
2102: --accepting rule at line 53 ("the matched text")
2103:
2104: .fi
2105: The line number refers to the location of the rule in the file
2106: defining the scanner (i.e., the file that was fed to flex). Messages
2107: are also generated when the scanner backs up, accepts the
2108: default rule, reaches the end of its input buffer (or encounters
2109: a NUL; at this point, the two look the same as far as the scanner's concerned),
2110: or reaches an end-of-file.
2111: .TP
2112: .B \-f
2113: specifies
2114: .I fast scanner.
2115: No table compression is done and stdio is bypassed.
2116: The result is large but fast. This option is equivalent to
2117: .B \-Cfr
2118: (see below).
2119: .TP
2120: .B \-h
2121: generates a "help" summary of
2122: .I flex's
2123: options to
2124: .I stdout
2125: and then exits.
2126: .B \-?
2127: and
2128: .B \-\-help
2129: are synonyms for
2130: .B \-h.
2131: .TP
2132: .B \-i
2133: instructs
2134: .I flex
2135: to generate a
2136: .I case-insensitive
2137: scanner. The case of letters given in the
2138: .I flex
2139: input patterns will
2140: be ignored, and tokens in the input will be matched regardless of case. The
2141: matched text given in
2142: .I yytext
2143: will have the preserved case (i.e., it will not be folded).
2144: .TP
2145: .B \-l
2146: turns on maximum compatibility with the original AT&T
2147: .I lex
2148: implementation. Note that this does not mean
2149: .I full
2150: compatibility. Use of this option costs a considerable amount of
2151: performance, and it cannot be used with the
2152: .B \-+, \-f, \-F, \-Cf,
2153: or
2154: .B \-CF
2155: options. For details on the compatibilities it provides, see the section
2156: "Incompatibilities With Lex And POSIX" below. This option also results
2157: in the name
2158: .B YY_FLEX_LEX_COMPAT
2159: being #define'd in the generated scanner.
2160: .TP
2161: .B \-n
2162: is another do-nothing, deprecated option included only for
2163: POSIX compliance.
2164: .TP
2165: .B \-p
2166: generates a performance report to stderr. The report
2167: consists of comments regarding features of the
2168: .I flex
2169: input file which will cause a serious loss of performance in the resulting
2170: scanner. If you give the flag twice, you will also get comments regarding
2171: features that lead to minor performance losses.
2172: .IP
2173: Note that the use of
2174: .B REJECT,
2175: .B %option yylineno,
2176: and variable trailing context (see the Deficiencies / Bugs section below)
2177: entails a substantial performance penalty; use of
2178: .I yymore(),
2179: the
2180: .B ^
2181: operator,
2182: and the
2183: .B \-I
2184: flag entail minor performance penalties.
2185: .TP
2186: .B \-s
2187: causes the
2188: .I default rule
2189: (that unmatched scanner input is echoed to
2190: .I stdout)
2191: to be suppressed. If the scanner encounters input that does not
2192: match any of its rules, it aborts with an error. This option is
2193: useful for finding holes in a scanner's rule set.
2194: .TP
2195: .B \-t
2196: instructs
2197: .I flex
2198: to write the scanner it generates to standard output instead
2199: of
2200: .B lex.yy.c.
2201: .TP
2202: .B \-v
2203: specifies that
2204: .I flex
2205: should write to
2206: .I stderr
2207: a summary of statistics regarding the scanner it generates.
2208: Most of the statistics are meaningless to the casual
2209: .I flex
2210: user, but the first line identifies the version of
2211: .I flex
2212: (same as reported by
2213: .B \-V),
2214: and the next line the flags used when generating the scanner, including
2215: those that are on by default.
2216: .TP
2217: .B \-w
2218: suppresses warning messages.
2219: .TP
2220: .B \-B
2221: instructs
2222: .I flex
2223: to generate a
2224: .I batch
2225: scanner, the opposite of
2226: .I interactive
2227: scanners generated by
2228: .B \-I
2229: (see below). In general, you use
2230: .B \-B
2231: when you are
2232: .I certain
2233: that your scanner will never be used interactively, and you want to
2234: squeeze a
2235: .I little
2236: more performance out of it. If your goal is instead to squeeze out a
2237: .I lot
2238: more performance, you should be using the
2239: .B \-Cf
2240: or
2241: .B \-CF
2242: options (discussed below), which turn on
2243: .B \-B
2244: automatically anyway.
2245: .TP
2246: .B \-F
2247: specifies that the
2248: .ul
2249: fast
2250: scanner table representation should be used (and stdio
2251: bypassed). This representation is
2252: about as fast as the full table representation
2253: .B (-f),
2254: and for some sets of patterns will be considerably smaller (and for
2255: others, larger). In general, if the pattern set contains both "keywords"
2256: and a catch-all, "identifier" rule, such as in the set:
2257: .nf
2258:
2259: "case" return TOK_CASE;
2260: "switch" return TOK_SWITCH;
2261: ...
2262: "default" return TOK_DEFAULT;
2263: [a-z]+ return TOK_ID;
2264:
2265: .fi
2266: then you're better off using the full table representation. If only
2267: the "identifier" rule is present and you then use a hash table or some such
2268: to detect the keywords, you're better off using
2269: .B -F.
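.IP
As a rough sketch of the second approach, the action of the lone
"identifier" rule can hand the matched text to a lookup routine such as
the one below. The routine name
.B classify_word(),
the token values, and the use of a sorted table searched with
.B bsearch()
in place of a hash table are assumptions made for illustration only;
none of them is provided by
.I flex.
.nf

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would share these with its parser. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name so that bsearch() can be used. */
static const struct keyword keywords[] = {
    { "case", TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch", TOK_SWITCH },
};

/* bsearch() passes the search key first and a table entry second. */
static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct keyword *)entry)->name);
}

/* Return the keyword token for text, or TOK_ID if it is not a keyword. */
int classify_word(const char *text)
{
    const struct keyword *found = bsearch(text, keywords,
        sizeof keywords / sizeof keywords[0], sizeof keywords[0],
        compare_keyword);

    return found != NULL ? found->token : TOK_ID;
}
```

.fi
The identifier rule's action would then be something like
"return classify_word(yytext);".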
2270: .IP
2271: This option is equivalent to
2272: .B \-CFr
2273: (see below). It cannot be used with
2274: .B \-+.
2275: .TP
2276: .B \-I
2277: instructs
2278: .I flex
2279: to generate an
2280: .I interactive
2281: scanner. An interactive scanner is one that only looks ahead to decide
2282: what token has been matched if it absolutely must. It turns out that
2283: always looking one extra character ahead, even if the scanner has already
2284: seen enough text to disambiguate the current token, is a bit faster than
2285: only looking ahead when necessary. But scanners that always look ahead
2286: give dreadful interactive performance; for example, when a user types
2287: a newline, it is not recognized as a newline token until they enter
2288: .I another
2289: token, which often means typing in another whole line.
2290: .IP
2291: .I Flex
2292: scanners default to
2293: .I interactive
2294: unless you use the
2295: .B \-Cf
2296: or
2297: .B \-CF
2298: table-compression options (see below). That's because if you're looking
2299: for high performance you should be using one of these options, so if you
2300: didn't,
2301: .I flex
2302: assumes you'd rather trade off a bit of run-time performance for intuitive
2303: interactive behavior. Note also that you
2304: .I cannot
2305: use
2306: .B \-I
2307: in conjunction with
2308: .B \-Cf
2309: or
2310: .B \-CF.
2311: Thus, this option is not really needed; it is on by default for all those
2312: cases in which it is allowed.
2313: .IP
2314: You can force a scanner to
2315: .I not
2316: be interactive by using
2317: .B \-B
2318: (see above).
2319: .TP
2320: .B \-L
2321: instructs
2322: .I flex
2323: not to generate
2324: .B #line
2325: directives. Without this option,
2326: .I flex
2327: peppers the generated scanner
2328: with #line directives so error messages in the actions will be correctly
2329: located with respect to either the original
2330: .I flex
2331: input file (if the errors are due to code in the input file), or
2332: .B lex.yy.c
2333: (if the errors are
2334: .I flex's
2335: fault -- you should report these sorts of errors to the email address
2336: given below).
2337: .TP
2338: .B \-T
2339: makes
2340: .I flex
2341: run in
2342: .I trace
2343: mode. It will generate a lot of messages to
2344: .I stderr
2345: concerning
2346: the form of the input and the resultant non-deterministic and deterministic
2347: finite automata. This option is mostly for use in maintaining
2348: .I flex.
2349: .TP
2350: .B \-V
2351: prints the version number to
2352: .I stdout
2353: and exits.
2354: .B \-\-version
2355: is a synonym for
2356: .B \-V.
2357: .TP
2358: .B \-7
2359: instructs
2360: .I flex
2361: to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2362: characters in its input. The advantage of using
2363: .B \-7
2364: is that the scanner's tables can be up to half the size of those generated
2365: using the
2366: .B \-8
2367: option (see below). The disadvantage is that such scanners often hang
2368: or crash if their input contains an 8-bit character.
2369: .IP
2370: Note, however, that unless you generate your scanner using the
2371: .B \-Cf
2372: or
2373: .B \-CF
2374: table compression options, use of
2375: .B \-7
2376: will save only a small amount of table space, and make your scanner
2377: considerably less portable.
2378: .I Flex's
2379: default behavior is to generate an 8-bit scanner unless you use the
2380: .B \-Cf
2381: or
2382: .B \-CF
2383: options, in which case
2384: .I flex
2385: defaults to generating 7-bit scanners unless your site was always
2386: configured to generate 8-bit scanners (as will often be the case
2387: with non-USA sites). You can tell whether flex generated a 7-bit
2388: or an 8-bit scanner by inspecting the flag summary in the
2389: .B \-v
2390: output as described above.
2391: .IP
2392: Note that if you use
2393: .B \-Cfe
2394: or
2395: .B \-CFe
2396: (those table compression options, but also using equivalence classes as
2397: discussed below), flex still defaults to generating an 8-bit
2398: scanner, since usually with these compression options full 8-bit tables
2399: are not much more expensive than 7-bit tables.
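.IP
Because a 7-bit scanner may hang or crash on 8-bit input, a program that
cannot rule out such input might screen each buffer before scanning it.
The following is only a minimal sketch of such a check; the function name
.B has_8bit_chars()
is hypothetical and is not something
.I flex
provides.
.nf

```c
#include <stddef.h>

/*
 * Return 1 if any byte in buf[0..len-1] has its high bit set (that is,
 * lies outside the 7-bit range 0-127), and 0 otherwise.
 */
int has_8bit_chars(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return 1;
    return 0;
}
```

.fi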
2400: .TP
2401: .B \-8
2402: instructs
2403: .I flex
2404: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2405: characters. This flag is only needed for scanners generated using
2406: .B \-Cf
2407: or
2408: .B \-CF,
2409: as otherwise flex defaults to generating an 8-bit scanner anyway.
2410: .IP
2411: See the discussion of
2412: .B \-7
2413: above for flex's default behavior and the tradeoffs between 7-bit
2414: and 8-bit scanners.
2415: .TP
2416: .B \-+
2417: specifies that you want flex to generate a C++
2418: scanner class. See the section on Generating C++ Scanners below for
2419: details.
2420: .TP
2421: .B \-C[aefFmr]
2422: controls the degree of table compression and, more generally, trade-offs
2423: between small scanners and fast scanners.
2424: .IP
2425: .B \-Ca
2426: ("align") instructs flex to trade off larger tables in the
2427: generated scanner for faster performance because the elements of
2428: the tables are better aligned for memory access and computation. On some
2429: RISC architectures, fetching and manipulating longwords is more efficient
2430: than with smaller-sized units such as shortwords. This option can
2431: double the size of the tables used by your scanner.
2432: .IP
2433: .B \-Ce
2434: directs
2435: .I flex
2436: to construct
2437: .I equivalence classes,
2438: i.e., sets of characters
2439: which have identical lexical properties (for example, if the only
2440: appearance of digits in the
2441: .I flex
2442: input is in the character class
2443: "[0-9]" then the digits '0', '1', ..., '9' will all be put
2444: in the same equivalence class). Equivalence classes usually give
2445: dramatic reductions in the final table/object file sizes (typically
2446: a factor of 2-5) and are pretty cheap performance-wise (one array
2447: look-up per character scanned).
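.IP
The idea can be illustrated with the following sketch, which groups
characters that belong to exactly the same character classes. It is not
the algorithm
.I flex
itself uses; the names
.B build_equiv_classes()
and
.B add_range(),
and the limit of 32 character classes, are hypothetical.
.nf

```c
#include <assert.h>
#include <stddef.h>

#define NCHARS 256

/* Mark every character from lo through hi as a member of set. */
void add_range(unsigned char set[NCHARS], int lo, int hi)
{
    int c;

    assert(0 <= lo && lo <= hi && hi < NCHARS);
    for (c = lo; c <= hi; c++)
        set[c] = 1;
}

/*
 * sets[i][c] is nonzero when character c belongs to the i-th character
 * class.  Two characters get the same class number in class_of[] exactly
 * when they belong to the same character classes.  Returns the number of
 * equivalence classes found.
 */
int build_equiv_classes(unsigned char sets[][NCHARS], size_t nsets,
                        int class_of[NCHARS])
{
    unsigned long signature[NCHARS];
    int nclasses = 0;
    int c, prev;
    size_t i;

    assert(nsets <= 32);    /* one bit of the signature per set */

    /* Record, as a bit mask, which sets each character belongs to. */
    for (c = 0; c < NCHARS; c++) {
        signature[c] = 0;
        for (i = 0; i < nsets; i++)
            if (sets[i][c])
                signature[c] |= 1UL << i;
    }

    /* Characters with equal signatures share a class number. */
    for (c = 0; c < NCHARS; c++) {
        class_of[c] = -1;
        for (prev = 0; prev < c; prev++) {
            if (signature[prev] == signature[c]) {
                class_of[c] = class_of[prev];
                break;
            }
        }
        if (class_of[c] < 0)
            class_of[c] = nclasses++;
    }
    return nclasses;
}
```

.fi
Given one set for "[0-9]" and one for "[a-z]" on an ASCII system, the ten
digits end up in one class, the lowercase letters in another, and every
remaining character in a third. The
.B class_of
table then plays the part of the single per-character array look-up
mentioned above.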
2448: .IP
2449: .B \-Cf
2450: specifies that the
2451: .I full
2452: scanner tables should be generated -
2453: .I flex
2454: should not compress the
2455: tables by taking advantage of similar transition functions for
2456: different states.
2457: .IP
2458: .B \-CF
2459: specifies that the alternate fast scanner representation (described
2460: above under the
2461: .B \-F
2462: flag)
2463: should be used. This option cannot be used with
2464: .B \-+.
2465: .IP
2466: .B \-Cm
2467: directs
2468: .I flex
2469: to construct
2470: .I meta-equivalence classes,
2471: which are sets of equivalence classes (or characters, if equivalence
2472: classes are not being used) that are commonly used together. Meta-equivalence
2473: classes are often a big win when using compressed tables, but they
2474: have a moderate performance impact (one or two "if" tests and one
2475: array look-up per character scanned).
2476: .IP
2477: .B \-Cr
2478: causes the generated scanner to
2479: .I bypass
2480: use of the standard I/O library (stdio) for input. Instead of calling
2481: .B fread()
2482: or
2483: .B getc(),
2484: the scanner will use the
2485: .B read()
2486: system call, resulting in a performance gain which varies from system
2487: to system, but in general is probably negligible unless you are also using
2488: .B \-Cf
2489: or
2490: .B \-CF.
2491: Using
2492: .B \-Cr
2493: can cause strange behavior if, for example, you read from
2494: .I yyin
2495: using stdio prior to calling the scanner (because the scanner will miss
2496: whatever text your previous reads left in the stdio input buffer).
2497: .IP
2498: .B \-Cr
2499: has no effect if you define
2500: .B YY_INPUT
2501: (see The Generated Scanner above).
2502: .IP
2503: A lone
2504: .B \-C
2505: specifies that the scanner tables should be compressed but neither
2506: equivalence classes nor meta-equivalence classes should be used.
2507: .IP
2508: The options
2509: .B \-Cf
2510: or
2511: .B \-CF
2512: and
2513: .B \-Cm
2514: do not make sense together - there is no opportunity for meta-equivalence
2515: classes if the table is not being compressed. Otherwise the options
2516: may be freely mixed, and are cumulative.
2517: .IP
2518: The default setting is
2519: .B \-Cem,
2520: which specifies that
2521: .I flex
2522: should generate equivalence classes
2523: and meta-equivalence classes. This setting provides the highest
2524: degree of table compression. You can obtain
2525: faster-executing scanners at the cost of larger tables, with
2526: the following ordering generally holding:
2527: .nf
2528:
2529: slowest & smallest
2530: -Cem
2531: -Cm
2532: -Ce
2533: -C
2534: -C{f,F}e
2535: -C{f,F}
2536: -C{f,F}a
2537: fastest & largest
2538:
2539: .fi
2540: Note that scanners with the smallest tables are usually generated and
2541: compiled the quickest, so
2542: during development you will usually want to use the default, maximal
2543: compression.
2544: .IP
2545: .B \-Cfe
2546: is often a good compromise between speed and size for production
2547: scanners.
2548: .TP
2549: .B \-ooutput
2550: directs flex to write the scanner to the file
2551: .B output
2552: instead of
2553: .B lex.yy.c.
2554: If you combine
2555: .B \-o
2556: with the
2557: .B \-t
2558: option, then the scanner is written to
2559: .I stdout
2560: but its
2561: .B #line
2562: directives (see the
2563: .B \-L
2564: option above) refer to the file
2565: .B output.
2566: .TP
2567: .B \-Pprefix
2568: changes the default
2569: .I "yy"
2570: prefix used by
2571: .I flex
1.6 ! aaron 2572: for all globally visible variable and function names to instead be
1.1 deraadt 2573: .I prefix.
2574: For example,
2575: .B \-Pfoo
2576: changes the name of
2577: .B yytext
2578: to
2579: .B footext.
2580: It also changes the name of the default output file from
2581: .B lex.yy.c
2582: to
2583: .B lex.foo.c.
2584: Here are all of the names affected:
2585: .nf
2586:
2587: yy_create_buffer
2588: yy_delete_buffer
2589: yy_flex_debug
2590: yy_init_buffer
2591: yy_flush_buffer
2592: yy_load_buffer_state
2593: yy_switch_to_buffer
2594: yyin
2595: yyleng
2596: yylex
2597: yylineno
2598: yyout
2599: yyrestart
2600: yytext
2601: yywrap
2602:
2603: .fi
2604: (If you are using a C++ scanner, then only
2605: .B yywrap
2606: and
2607: .B yyFlexLexer
2608: are affected.)
2609: Within your scanner itself, you can still refer to the global variables
2610: and functions using either version of their name; but externally, they
2611: have the modified name.
2612: .IP
2613: This option lets you easily link together multiple
2614: .I flex
2615: programs into the same executable. Note, though, that using this
2616: option also renames
2617: .B yywrap(),
2618: so you now
2619: .I must
2620: either
1.6 ! aaron 2621: provide your own (appropriately named) version of the routine for your
1.1 deraadt 2622: scanner, or use
2623: .B %option noyywrap,
2624: as linking with
2625: .B \-lfl
2626: no longer provides one for you by default.
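.IP
For instance, assuming the scanner was generated with
.B \-Pfoo,
a minimal replacement routine might look like the sketch below. It simply
reports that there is no further input; a scanner that needs to continue
with another file would do more work here.
.nf

```c
/* Renamed counterpart of yywrap() for a scanner built with -Pfoo. */
int fooyywrap(void)
{
    return 1;   /* nonzero: no more input files to scan */
}
```

.fi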
2627: .TP
2628: .B \-Sskeleton_file
2629: overrides the default skeleton file from which
2630: .I flex
2631: constructs its scanners. You'll never need this option unless you are doing
2632: .I flex
2633: maintenance or development.
2634: .PP
2635: .I flex
2636: also provides a mechanism for controlling options within the
2637: scanner specification itself, rather than from the flex command-line.
2638: This is done by including
2639: .B %option
2640: directives in the first section of the scanner specification.
2641: You can specify multiple options with a single
2642: .B %option
2643: directive, and multiple directives in the first section of your flex input
2644: file.
2645: .PP
2646: Most options are given simply as names, optionally preceded by the
2647: word "no" (with no intervening whitespace) to negate their meaning.
2648: A number are equivalent to flex flags or their negation:
2649: .nf
2650:
2651: 7bit -7 option
2652: 8bit -8 option
2653: align -Ca option
2654: backup -b option
2655: batch -B option
2656: c++ -+ option
2657:
2658: caseful or
2659: case-sensitive opposite of -i (default)
2660:
2661: case-insensitive or
2662: caseless -i option
2663:
2664: debug -d option
2665: default opposite of -s option
2666: ecs -Ce option
2667: fast -F option
2668: full -f option
2669: interactive -I option
2670: lex-compat -l option
2671: meta-ecs -Cm option
2672: perf-report -p option
2673: read -Cr option
2674: stdout -t option
2675: verbose -v option
2676: warn opposite of -w option
2677: (use "%option nowarn" for -w)
2678:
2679: array equivalent to "%array"
2680: pointer equivalent to "%pointer" (default)
2681:
2682: .fi
2683: Some
2684: .B %option's
2685: provide features otherwise not available:
2686: .TP
2687: .B always-interactive
2688: instructs flex to generate a scanner which always considers its input
2689: "interactive". Normally, on each new input file the scanner calls
2690: .B isatty()
2691: in an attempt to determine whether
2692: the scanner's input source is interactive and thus should be read a
2693: character at a time. When this option is used, however, then no
2694: such call is made.
2695: .TP
2696: .B main
2697: directs flex to provide a default
2698: .B main()
2699: program for the scanner, which simply calls
2700: .B yylex().
2701: This option implies
2702: .B noyywrap
2703: (see below).
2704: .TP
2705: .B never-interactive
2706: instructs flex to generate a scanner which never considers its input
2707: "interactive" (again, no call made to
2708: .B isatty()).
2709: This is the opposite of
2710: .B always-interactive.
2711: .TP
2712: .B stack
2713: enables the use of start condition stacks (see Start Conditions above).
2714: .TP
2715: .B stdinit
2716: if set (i.e.,
2717: .B %option stdinit)
2718: initializes
2719: .I yyin
2720: and
2721: .I yyout
2722: to
2723: .I stdin
2724: and
2725: .I stdout,
2726: instead of the default of
2727: .I nil.
2728: Some existing
2729: .I lex
2730: programs depend on this behavior, even though it is not compliant with
2731: ANSI C, which does not require
2732: .I stdin
2733: and
2734: .I stdout
2735: to be compile-time constants.
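.IP
Since the standard streams need not be compile-time constants, a portable
program assigns them at run time instead of in a static initializer. The
sketch below shows the idea using hypothetical variables
.B in_file
and
.B out_file
standing in for
.I yyin
and
.I yyout;
it is not the code that
.I flex
generates.
.nf

```c
#include <stdio.h>

/* Hypothetical stand-ins for yyin and yyout, starting out as null pointers. */
FILE *in_file = NULL;
FILE *out_file = NULL;

/* Fall back to the standard streams only if nothing was assigned earlier. */
void init_streams(void)
{
    if (in_file == NULL)
        in_file = stdin;
    if (out_file == NULL)
        out_file = stdout;
}
```

.fi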
2736: .TP
2737: .B yylineno
2738: directs
2739: .I flex
2740: to generate a scanner that maintains the number of the current line
2741: read from its input in the global variable
2742: .B yylineno.
2743: This option is implied by
2744: .B %option lex-compat.
2745: .TP
2746: .B yywrap
2747: if unset (i.e.,
2748: .B %option noyywrap),
2749: makes the scanner not call
2750: .B yywrap()
2751: upon an end-of-file, but simply assume that there are no more
2752: files to scan (until the user points
2753: .I yyin
2754: at a new file and calls
2755: .B yylex()
2756: again).
2757: .PP
2758: .I flex
2759: scans your rule actions to determine whether you use the
2760: .B REJECT
2761: or
2762: .B yymore()
2763: features. The
2764: .B reject
2765: and
2766: .B yymore
2767: options are available to override its decision as to whether you use the
2768: features, either by setting them (e.g.,
2769: .B %option reject)
2770: to indicate the feature is indeed used, or
2771: unsetting them to indicate it actually is not used
2772: (e.g.,
2773: .B %option noyymore).
2774: .PP
2775: Three options take string-delimited values, offset with '=':
2776: .nf
2777:
2778: %option outfile="ABC"
2779:
2780: .fi
2781: is equivalent to
2782: .B -oABC,
2783: and
2784: .nf
2785:
2786: %option prefix="XYZ"
2787:
2788: .fi
2789: is equivalent to
2790: .B -PXYZ.
2791: Finally,
2792: .nf
2793:
2794: %option yyclass="foo"
2795:
2796: .fi
2797: only applies when generating a C++ scanner (
2798: .B \-+
2799: option). It informs
2800: .I flex
2801: that you have derived
2802: .B foo
2803: as a subclass of
2804: .B yyFlexLexer,
2805: so
2806: .I flex
2807: will place your actions in the member function
2808: .B foo::yylex()
2809: instead of
2810: .B yyFlexLexer::yylex().
2811: It also generates a
2812: .B yyFlexLexer::yylex()
2813: member function that emits a run-time error (by invoking
2814: .B yyFlexLexer::LexerError())
2815: if called.
2816: See Generating C++ Scanners, below, for additional information.
2817: .PP
2818: A number of options are available for lint purists who want to suppress
2819: the appearance of unneeded routines in the generated scanner. Each of the
2820: following, if unset
2821: (e.g.,
2822: .B %option nounput
2823: ), results in the corresponding routine not appearing in
2824: the generated scanner:
2825: .nf
2826:
2827: input, unput
2828: yy_push_state, yy_pop_state, yy_top_state
2829: yy_scan_buffer, yy_scan_bytes, yy_scan_string
2830:
2831: .fi
2832: (though
2833: .B yy_push_state()
2834: and friends won't appear anyway unless you use
2835: .B %option stack).
2836: .SH PERFORMANCE CONSIDERATIONS
2837: The main design goal of
2838: .I flex
2839: is that it generate high-performance scanners. It has been optimized
2840: for dealing well with large sets of rules. Aside from the effects on
2841: scanner speed of the table compression
2842: .B \-C
2843: options outlined above,
2844: there are a number of options/actions which degrade performance. These
2845: are, from most expensive to least:
2846: .nf
2847:
2848: REJECT
2849: %option yylineno
2850: arbitrary trailing context
2851:
2852: pattern sets that require backing up
2853: %array
2854: %option interactive
2855: %option always-interactive
2856:
2857: '^' beginning-of-line operator
2858: yymore()
2859:
2860: .fi
2861: with the first three all being quite expensive and the last two
2862: being quite cheap. Note also that
2863: .B unput()
2864: is implemented as a routine call that potentially does quite a bit of
2865: work, while
2866: .B yyless()
2867: is a quite-cheap macro; so if just putting back some excess text you
2868: scanned, use
2869: .B yyless().
2870: .PP
2871: .B REJECT
2872: should be avoided at all costs when performance is important.
2873: It is a particularly expensive option.
2874: .PP
2875: Getting rid of backing up is messy and often may be an enormous
2876: amount of work for a complicated scanner. In principle, one begins
2877: by using the
2878: .B \-b
2879: flag to generate a
2880: .I lex.backup
2881: file. For example, on the input
2882: .nf
2883:
2884: %%
2885: foo return TOK_KEYWORD;
2886: foobar return TOK_KEYWORD;
2887:
2888: .fi
2889: the file looks like:
2890: .nf
2891:
2892: State #6 is non-accepting -
2893: associated rule line numbers:
2894: 2 3
2895: out-transitions: [ o ]
2896: jam-transitions: EOF [ \\001-n p-\\177 ]
2897:
2898: State #8 is non-accepting -
2899: associated rule line numbers:
2900: 3
2901: out-transitions: [ a ]
2902: jam-transitions: EOF [ \\001-` b-\\177 ]
2903:
2904: State #9 is non-accepting -
2905: associated rule line numbers:
2906: 3
2907: out-transitions: [ r ]
2908: jam-transitions: EOF [ \\001-q s-\\177 ]
2909:
2910: Compressed tables always back up.
2911:
2912: .fi
2913: The first few lines tell us that there's a scanner state in
2914: which it can make a transition on an 'o' but not on any other
2915: character, and that in that state the currently scanned text does not match
2916: any rule. The state occurs when trying to match the rules found
2917: at lines 2 and 3 in the input file.
2918: If the scanner is in that state and then reads
2919: something other than an 'o', it will have to back up to find
2920: a rule which is matched. With
2921: a bit of headscratching one can see that this must be the
2922: state it's in when it has seen "fo". When this has happened,
2923: if anything other than another 'o' is seen, the scanner will
2924: have to back up to simply match the 'f' (by the default rule).
2925: .PP
2926: The comment regarding State #8 indicates there's a problem
2927: when "foob" has been scanned. Indeed, on any character other
2928: than an 'a', the scanner will have to back up to accept "foo".
2929: Similarly, the comment for State #9 concerns when "fooba" has
2930: been scanned and an 'r' does not follow.
2931: .PP
2932: The final comment reminds us that there's no point going to
2933: all the trouble of removing backing up from the rules unless
2934: we're using
2935: .B \-Cf
2936: or
2937: .B \-CF,
2938: since there's no performance gain doing so with compressed scanners.
2939: .PP
2940: The way to remove the backing up is to add "error" rules:
2941: .nf
2942:
2943: %%
2944: foo return TOK_KEYWORD;
2945: foobar return TOK_KEYWORD;
2946:
2947: fooba |
2948: foob |
2949: fo {
2950: /* false alarm, not really a keyword */
2951: return TOK_ID;
2952: }
2953:
2954: .fi
2955: .PP
2956: Eliminating backing up among a list of keywords can also be
2957: done using a "catch-all" rule:
2958: .nf
2959:
2960: %%
2961: foo return TOK_KEYWORD;
2962: foobar return TOK_KEYWORD;
2963:
2964: [a-z]+ return TOK_ID;
2965:
2966: .fi
2967: This is usually the best solution when appropriate.
2968: .PP
2969: Backing up messages tend to cascade.
2970: With a complicated set of rules it's not uncommon to get hundreds
2971: of messages. If one can decipher them, though, it often
2972: only takes a dozen or so rules to eliminate the backing up, though
2973: it's easy to make a mistake and have an error rule accidentally match
2974: a valid token. (A possible future
2975: .I flex
2976: feature will be to automatically add rules to eliminate backing up.)
2977: .PP
2978: It's important to keep in mind that you gain the benefits of eliminating
2979: backing up only if you eliminate
2980: .I every
2981: instance of backing up. Leaving just one means you gain nothing.
2982: .PP
2983: .I Variable
2984: trailing context (where both the leading and trailing parts do not have
2985: a fixed length) entails almost the same performance loss as
2986: .B REJECT
2987: (i.e., substantial). So when possible a rule like:
2988: .nf
2989:
2990: %%
2991: mouse|rat/(cat|dog) run();
2992:
2993: .fi
2994: is better written:
2995: .nf
2996:
2997: %%
2998: mouse/cat|dog run();
2999: rat/cat|dog run();
3000:
3001: .fi
3002: or as
3003: .nf
3004:
3005: %%
3006: mouse|rat/cat run();
3007: mouse|rat/dog run();
3008:
3009: .fi
3010: Note that here the special '|' action does
3011: .I not
3012: provide any savings, and can even make things worse (see
3013: Deficiencies / Bugs below).
3014: .LP
3015: Another area where the user can increase a scanner's performance
3016: (and one that's easier to implement) arises from the fact that
3017: the longer the tokens matched, the faster the scanner will run.
3018: This is because with long tokens the processing of most input
3019: characters takes place in the (short) inner scanning loop, and
3020: does not often have to go through the additional work of setting up
3021: the scanning environment (e.g.,
3022: .B yytext)
3023: for the action. Recall the scanner for C comments:
3024: .nf
3025:
3026: %x comment
3027: %%
3028: int line_num = 1;
3029:
3030: "/*" BEGIN(comment);
3031:
3032: <comment>[^*\\n]*
3033: <comment>"*"+[^*/\\n]*
3034: <comment>\\n ++line_num;
3035: <comment>"*"+"/" BEGIN(INITIAL);
3036:
3037: .fi
3038: This could be sped up by writing it as:
3039: .nf
3040:
3041: %x comment
3042: %%
3043: int line_num = 1;
3044:
3045: "/*" BEGIN(comment);
3046:
3047: <comment>[^*\\n]*
3048: <comment>[^*\\n]*\\n ++line_num;
3049: <comment>"*"+[^*/\\n]*
3050: <comment>"*"+[^*/\\n]*\\n ++line_num;
3051: <comment>"*"+"/" BEGIN(INITIAL);
3052:
3053: .fi
3054: Now instead of each newline requiring the processing of another
3055: action, recognizing the newlines is "distributed" over the other rules
3056: to keep the matched text as long as possible. Note that
3057: .I adding
3058: rules does
3059: .I not
3060: slow down the scanner! The speed of the scanner is independent
3061: of the number of rules or (modulo the considerations given at the
3062: beginning of this section) how complicated the rules are with
3063: regard to operators such as '*' and '|'.
3064: .PP
3065: A final example in speeding up a scanner: suppose you want to scan
3066: through a file containing identifiers and keywords, one per line
3067: and with no other extraneous characters, and recognize all the
3068: keywords. A natural first approach is:
3069: .nf
3070:
3071: %%
3072: asm |
3073: auto |
3074: break |
3075: ... etc ...
3076: volatile |
3077: while /* it's a keyword */
3078:
3079: .|\\n /* it's not a keyword */
3080:
3081: .fi
3082: To eliminate the backing up, introduce a catch-all rule:
3083: .nf
3084:
3085: %%
3086: asm |
3087: auto |
3088: break |
3089: ... etc ...
3090: volatile |
3091: while /* it's a keyword */
3092:
3093: [a-z]+ |
3094: .|\\n /* it's not a keyword */
3095:
3096: .fi
3097: Now, if it's guaranteed that there's exactly one word per line,
3098: then we can reduce the total number of matches by a half by
3099: merging in the recognition of newlines with that of the other
3100: tokens:
3101: .nf
3102:
3103: %%
3104: asm\\n |
3105: auto\\n |
3106: break\\n |
3107: ... etc ...
3108: volatile\\n |
3109: while\\n /* it's a keyword */
3110:
3111: [a-z]+\\n |
3112: .|\\n /* it's not a keyword */
3113:
3114: .fi
3115: One has to be careful here, as we have now reintroduced backing up
3116: into the scanner. In particular, while
3117: .I we
3118: know that there will never be any characters in the input stream
3119: other than letters or newlines,
3120: .I flex
3121: can't figure this out, and it will plan for possibly needing to back up
3122: when it has scanned a token like "auto" and then the next character
3123: is something other than a newline or a letter. Previously it would
3124: then just match the "auto" rule and be done, but now it has no "auto"
3125: rule, only an "auto\\n" rule. To eliminate the possibility of backing up,
3126: we could either duplicate all rules but without final newlines, or,
3127: since we never expect to encounter such an input and therefore don't
3128: care how it's classified, we can introduce one more catch-all rule, this
3129: one which doesn't include a newline:
3130: .nf
3131:
3132: %%
3133: asm\\n |
3134: auto\\n |
3135: break\\n |
3136: ... etc ...
3137: volatile\\n |
3138: while\\n /* it's a keyword */
3139:
3140: [a-z]+\\n |
3141: [a-z]+ |
3142: .|\\n /* it's not a keyword */
3143:
3144: .fi
3145: Compiled with
3146: .B \-Cf,
3147: this is about as fast as one can get a
3148: .I flex
3149: scanner to go for this particular problem.
3150: .PP
3151: A final note:
3152: .I flex
3153: is slow when matching NUL's, particularly when a token contains
3154: multiple NUL's.
3155: It's best to write rules which match
3156: .I short
3157: amounts of text if it's anticipated that the text will often include NUL's.
3158: .PP
3159: Another final note regarding performance: as mentioned above in the section
3160: How the Input is Matched, dynamically resizing
3161: .B yytext
3162: to accommodate huge tokens is a slow process because it presently requires that
3163: the (huge) token be rescanned from the beginning. Thus if performance is
3164: vital, you should attempt to match "large" quantities of text but not
3165: "huge" quantities, where the cutoff between the two is at about 8K
3166: characters/token.
3167: .SH GENERATING C++ SCANNERS
3168: .I flex
3169: provides two different ways to generate scanners for use with C++. The
3170: first way is to simply compile a scanner generated by
3171: .I flex
3172: using a C++ compiler instead of a C compiler. You should not encounter
3173: any compilation errors (please report any you find to the email address
3174: given in the Author section below). You can then use C++ code in your
3175: rule actions instead of C code. Note that the default input source for
3176: your scanner remains
3177: .I yyin,
3178: and default echoing is still done to
3179: .I yyout.
3180: Both of these remain
3181: .I FILE *
3182: variables and not C++
3183: .I streams.
3184: .PP
3185: You can also use
3186: .I flex
3187: to generate a C++ scanner class, using the
3188: .B \-+
3189: option (or, equivalently,
3190: .B %option c++),
3191: which is automatically specified if the name of the flex
3192: executable ends in a '+', such as
3193: .I flex++.
3194: When using this option, flex defaults to generating the scanner to the file
3195: .B lex.yy.cc
3196: instead of
3197: .B lex.yy.c.
3198: The generated scanner includes the header file
1.5 deraadt 3199: .I g++/FlexLexer.h,
1.1 deraadt 3200: which defines the interface to two C++ classes.
3201: .PP
3202: The first class,
3203: .B FlexLexer,
3204: provides an abstract base class defining the general scanner class
3205: interface. It provides the following member functions:
3206: .TP
3207: .B const char* YYText()
3208: returns the text of the most recently matched token, the equivalent of
3209: .B yytext.
3210: .TP
3211: .B int YYLeng()
3212: returns the length of the most recently matched token, the equivalent of
3213: .B yyleng.
3214: .TP
3215: .B int lineno() const
3216: returns the current input line number
3217: (see
3218: .B %option yylineno),
3219: or
3220: .B 1
3221: if
3222: .B %option yylineno
3223: was not used.
3224: .TP
3225: .B void set_debug( int flag )
3226: sets the debugging flag for the scanner, equivalent to assigning to
3227: .B yy_flex_debug
3228: (see the Options section above). Note that you must build the scanner
3229: using
3230: .B %option debug
3231: to include debugging information in it.
3232: .TP
3233: .B int debug() const
3234: returns the current setting of the debugging flag.
3235: .PP
3236: Also provided are member functions equivalent to
3237: .B yy_switch_to_buffer(),
3238: .B yy_create_buffer()
3239: (though the first argument is an
3240: .B istream*
3241: object pointer and not a
3242: .B FILE*),
3243: .B yy_flush_buffer(),
3244: .B yy_delete_buffer(),
3245: and
3246: .B yyrestart()
3247: (again, the first argument is an
3248: .B istream*
3249: object pointer).
3250: .PP
3251: The second class defined in
1.5 deraadt 3252: .I g++/FlexLexer.h
1.1 deraadt 3253: is
3254: .B yyFlexLexer,
3255: which is derived from
3256: .B FlexLexer.
3257: It defines the following additional member functions:
3258: .TP
3259: .B
3260: yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
3261: constructs a
3262: .B yyFlexLexer
3263: object using the given streams for input and output. If not specified,
3264: the streams default to
3265: .B cin
3266: and
3267: .B cout,
3268: respectively.
3269: .TP
3270: .B virtual int yylex()
3271: performs the same role as
3272: .B yylex()
3273: does for ordinary flex scanners: it scans the input stream, consuming
3274: tokens, until a rule's action returns a value. If you derive a subclass
3275: .B S
3276: from
3277: .B yyFlexLexer
3278: and want to access the member functions and variables of
3279: .B S
3280: inside
3281: .B yylex(),
3282: then you need to use
3283: .B %option yyclass="S"
3284: to inform
3285: .I flex
3286: that you will be using that subclass instead of
3287: .B yyFlexLexer.
3288: In this case, rather than generating
3289: .B yyFlexLexer::yylex(),
3290: .I flex
3291: generates
3292: .B S::yylex()
3293: (and also generates a dummy
3294: .B yyFlexLexer::yylex()
3295: that calls
3296: .B yyFlexLexer::LexerError()
3297: if called).
3298: .TP
3299: .B
3300: virtual void switch_streams(istream* new_in = 0,
3301: .B
3302: ostream* new_out = 0)
3303: reassigns
3304: .B yyin
3305: to
3306: .B new_in
3307: (if non-nil)
3308: and
3309: .B yyout
3310: to
3311: .B new_out
3312: (ditto), deleting the previous input buffer if
3313: .B yyin
3314: is reassigned.
3315: .TP
3316: .B
3317: int yylex( istream* new_in, ostream* new_out = 0 )
3318: first switches the input streams via
3319: .B switch_streams( new_in, new_out )
3320: and then returns the value of
3321: .B yylex().
3322: .PP
3323: In addition,
3324: .B yyFlexLexer
3325: defines the following protected virtual functions which you can redefine
3326: in derived classes to tailor the scanner:
3327: .TP
3328: .B
3329: virtual int LexerInput( char* buf, int max_size )
3330: reads up to
3331: .B max_size
3332: characters into
3333: .B buf
3334: and returns the number of characters read. To indicate end-of-input,
3335: return 0 characters. Note that "interactive" scanners (see the
3336: .B \-B
3337: and
3338: .B \-I
3339: flags) define the macro
3340: .B YY_INTERACTIVE.
3341: If you redefine
3342: .B LexerInput()
3343: and need to take different actions depending on whether or not
3344: the scanner might be scanning an interactive input source, you can
3345: test for the presence of this name via
3346: .B #ifdef.
3347: .TP
3348: .B
3349: virtual void LexerOutput( const char* buf, int size )
3350: writes out
3351: .B size
3352: characters from the buffer
3353: .B buf,
3354: which, while NUL-terminated, may also contain "internal" NUL's if
3355: the scanner's rules can match text with NUL's in them.
3356: .TP
3357: .B
3358: virtual void LexerError( const char* msg )
3359: reports a fatal error message. The default version of this function
3360: writes the message to the stream
3361: .B cerr
3362: and exits.
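.PP
As a minimal sketch of the
.B LexerInput()
contract described above (copy up to
.B max_size
characters, return the count, return 0 at end-of-input), the following
stand-alone class serves characters from an in-memory string. The class name
StringSource is hypothetical. It deliberately does not derive from
.B yyFlexLexer,
and its LexerInput() is public rather than protected virtual, so that the
fragment compiles without FlexLexer.h.
.nf

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a class derived from yyFlexLexer.
// It illustrates only the LexerInput() contract, not the real class.
class StringSource {
public:
    explicit StringSource(const std::string& text) : text_(text), pos_(0) {}

    // Copies up to max_size characters into buf and returns the number
    // copied. Returns 0 once the text is exhausted (end-of-input).
    int LexerInput(char* buf, int max_size) {
        assert(buf != 0 && max_size >= 0);
        const std::size_t left = text_.size() - pos_;
        const std::size_t n =
            std::min(left, static_cast<std::size_t>(max_size));
        std::memcpy(buf, text_.data() + pos_, n);
        pos_ += n;
        return static_cast<int>(n);
    }

private:
    std::string text_;  // the whole input
    std::size_t pos_;   // index of the next unread character
};
```

.fi
In a real scanner the same body would go into a redefinition of
.B LexerInput()
in a class derived from
.B yyFlexLexer.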
3363: .PP
3364: Note that a
3365: .B yyFlexLexer
3366: object contains its
3367: .I entire
3368: scanning state. Thus you can use such objects to create reentrant
3369: scanners. You can instantiate multiple instances of the same
3370: .B yyFlexLexer
3371: class, and you can also combine multiple C++ scanner classes together
3372: in the same program using the
3373: .B \-P
3374: option discussed above.
3375: .PP
3376: Finally, note that the
3377: .B %array
3378: feature is not available to C++ scanner classes; you must use
3379: .B %pointer
3380: (the default).
3381: .PP
3382: Here is an example of a simple C++ scanner:
3383: .nf
3384:
3385: // An example of using the flex C++ scanner class.
3386:
3387: %{
3388: int mylineno = 0;
3389: %}
3390:
3391: string \\"[^\\n"]+\\"
3392:
3393: ws [ \\t]+
3394:
3395: alpha [A-Za-z]
3396: dig [0-9]
3397: name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
3398: num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
3399: num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
3400: number {num1}|{num2}
3401:
3402: %%
3403:
3404: {ws} /* skip blanks and tabs */
3405:
3406: "/*" {
3407: int c;
3408:
3409: while((c = yyinput()) != 0)
3410: {
3411: if(c == '\\n')
3412: ++mylineno;
3413:
3414: else if(c == '*')
3415: {
3416: if((c = yyinput()) == '/')
3417: break;
3418: else
3419: unput(c);
3420: }
3421: }
3422: }
3423:
3424: {number} cout << "number " << YYText() << '\\n';
3425:
3426: \\n mylineno++;
3427:
3428: {name} cout << "name " << YYText() << '\\n';
3429:
3430: {string} cout << "string " << YYText() << '\\n';
3431:
3432: %%
3433:
3434: int main( int /* argc */, char** /* argv */ )
3435: {
3436: FlexLexer* lexer = new yyFlexLexer;
3437: while(lexer->yylex() != 0)
3438: ;
3439: return 0;
3440: }
3441: .fi
3442: If you want to create multiple (different) lexer classes, you use the
3443: .B \-P
3444: flag (or the
3445: .B prefix=
3446: option) to rename each
3447: .B yyFlexLexer
3448: to some other
3449: .B xxFlexLexer.
3450: You then can include
1.5 deraadt 3451: .B <g++/FlexLexer.h>
1.1 deraadt 3452: in your other sources once per lexer class, first renaming
3453: .B yyFlexLexer
3454: as follows:
3455: .nf
3456:
3457: #undef yyFlexLexer
3458: #define yyFlexLexer xxFlexLexer
1.5 deraadt 3459: #include <g++/FlexLexer.h>
1.1 deraadt 3460:
3461: #undef yyFlexLexer
3462: #define yyFlexLexer zzFlexLexer
1.5 deraadt 3463: #include <g++/FlexLexer.h>
1.1 deraadt 3464:
3465: .fi
3466: if, for example, you used
3467: .B %option prefix="xx"
3468: for one of your scanners and
3469: .B %option prefix="zz"
3470: for the other.
3471: .PP
3472: IMPORTANT: the present form of the scanning class is
3473: .I experimental
3474: and may change considerably between major releases.
3475: .SH INCOMPATIBILITIES WITH LEX AND POSIX
3476: .I flex
3477: is a rewrite of the AT&T Unix
3478: .I lex
3479: tool (the two implementations do not share any code, though),
3480: with some extensions and incompatibilities, both of which
3481: are of concern to those who wish to write scanners acceptable
3482: to either implementation. Flex is fully compliant with the POSIX
3483: .I lex
3484: specification, except that when using
3485: .B %pointer
3486: (the default), a call to
3487: .B unput()
3488: destroys the contents of
3489: .B yytext,
3490: which is counter to the POSIX specification.
3491: .PP
3492: In this section we discuss all of the known areas of incompatibility
3493: between flex, AT&T lex, and the POSIX specification.
3494: .PP
3495: .I flex's
3496: .B \-l
3497: option turns on maximum compatibility with the original AT&T
3498: .I lex
3499: implementation, at the cost of a major loss in the generated scanner's
3500: performance. We note below which incompatibilities can be overcome
3501: using the
3502: .B \-l
3503: option.
3504: .PP
3505: .I flex
3506: is fully compatible with
3507: .I lex
3508: with the following exceptions:
3509: .IP -
3510: The undocumented
3511: .I lex
3512: scanner internal variable
3513: .B yylineno
3514: is not supported unless
3515: .B \-l
3516: or
3517: .B %option yylineno
3518: is used.
3519: .IP
3520: .B yylineno
3521: should be maintained on a per-buffer basis, rather than a per-scanner
3522: (single global variable) basis.
3523: .IP
3524: .B yylineno
3525: is not part of the POSIX specification.
3526: .IP -
3527: The
3528: .B input()
3529: routine is not redefinable, though it may be called to read characters
3530: following whatever has been matched by a rule. If
3531: .B input()
3532: encounters an end-of-file the normal
3533: .B yywrap()
3534: processing is done. A ``real'' end-of-file is returned by
3535: .B input()
3536: as
3537: .I EOF.
3538: .IP
3539: Input is instead controlled by defining the
3540: .B YY_INPUT
3541: macro.
3542: .IP
3543: The
3544: .I flex
3545: restriction that
3546: .B input()
3547: cannot be redefined is in accordance with the POSIX specification,
3548: which simply does not specify any way of controlling the
3549: scanner's input other than by making an initial assignment to
3550: .I yyin.
3551: .IP -
3552: The
3553: .B unput()
3554: routine is not redefinable. This restriction is in accordance with POSIX.
3555: .IP -
3556: .I flex
3557: scanners are not as reentrant as
3558: .I lex
3559: scanners. In particular, if you have an interactive scanner and
3560: an interrupt handler which long-jumps out of the scanner, and
3561: the scanner is subsequently called again, you may get the following
3562: message:
3563: .nf
3564:
3565: fatal flex scanner internal error--end of buffer missed
3566:
3567: .fi
3568: To reenter the scanner, first use
3569: .nf
3570:
3571: yyrestart( yyin );
3572:
3573: .fi
3574: Note that this call will throw away any buffered input; usually this
3575: isn't a problem with an interactive scanner.
3576: .IP
3577: Also note that flex C++ scanner classes
3578: .I are
3579: reentrant, so if using C++ is an option for you, you should use
3580: them instead. See "Generating C++ Scanners" above for details.
3581: .IP -
3582: .B output()
3583: is not supported.
3584: Output from the
3585: .B ECHO
3586: macro is done to the file-pointer
3587: .I yyout
3588: (default
3589: .I stdout).
3590: .IP
3591: .B output()
3592: is not part of the POSIX specification.
3593: .IP -
3594: .I lex
3595: does not support exclusive start conditions (%x), though they
3596: are in the POSIX specification.
3597: .IP -
3598: When definitions are expanded,
3599: .I flex
3600: encloses them in parentheses.
3601: With lex, the following:
3602: .nf
3603:
3604: NAME [A-Z][A-Z0-9]*
3605: %%
3606: foo{NAME}? printf( "Found it\\n" );
3607: %%
3608:
3609: .fi
3610: will not match the string "foo" because when the macro
3611: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
3612: and the precedence is such that the '?' is associated with
3613: "[A-Z0-9]*". With
3614: .I flex,
3615: the rule will be expanded to
3616: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
3617: .IP
3618: Note that if the definition begins with
3619: .B ^
3620: or ends with
3621: .B $
3622: then it is
3623: .I not
3624: expanded with parentheses, to allow these operators to appear in
3625: definitions without losing their special meanings. But the
3626: .B <s>, /,
3627: and
3628: .B <<EOF>>
3629: operators cannot be used in a
3630: .I flex
3631: definition.
3632: .IP
3633: Using
3634: .B \-l
3635: results in the
3636: .I lex
3637: behavior of no parentheses around the definition.
3638: .IP
3639: The POSIX specification is that the definition be enclosed in parentheses.
3640: .IP -
3641: Some implementations of
3642: .I lex
3643: allow a rule's action to begin on a separate line, if the rule's pattern
3644: has trailing whitespace:
3645: .nf
3646:
3647: %%
3648: foo|bar<space here>
3649: { foobar_action(); }
3650:
3651: .fi
3652: .I flex
3653: does not support this feature.
3654: .IP -
3655: The
3656: .I lex
3657: .B %r
3658: (generate a Ratfor scanner) option is not supported. It is not part
3659: of the POSIX specification.
3660: .IP -
3661: After a call to
3662: .B unput(),
3663: .I yytext
3664: is undefined until the next token is matched, unless the scanner
3665: was built using
3666: .B %array.
3667: This is not the case with
3668: .I lex
3669: or the POSIX specification. The
3670: .B \-l
3671: option does away with this incompatibility.
3672: .IP -
3673: The precedence of the
3674: .B {}
3675: (numeric range) operator is different.
3676: .I lex
3677: interprets "abc{1,3}" as "match one, two, or
3678: three occurrences of 'abc'", whereas
3679: .I flex
3680: interprets it as "match 'ab'
3681: followed by one, two, or three occurrences of 'c'". The latter is
3682: in agreement with the POSIX specification.
3683: .IP -
3684: The precedence of the
3685: .B ^
3686: operator is different.
3687: .I lex
3688: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
3689: or 'bar' anywhere", whereas
3690: .I flex
3691: interprets it as "match either 'foo' or 'bar' if they come at the beginning
3692: of a line". The latter is in agreement with the POSIX specification.
3693: .IP -
3694: The special table-size declarations such as
3695: .B %a
3696: supported by
3697: .I lex
3698: are not required by
3699: .I flex
3700: scanners;
3701: .I flex
3702: ignores them.
3703: .IP -
3704: The name
3705: .B
3706: FLEX_SCANNER
3707: is #define'd so scanners may be written for use with either
3708: .I flex
3709: or
3710: .I lex.
3711: Scanners also include
3712: .B YY_FLEX_MAJOR_VERSION
3713: and
3714: .B YY_FLEX_MINOR_VERSION
3715: indicating which version of
3716: .I flex
3717: generated the scanner
3718: (for example, for the 2.5 release, these defines would be 2 and 5
3719: respectively).
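.PP
Three of the differences listed above concern operator precedence:
definition expansion, the {} range operator, and the ^ operator. The sketch
below restates each pair of interpretations with explicit grouping and
checks them with the C++ standard library's std::regex (ECMAScript syntax).
It is an illustration only: std::regex is neither
.I lex
nor
.I flex,
and the helper and constant names are assumptions made for this example.
.nf

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole of text matches pattern (std::regex, ECMAScript).
bool whole_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// True if pattern matches somewhere in text.
bool found_in(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

// foo{NAME}? with NAME defined as [A-Z][A-Z0-9]*
const char* const kLexExpansion  = "foo[A-Z]([A-Z0-9]*)?";  // '?' on last factor
const char* const kFlexExpansion = "foo([A-Z][A-Z0-9]*)?";  // definition grouped

// abc{1,3}
const char* const kLexRange  = "(abc){1,3}";  // repeat all of abc
const char* const kFlexRange = "abc{1,3}";    // repeat only the c

// ^foo|bar
const char* const kLexCaret  = "(^foo)|bar";  // bar may appear anywhere
const char* const kFlexCaret = "^(foo|bar)";  // both anchored at line start

// Asserts the behaviours described in the list above.
void check_precedence_examples() {
    assert(!whole_match(kLexExpansion, "foo"));
    assert(whole_match(kFlexExpansion, "foo"));
    assert(whole_match(kLexRange, "abcabc"));
    assert(!whole_match(kFlexRange, "abcabc"));
    assert(whole_match(kFlexRange, "abccc"));
    assert(!whole_match(kLexRange, "abccc"));
    assert(found_in(kLexCaret, "xbar"));
    assert(!found_in(kFlexCaret, "xbar"));
}
```

.fi
Calling check_precedence_examples() exercises all three cases. A scanner
specification that must behave the same under both tools can avoid the
ambiguity by writing the grouping explicitly.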
3720: .PP
3721: The following
3722: .I flex
3723: features are not included in
3724: .I lex
3725: or the POSIX specification:
3726: .nf
3727:
3728: C++ scanners
3729: %option
3730: start condition scopes
3731: start condition stacks
3732: interactive/non-interactive scanners
3733: yy_scan_string() and friends
3734: yyterminate()
3735: yy_set_interactive()
3736: yy_set_bol()
3737: YY_AT_BOL()
3738: <<EOF>>
3739: <*>
3740: YY_DECL
3741: YY_START
3742: YY_USER_ACTION
3743: YY_USER_INIT
3744: #line directives
3745: %{}'s around actions
3746: multiple actions on a line
3747:
3748: .fi
3749: plus almost all of the flex flags.
3750: The last feature in the list refers to the fact that with
3751: .I flex
3752: you can put multiple actions on the same line, separated with
3753: semi-colons, while with
3754: .I lex,
3755: the following
3756: .nf
3757:
3758: foo handle_foo(); ++num_foos_seen;
3759:
3760: .fi
3761: is (rather surprisingly) truncated to
3762: .nf
3763:
3764: foo handle_foo();
3765:
3766: .fi
3767: .I flex
3768: does not truncate the action. Actions that are not enclosed in
3769: braces are simply terminated at the end of the line.
3770: .SH DIAGNOSTICS
3771: .PP
3772: .I warning, rule cannot be matched
3773: indicates that the given rule
3774: cannot be matched because it follows other rules that will
3775: always match the same text as it. For
3776: example, in the following "foo" cannot be matched because it comes after
3777: an identifier "catch-all" rule:
3778: .nf
3779:
3780: [a-z]+ got_identifier();
3781: foo got_foo();
3782:
3783: .fi
3784: Using
3785: .B REJECT
3786: in a scanner suppresses this warning.
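.PP
The reason "foo" can never win in the example above is that the scanner
prefers the longest match and, among rules matching the same length, the
rule listed first. The following sketch simulates that selection with
std::regex. It is not how
.I flex
implements matching, the function name winning_rule is hypothetical, and
std::regex only yields the longest match for simple patterns such as these.
.nf

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the index of the rule that wins at the start of input under
// longest-match, earliest-rule-on-ties selection, or -1 if no rule
// matches at least one character there.
int winning_rule(const std::vector<std::string>& rules,
                 const std::string& input) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const std::regex re(rules[i]);
        std::smatch m;
        // match_continuous: the match must begin at the first character.
        if (std::regex_search(input, m, re,
                              std::regex_constants::match_continuous)) {
            assert(m.position(0) == 0);
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {  // strict '>' keeps the earlier rule on ties
                best = static_cast<int>(i);
                best_len = len;
            }
        }
    }
    return best;
}
```

.fi
With the rules in the order shown above, every input that "foo" matches is
claimed by the identifier rule at the same length. Listing "foo" first lets
it win on the exact text "foo" while longer identifiers still go to the
catch-all rule.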
3787: .PP
3788: .I warning,
3789: .B \-s
3790: .I
3791: option given but default rule can be matched
3792: means that it is possible (perhaps only in a particular start condition)
3793: that the default rule (match any single character) is the only one
3794: that will match a particular input. Since
3795: .B \-s
3796: was given, presumably this is not intended.
3797: .PP
3798: .I reject_used_but_not_detected undefined
3799: or
3800: .I yymore_used_but_not_detected undefined -
3801: These errors can occur at compile time. They indicate that the
3802: scanner uses
3803: .B REJECT
3804: or
3805: .B yymore()
3806: but that
3807: .I flex
3808: failed to notice the fact, meaning that
3809: .I flex
3810: scanned the first two sections looking for occurrences of these actions
3811: and failed to find any, but somehow you snuck some in (via a #include
3812: file, for example). Use
3813: .B %option reject
3814: or
3815: .B %option yymore
3816: to indicate to flex that you really do use these features.
3817: .PP
3818: .I flex scanner jammed -
3819: a scanner compiled with
3820: .B \-s
3821: has encountered an input string which wasn't matched by
3822: any of its rules. This error can also occur due to internal problems.
3823: .PP
3824: .I token too large, exceeds YYLMAX -
3825: your scanner uses
3826: .B %array
3827: and one of its rules matched a string longer than the
3828: .B YYLMAX
3829: constant (8K bytes by default). You can increase the value by
3830: #define'ing
3831: .B YYLMAX
3832: in the definitions section of your
3833: .I flex
3834: input.
3835: .PP
3836: .I scanner requires \-8 flag to
3837: .I use the character 'x' -
3838: Your scanner specification includes recognizing the 8-bit character
3839: .I 'x'
3840: and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
3841: because you used the
3842: .B \-Cf
3843: or
3844: .B \-CF
3845: table compression options. See the discussion of the
3846: .B \-7
3847: flag for details.
3848: .PP
3849: .I flex scanner push-back overflow -
3850: you used
3851: .B unput()
3852: to push back so much text that the scanner's buffer could not hold
3853: both the pushed-back text and the current token in
3854: .B yytext.
3855: Ideally the scanner should dynamically resize the buffer in this case, but at
3856: present it does not.
3857: .PP
3858: .I
3859: input buffer overflow, can't enlarge buffer because scanner uses REJECT -
3860: the scanner was working on matching an extremely large token and needed
3861: to expand the input buffer. This doesn't work with scanners that use
3862: .B
3863: REJECT.
3864: .PP
3865: .I
3866: fatal flex scanner internal error--end of buffer missed -
3867: This can occur in a scanner which is reentered after a long-jump
3868: has jumped out (or over) the scanner's activation frame. Before
3869: reentering the scanner, use:
3870: .nf
3871:
3872: yyrestart( yyin );
3873:
3874: .fi
3875: or, as noted above, switch to using the C++ scanner class.
3876: .PP
3877: .I too many start conditions in <> construct! -
3878: you listed more start conditions in a <> construct than exist (so
3879: you must have listed at least one of them twice).
3880: .SH FILES
3881: .TP
3882: .B \-lfl
3883: library with which scanners must be linked.
3884: .TP
3885: .I lex.yy.c
3886: generated scanner (called
3887: .I lexyy.c
3888: on some systems).
3889: .TP
3890: .I lex.yy.cc
3891: generated C++ scanner class, when using
3892: .B \-+.
3893: .TP
1.5 deraadt 3894: .I <g++/FlexLexer.h>
1.1 deraadt 3895: header file defining the C++ scanner base class,
3896: .B FlexLexer,
3897: and its derived class,
3898: .B yyFlexLexer.
3899: .TP
3900: .I flex.skl
3901: skeleton scanner. This file is only used when building flex, not when
3902: flex executes.
3903: .TP
3904: .I lex.backup
3905: backing-up information for
3906: .B \-b
3907: flag (called
3908: .I lex.bck
3909: on some systems).
3910: .SH DEFICIENCIES / BUGS
3911: .PP
3912: Some trailing context
3913: patterns cannot be properly matched and generate
3914: warning messages ("dangerous trailing context"). These are
3915: patterns where the ending of the
3916: first part of the rule matches the beginning of the second
3917: part, such as "zx*/xy*", where the 'x*' matches the 'x' at
3918: the beginning of the trailing context. (Note that the POSIX draft
3919: states that the text matched by such patterns is undefined.)
3920: .PP
3921: For some trailing context rules, parts which are actually fixed-length are
1.3 deraadt 3922: not recognized as such, leading to the above-mentioned performance loss.
1.1 deraadt 3923: In particular, parts using '|' or {n} (such as "foo{3}") are always
3924: considered variable-length.
3925: .PP
3926: Combining trailing context with the special '|' action can result in
3927: .I fixed
3928: trailing context being turned into the more expensive
3929: .I variable
3930: trailing context. For example, this occurs in the following:
3931: .nf
3932:
3933: %%
3934: abc |
3935: xyz/def
3936:
3937: .fi
3938: .PP
3939: Use of
3940: .B unput()
3941: invalidates yytext and yyleng, unless the
3942: .B %array
3943: directive
3944: or the
3945: .B \-l
3946: option has been used.
3947: .PP
3948: Pattern-matching of NUL's is substantially slower than matching other
3949: characters.
3950: .PP
3951: Dynamic resizing of the input buffer is slow, as it entails rescanning
3952: all the text matched so far by the current (generally huge) token.
3953: .PP
3954: Due to both buffering of input and read-ahead, you cannot intermix
3955: calls to <stdio.h> routines, such as, for example,
3956: .B getchar(),
3957: with
3958: .I flex
3959: rules and expect it to work. Call
3960: .B input()
3961: instead.
3962: .PP
3963: The total number of table entries listed by the
3964: .B \-v
3965: flag excludes the number of table entries needed to determine
3966: what rule has been matched. The number of entries is equal
3967: to the number of DFA states if the scanner does not use
3968: .B REJECT,
3969: and somewhat greater than the number of states if it does.
3970: .PP
3971: .B REJECT
3972: cannot be used with the
3973: .B \-f
3974: or
3975: .B \-F
3976: options.
3977: .PP
3978: The
3979: .I flex
3980: internal algorithms need documentation.
3981: .SH SEE ALSO
3982: .PP
3983: lex(1), yacc(1), sed(1), awk(1).
3984: .PP
3985: John Levine, Tony Mason, and Doug Brown,
3986: .I Lex & Yacc,
3987: O'Reilly and Associates. Be sure to get the 2nd edition.
3988: .PP
3989: M. E. Lesk and E. Schmidt,
3990: .I LEX \- Lexical Analyzer Generator
3991: .PP
3992: Alfred Aho, Ravi Sethi and Jeffrey Ullman,
3993: .I Compilers: Principles, Techniques and Tools,
3994: Addison-Wesley (1986). Describes the pattern-matching techniques used by
3995: .I flex
3996: (deterministic finite automata).
3997: .SH AUTHOR
3998: Vern Paxson, with the help of many ideas and much inspiration from
3999: Van Jacobson. Original version by Jef Poskanzer. The fast table
4000: representation is a partial implementation of a design done by Van
4001: Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
4002: .PP
4003: Thanks to the many
4004: .I flex
4005: beta-testers, feedbackers, and contributors, especially Francois Pinard,
4006: Casey Leedom,
4007: Robert Abramovitz,
4008: Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4009: Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4010: Karl Berry, Peter A. Bigot, Simon Blanchard,
4011: Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4012: Brian Clapper, J.T. Conklin,
4013: Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4014: Daniels, Chris G. Demetriou, Theo Deraadt,
4015: Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4016: Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4017: Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4018: Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4019: Jan Hajic, Charles Hemphill, NORO Hideo,
4020: Jarkko Hietaniemi, Scott Hofmann,
4021: Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4022: Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4023: Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4024: Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4025: Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4026: Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4027: David Loffredo, Mike Long,
4028: Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4029: Bengt Martensson, Chris Metcalf,
4030: Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4031: G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4032: Richard Ohnemus, Karsten Pahnke,
4033: Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
4034: Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4035: Frederic Raimbault, Pat Rankin, Rick Richardson,
4036: Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4037: Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4038: Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4039: Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4040: Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4041: Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4042: Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
4043: Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4044: and those whose names have slipped my marginal
4045: mail-archiving skills but whose contributions are appreciated all the
4046: same.
4047: .PP
4048: Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4049: John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4050: Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4051: distribution headaches.
4052: .PP
4053: Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
4054: Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
4055: Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
4056: Eric Hughes for support of multiple buffers.
4057: .PP
4058: This work was primarily done when I was with the Real Time Systems Group
4059: at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
4060: for the support I received.
4061: .PP
4062: Send comments to vern@ee.lbl.gov.