Lex Provider Update

A while ago, I wrote about the DTrace provider I made for lex, the lexical
analyzer that ships with Illumos.

I added one new probe to it, that solves a simple but particularly vexing
issue: tracing and aggregating on the regular expression used to match a token
instead of the actual token or the returned token-id. This new probe is called
match_regex.

The returned token-id (value returned by yylex()) would only be adequate in
identifying the most commonly matched regular expressions, if yylex()
necessarily returned on every single match. It doesn’t.

The two closest ways to achieve what the new probe achieves, is by (1)
recording the final state that occurred before a match, and then digging
through the generated lex.yy.c file to see what regular expression that state
correlates to, and (2) using the action probe to see where in the switch
statement we are jumping to. (2) No longer works, as the action probe has
been removed (it’s redundant with match_regex, and less helpful).

The first method is convoluted and inelegant. The second method isn’t so bad,
and is actually remarkably similar to match_regex. With action, one can go
to the specific case within the switch-statement that gets exectuded. and find
that lex has helpfully placed a # line macro, with a line number that
matches the line of the regular expression in the .lex source file. Using
this information, one could find which regular expressions are being matched,
without depending on the return value of yylex().

The actual match_regex probe, allows one the luxury of not having to juggle
two files (the .lex source file and the generated lex.yy.c file), and to
instead focus on just one (the .lex source file) by giving the line number of
the defined regular expression as its first argument. And, should one have
access to the generated lex.yy.c file, but not the .lex file, one can use
the second argument to get the line number of the probe itself (it’s located
within the switch statement).

Now a quick demo.

Here is a simple lexer (.lex) for a calculator:

#include <stdio.h>
#include <stdlib.h>

%%

[0-9]+          { return (0); }
"-"[0-9]+       { return (1); }
"+"             { return (2); }
"-"             { return (3); }
"*"             { return (4); }
"/"             { return (5); }

%%

int
main(int ac, char **av)
{

    ++av; --ac;
    if (ac > 0) {
        yyin = fopen(av[0], "r");
    } else {
        yyin = stdin;
    }

    while (!feof(yyin)) {
        yylex();
    }

}

Here is an input file for the lexer to process:

2 + 2
-2 + 2
3 * 90
60 / 5
4321 - 1234
+ 2 2 3 8 9
- 9 8 1

A simple dtrace invocation that runs the lexer, and indicates which regular
expressions were matched the most:

pfexec dtrace -c './calc_lex calc_in' -n 'lex$target:::match_regex {@[arg0] = count();}'

 7                1
10                1
11                1
 9                2
 8                3
 6               17

A slightly different dtrace invocation that is the same as the previous one,
but indicates the part of the switch statement we jump to when we match a
regex:

pfexec dtrace -c './calc_lex calc_in' -n 'lex$target:::match_regex {@[arg1] = count();}'

10                1
13                1
14                1
12                2
11                3
99               17

Admittedly, this is less useful than the previous invocation, but it’s better
than nothing, and I’ve seen repositories where people mindlessly include just
the generated file, but leave the .lex file out.

Not being able to quickly find the most commonly matched regexes was a major
deficiency of the initial iteration of this provider. Only two lines of code
were added (since the last iteration) to the lex source code (which is now on
github) to make this happen. (See the diffs for sub1.c, for details).

I’ve only used lex a dozen times or so, thus far, but the new provider has
dramatically sped up the compile-test-debug cycles, by allowing one to ask very
specific questions and use DTrace constructs like aggregations. Ultimately the
amount of times one has to recompile is reduced, and the volume of debug-output
one has to burrow through is microscopic compared to the dump of debug-printf’s
that lex uses when #ifdef LEXDEBUG is true. Also, one can use the lex
provider in conjunction with other providers offered by Illumos and
DTrace-aware processes.

The modified code is located in my own branch of illumos-joyent, on
Github.

Happy lexing.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s