Adventures of a DTrace Addict: Part 1.0

Lex, Yacc, and Embedding DTrace Probes

The purpose of life is to do whatever turns you on. -me

Lex and Yacc are two tools that have aided Unix developers in writing
compilers for decades. Yet these tools lack basic debugging
facilities. This is my attempt to add some DTrace magic to Lex and Yacc.

Two unix tools that have always fascinated me are lex and yacc. Admittedly, it doesn’t take much to fascinate me; in fact, I often
get distracted by shiny stuff.

I’ve always wanted to understand how lex and yacc work, and I’ve wanted to
be able to observe what they’re doing in real time. Not for any particular
reason, aside from the fact that observability turns me on (I’m like that).

Come to think of it, lex and yacc, despite being staples of parsing on unix
systems, don’t have any significant instrumentation at all. For
decades, these tools have been nothing more than generators of magical black
boxes. These black boxes are used all around unix — even DTrace uses lex+yacc
powered parsing for its D language — but no one can actually see what the
generated code is doing (short of tracing the program instruction by
instruction, which is, well, complex).

Apparently they work, and the users and creators of these tools did just fine
without these newfangled kernel-based instrumentation frameworks going around.
In fact, it’s questionable whether or not instrumentation would even be
helpful. Let’s face it: these tools were designed by people smarter than blokes
like us. So maybe this is a matter that should be left alone…

But, there’s still that… itch. That notion that keeps nagging at you while
you’re awake. Sometimes even when you’re asleep (ever dream about DTracing some
cool app? I do, all the time). “My. I wonder if baking some DTrace probes into
$APP_FOO would make it better… nah.” It’s like a mosquito buzzing about
your ear in the middle of the night, waking you up periodically. Maybe you
turn over, you try to ignore it, but at some point your baggy, bloodshot eyes shoot
open, and BAM! One blow, swift and impulsive. The nuisance is gone, the itch
has been scratched, now you can sleep well at night. Temporarily at least, you
will dream like normal people do. And then it starts all over again…

At any rate, I spent a few sleepless summer nights, fueled by a chaotic
enthusiasm, tracing and instrumenting lex (which, by the way, matches input
tokens using regular expressions and either returns token values or performs
some other specified action). Do a

man lex

for the full details.

Don’t get me wrong. I’d love to hold your hand and walk you through this, but
who has the time? Not me, certainly. Callous, perhaps, but there you go.

I used the first example in the man page as a test bed:

%{
/* need this for the call to atof() below */
#include <math.h>
/* need this for printf(), fopen() and stdin below */
#include <stdio.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%

{DIGIT}+                          {
    printf("An integer: %s (%d)\n", yytext,
    atoi(yytext));
    }

{DIGIT}+"."{DIGIT}*    {
    printf("A float: %s (%g)\n", yytext,
    atof(yytext));
    }

if|then|begin|end|procedure|function        {
    printf("A keyword: %s\n", yytext);
    }

{ID}                   printf("An identifier: %s\n", yytext);

"+"|"-"|"*"|"/"        printf("An operator: %s\n", yytext);

"{"[^}\n]*"}"         /* eat up one-line comments */

[ \t\n]+               /* eat up white space */

.                      printf("Unrecognized character: %s\n", yytext);

%%

int main(int argc, char *argv[])
{
    ++argv, --argc;  /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;

    yylex();
}

I began by tracing just the function calls made in the binary itself,
avoiding any other module (like libc).

pfexec dtrace -F -c './lexy' -n
    'pid$target:lexy::entry, pid$target:lexy::return {}'

As can be seen below, everything is wrapped in a big function called yylex,
where “everything” refers to a large number of calls to yylook.
Unfortunately, yylook takes no arguments and returns no values (I looked at
the source code), so we’ll have to drill a little deeper to see what functions
are getting called in yylook. Either way, simply using the pid provider is
insufficient to see the inner workings of lex.

A float: 9.9 (9.9)
An integer: 10 (10)
An integer: 5 (5)
A float: 3.333 (3.333)
CPU FUNCTION                                 
  0  -> _start                                
  0    -> __fsr                               
  0    <- __fsr                               
  0    -> main                                
  0      -> yylex                             
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0        -> yylook                          
  0        <- yylook                          
  0      <- yylex                             
  0    <- main
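
(One quick way to drill deeper, as a sketch I won’t pursue here: count the
libc calls made while yylook is on the stack, using nothing but the pid
provider.)

pid$target::yylook:entry
{
    self->in_yylook = 1;
}

pid$target::yylook:return
{
    self->in_yylook = 0;
}

/* count the libc calls made on yylook's behalf */
pid$target:libc.so*::entry
/self->in_yylook/
{
    @[probefunc] = count();
}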

In order to get meaningful data about the lexing process, I would have to embed
semantically meaningful probes into the generated lex program. The most
straightforward way of doing this is to add the probes into the actions
specified for each match. (UPDATE: See this post, this post, and this post
about USDT probes.)
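
For instance, here’s a hypothetical sketch of an instrumented action, assuming
you’ve declared a match(char*) probe in a provider file and run it through
dtrace -h (which generates a LEX_MATCH macro; the generated header would get
included in the definitions section):

{DIGIT}+               {
    LEX_MATCH(yytext);  /* hypothetical USDT probe */
    printf("An integer: %s (%d)\n", yytext,
        atoi(yytext));
    }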

That would be kind of cool, but kind of trivial at the same time. The next
logical step would be to actually look at yylook, and probe the snot out of
it. This would be much cooler, considering that lex is really just a state
machine, and I would be able to see the states lex goes through as it lexes the
input.

Of course, manually modifying the lex-generated program is, well, manual. So I
did what any lazy, self-respecting computer scientist would do: I modified the
lex program to generate a DTrace source file and an instrumented lexer. All
that needs to be done is the linking at build time. I added a ‘-D’ option to
lex which toggles the generation of a probed lexer plus the D source file.
It’s in my git repo, available here. Keep in mind that lex
is part of Illumos, and you’d have to do a bit of a dance involving some scripts
in order to get the new and improved lex bits. See the install notes at the end
of this post. Yeah, it’s a pain. But it’s well worth it.

Unless.

Unless, of course, you weren’t one of us. Unless, you were but a mere muggle
watching, awe-struck and slack-jawed as a wizard puts on a show, demonstrating
the wickedness of his magic; making his caprice take its course.

To the wizardly, the inconvenience is worth it. Because magic is an end in and
of itself. Muggles, however, wish to leverage magic to their own low-brow,
pointy-haired purposes.

Enough about muggles. We’re wizards. And all wizards have a craving for one
thing: data.

Data, data, data. Embedded DTrace probes now allowed for far more data to be
collected during the lexing process. I’ll quickly demonstrate what one can do
with the new and improved lex, using a quick-and-dirty lexer for the C
programming language.

Here’s the code for the C lexer:

/* C lexer */
%a 1000
%p 2000
%{
#include "y.tab.h"
/*
 * Used for debugging purposes...
 */
int
indicator()
{
    return (0);
}

/*
 * We define a set of labeled states for use in the C lexer, and then
 * some regexes for neatness.
 *
 * The states are:
 *
 * S0 - C code that is not a comment, a string, or a character.
 * S1 - C comment blocks
 * S2 - C line comment
 */
%}

%s S0 S1 S2

RGX_WS          [\f\r\t\v ]*
RGX_WS_DT       [\f\n\r\t\v ]
RGX_NL          \n+
/* C Pre Proc */
RGX_INCL        "#"{RGX_WS}"include"
RGX_INCLLIB     "<"{RGX_PATH}">"
RGX_INCLUSR     \"{RGX_PATH}\"
RGX_DEF         "#"{RGX_WS}"define"
RGX_UNDEF       "#"{RGX_WS}"undef"
RGX_MACIF       "#"{RGX_WS}"if"
RGX_MACELSE     "#"{RGX_WS}"else"
RGX_MACELIF     "#"{RGX_WS}"elif"
RGX_MACENDIF    "#"{RGX_WS}"endif"
RGX_IFDEF       "#"{RGX_WS}"ifdef"
RGX_IFNDEF      "#"{RGX_WS}"ifndef"
RGX_PRAGMA      "#"{RGX_WS}"pragma"
RGX_WARNING     "#"{RGX_WS}"warning"
RGX_ERROR       "#"{RGX_WS}"error"
RGX_LINE        "#"{RGX_WS}"line"
RGX_LINCONT     \\
/* C Num Constants */
RGX_DIGIT       [0-9]
RGX_INT         {RGX_DIGIT}+
RGX_FLOAT       {RGX_DIGIT}+"."{RGX_DIGIT}+
RGX_HEX         0x[0-9a-fA-F]+
/* C Types */
RGX_TCHAR       char
RGX_TINT        int
RGX_TFLOAT      float
RGX_TDBL        double
RGX_TLONG       long
RGX_TSHORT      short
RGX_SIGNED      signed
RGX_UNSIGNED    unsigned
RGX_EXT         extern
RGX_STAT        static
RGX_TYPEDEF     typedef
RGX_STRUCT      struct
RGX_ENUM        enum
RGX_UNI         union
RGX_SO          sizeof
/* C Control Flow */
RGX_IF          "if"
RGX_THEN        "then"
RGX_ELSE        "else"
RGX_GOTO        "goto"
RGX_RET         "return"
RGX_SWITCH      "switch"
RGX_CASE        "case"
RGX_DEFAULT     "default"
RGX_BREAK       "break"
RGX_CONTIN      "continue"
RGX_LABLE       [a-zA-Z0-9]+:
RGX_DELIM       ";"
RGX_BEGIN       "{"
RGX_END         "}"
RGX_PARENO      "("
RGX_PARENC      ")"
RGX_COMMA       ","
/* C Operators */
RGX_MBR         "."
RGX_PMBR        "->"
RGX_INC         "++"
RGX_DEC         "--"
RGX_MULEQ       "*="
RGX_MULDEREF    "*"
RGX_DIVEQ       "/="
RGX_DIV         "/"
RGX_ADDEQ       "+="
RGX_ADD         "+"
RGX_MINEQ       "-="
RGX_MIN         "-"
RGX_MODEQ       "%="
RGX_MOD         "%"
RGX_TRTHEN      "?"
RGX_TRELSE      ":"
RGX_LAND        "&&"
RGX_LOR         "||"
RGX_LEQ         "=="
RGX_NEQ         "!="
RGX_GTE         ">="
RGX_GT          ">"
RGX_LTE         "<="
RGX_LT          "<"
RGX_BAND        "&"
RGX_BOR         "|"
RGX_XOR         "^"
RGX_NOT         "!"
RGX_RSH         ">>"
RGX_LSH         "<<"
RGX_EQ          "="
RGX_ONECMP      "~"
/* C String Constants */
RGX_WORD        [a-zA-Z0-9]+
RGX_NAME        [a-zA-Z0-9\_]+
RGX_PATH        [a-zA-Z0-9/\-\_\.]+
/* RGX_ASCII    [`-=~!@#$%^&*()_+[]\;',./{}|:"<>?a-zA-Z0-9]+ */
RGX_STR         ([^"\\\n]|\\[^"\n]|\\\")*
RGX_CHR         ([^'\\\n]|\\[^'\n]|\\')*
RGX_CMT         .|\n
%{
%}

%%

<S0>{RGX_NL}            { ; }
<S2>{RGX_NL}            { BEGIN(S0); }
<S0>"/*"                { BEGIN(S1); }
<S1>"*/"                { BEGIN(S0); }
<S1>{RGX_CMT}           { ; }
<S2>{RGX_CMT}           { ; }
<S0>"//"                { BEGIN(S2); }
<S0>{RGX_INCL}          { return (TOK_INCL); }
<S0>{RGX_DEF}           { return (TOK_DEF); }
<S0>{RGX_UNDEF}         { return (TOK_UNDEF); }
<S0>{RGX_MACIF}         { return (TOK_MACIF); }
<S0>{RGX_MACELSE}       { return (TOK_MACELSE); }
<S0>{RGX_MACELIF}       { return (TOK_MACELIF); }
<S0>{RGX_MACENDIF}      { return (TOK_MACENDIF); }
<S0>{RGX_IFDEF}         { return (TOK_IFDEF); }
<S0>{RGX_IFNDEF}        { return (TOK_IFNDEF); }
<S0>{RGX_PRAGMA}        { return (TOK_PRAGMA); }
<S0>{RGX_WARNING}       { return (TOK_WARNING); }
<S0>{RGX_ERROR}         { return (TOK_ERROR); }
<S0>{RGX_LINE}          { return (TOK_LINE); }
<S0>{RGX_LINCONT}       { return (TOK_LINCONT); }
<S0>\<{RGX_PATH}\>      { return (TOK_INCLIB); }
<S0>\"{RGX_PATH}\"      { return (TOK_INCUSR); }
<S0>{RGX_DIGIT}         { return (TOK_DIGIT); }
<S0>{RGX_INT}           { return (TOK_INT); }
<S0>{RGX_FLOAT}         { return (TOK_FLOAT); }
<S0>{RGX_HEX}           { return (TOK_HEX); }
<S0>{RGX_TCHAR}         { return (TOK_TCHAR); }
<S0>{RGX_TINT}          { return (TOK_TINT); }
<S0>{RGX_TFLOAT}        { return (TOK_TFLOAT); }
<S0>{RGX_TDBL}          { return (TOK_TDBL); }
<S0>{RGX_TLONG}         { return (TOK_TLONG); }
<S0>{RGX_TSHORT}        { return (TOK_TSHORT); }
<S0>{RGX_SIGNED}        { return (TOK_SIGNED); }
<S0>{RGX_UNSIGNED}      { return (TOK_UNSIGNED); }
<S0>{RGX_EXT}           { return (TOK_EXT); }
<S0>{RGX_STAT}          { return (TOK_STAT); }
<S0>{RGX_TYPEDEF}       { return (TOK_TYPEDEF); }
<S0>{RGX_STRUCT}        { return (TOK_STRUCT); }
<S0>{RGX_ENUM}          { return (TOK_ENUM); }
<S0>{RGX_UNI}           { return (TOK_UNI); }
<S0>{RGX_SO}            { return (TOK_SO); }
<S0>"["                 { return (TOK_ARRO); }
<S0>"]"                 { return (TOK_ARRC); }
<S0>{RGX_IF}            { return (TOK_IF); }
<S0>{RGX_THEN}          { return (TOK_THEN); }
<S0>{RGX_ELSE}          { return (TOK_ELSE); }
<S0>{RGX_GOTO}          { return (TOK_GOTO); }
<S0>{RGX_RET}           { return (TOK_RET); }
<S0>{RGX_LABLE}         { return (TOK_LABLE); }
<S0>{RGX_DELIM}         { return (TOK_DELIM); }
<S0>{RGX_BEGIN}         { return (TOK_BEGIN); }
<S0>{RGX_END}           { return (TOK_END); }
<S0>{RGX_PARENO}        { return (TOK_PARENO); }
<S0>{RGX_PARENC}        { return (TOK_PARENC); }
<S0>{RGX_COMMA}         { return (TOK_COMMA); }
<S0>{RGX_MBR}           { return (TOK_MBR); }
<S0>{RGX_PMBR}          { return (TOK_PMBR); }
<S0>{RGX_INC}           { return (TOK_INC); }
<S0>{RGX_DEC}           { return (TOK_DEC); }
<S0>{RGX_MULEQ}         { return (TOK_MULEQ); }
<S0>{RGX_MULDEREF}      { return (TOK_MULDEREF); }
<S0>{RGX_DIVEQ}         { return (TOK_DIVEQ); }
<S0>{RGX_DIV}           { return (TOK_DIV); }
<S0>{RGX_ADDEQ}         { return (TOK_ADDEQ); }
<S0>{RGX_ADD}           { return (TOK_ADD); }
<S0>{RGX_MINEQ}         { return (TOK_MINEQ); }
<S0>{RGX_MIN}           { return (TOK_MIN); }
<S0>{RGX_MODEQ}         { return (TOK_MODEQ); }
<S0>{RGX_MOD}           { return (TOK_MOD); }
<S0>{RGX_TRTHEN}        { return (TOK_TRTHEN); }
<S0>{RGX_TRELSE}        { return (TOK_TRELSE); }
<S0>{RGX_LAND}          { return (TOK_LAND); }
<S0>{RGX_LOR}           { return (TOK_LOR); }
<S0>{RGX_LEQ}           { return (TOK_LEQ); }
<S0>{RGX_NEQ}           { return (TOK_NEQ); }
<S0>{RGX_GTE}           { return (TOK_GTE); }
<S0>{RGX_GT}            { return (TOK_GT); }
<S0>{RGX_LTE}           { return (TOK_LTE); }
<S0>{RGX_LT}            { return (TOK_LT); }
<S0>{RGX_BAND}          { return (TOK_BAND); }
<S0>{RGX_BOR}           { return (TOK_BOR); }
<S0>{RGX_XOR}           { return (TOK_XOR); }
<S0>{RGX_NOT}           { return (TOK_NOT); }
<S0>{RGX_RSH}           { return (TOK_RSH); }
<S0>{RGX_LSH}           { return (TOK_LSH); }
<S0>{RGX_EQ}            { return (TOK_EQ); }
<S0>{RGX_ONECMP}        { return (TOK_ONECMP); }
<S0>{RGX_NAME}          { return (TOK_NAME); }
<S0>{RGX_WORD}          { return (TOK_WORD); }
<S0>\"{RGX_STR}\"       |
<S2>\"{RGX_STR}\"       {
                ;
            }
<S0>\'{RGX_CHR}\'       { ; }
<S0>{RGX_WS}    { ; }
%%
int
main(int ac, char *av[])
{
    ++av; --ac;
    if (ac > 0) {
        yyin = fopen(av[0], "r");
    } else {
        yyin = stdin;
    }
    BEGIN(S0);
    while (!feof(yyin)) {
        yylex();
    }
}

We’ll be using this lexer to lex the C source file that lex generated from the
specification above; in other words, the C lexer will lex itself.

First, the available probes:

provider lex {
    /* Tracks the lexer's state transitions */
    probe state(int);
    /* Indicates which action was taken */
    probe action(int);
    /* Fires whenever a match occurs; arg0 is matched str */
    probe match(char*);
    /* The current character */
    probe ch(char);
    /* Fires whenever the lexer backtracks; arg0 is the state fallen back to */
    probe fallback_st(int);
    /* The current character while backtracking */
    probe fallback_ch(char);
    /* Fires whenever we enter a compressed state */
    probe compr_st();
    /* Fires whenever the lexer stops to complete a match */
    probe stop();
    /* Fires whenever the user uses the BEGIN macro */
    probe begin(int, void*);
    /* Fires whenever the user uses the REJECT macro */
    probe reject();
};

As you can see, we probe events such as state transitions, matches, action
execution, backtracking, and the current character.
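
(A quick sanity check that the probes actually made it into the binary is to
have DTrace list them:

pfexec dtrace -l -n 'lex$target:::' -c './clexer lex.yy.c'

You should see where each probe site landed in the generated code, mostly
inside yylook and yylex.)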

So the first thing we will do is a high-resolution state-by-state,
character-by-character, match-by-match trace of the C lexer lexing itself.

The following script, trace.d, is used:

lex$target:::match
{
}

lex$target:::reject
{
}

lex$target:::fallback_st
{
    trace(arg0);
}

lex$target:::state
{
    trace(arg0);
}

lex$target:::begin
{
    trace(arg0);
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 != '\t' && arg0 != '\n' && arg0 != '\0'/
{
    printf(": '%c'\n", (char) arg0);
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\t'/
{
    printf(": '\\t'\n");
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\n'/
{
    printf(": '\\n'\n");
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\0'/
{
    printf(": '\\0'\n");
}

pid$target::yylex:entry
{
}

pid$target::yylex:return
{
    trace(arg1);
}

However, keep in mind that if you’re not using the BEGIN or REJECT macro in any
of your lex code, you’ll have to comment out the clauses that enable
those probes.

Running it on the lexer (the clexer binary):

pfexec dtrace -c './clexer lex.yy.c' -s trace.d

Here are the first few lines of output:

CPU     ID                    FUNCTION:NAME
  3  72077                       main:begin                 2
  3  72136                      yylex:entry 
  3  72084                     yylook:state                 3
  3  72079                        yylook:ch : '#'

  3  72082               yylook:fallback_st                 2
  3  72084                     yylook:state                12
  3  72079                        yylook:ch : 'i'

  3  72084                     yylook:state                65
  3  72079                        yylook:ch : 'n'

  3  72084                     yylook:state               122
  3  72079                        yylook:ch : 'c'

  3  72084                     yylook:state               156
  3  72079                        yylook:ch : 'l'

  3  72084                     yylook:state               186
  3  72079                        yylook:ch : 'u'

  3  72084                     yylook:state               208
  3  72079                        yylook:ch : 'd'

  3  72084                     yylook:state               223
  3  72079                        yylook:ch : 'e'

  3  72084                     yylook:state               228
  3  72083                     yylook:match 
  3  72138                     yylex:return               260
  3  72136                      yylex:entry 
  3  72084                     yylook:state                 2
  3  72079                        yylook:ch : ' '

  3  72084                     yylook:state                 8
  3  72079                        yylook:ch : '<'

  3  72083                     yylook:match 
  3  72084                     yylook:state                 2
  3  72079                        yylook:ch : '<'

  3  72084                     yylook:state                28
  3  72079                        yylook:ch : 's'
...

As you can see, we also trace the return value of yylex, which is the integer
that represents a particular token. The definitions can be found in y.tab.h,
which was generated from an incomplete yacc grammar.

Full y.tab.h:

# define TOK_SPACES 257
# define TOK_NL 258
# define TOK_TB 259
# define TOK_INCL 260
# define TOK_DEF 261
# define TOK_UNDEF 262
# define TOK_MACIF 263
# define TOK_MACELSE 264
# define TOK_MACELIF 265
# define TOK_MACENDIF 266
# define TOK_IFDEF 267
# define TOK_IFNDEF 268
# define TOK_PRAGMA 269
# define TOK_WARNING 270
# define TOK_ERROR 271
# define TOK_LINE 272
# define TOK_LINCONT 273
# define TOK_INCLIB 274
# define TOK_INCUSR 275
# define TOK_DIGIT 276
# define TOK_INT 277
# define TOK_FLOAT 278
# define TOK_HEX 279
# define TOK_TCHAR 280
# define TOK_TINT 281
# define TOK_TFLOAT 282
# define TOK_TDBL 283
# define TOK_TLONG 284
# define TOK_TSHORT 285
# define TOK_SIGNED 286
# define TOK_UNSIGNED 287
# define TOK_EXT 288
# define TOK_STAT 289
# define TOK_TYPEDEF 290
# define TOK_STRUCT 291
# define TOK_ENUM 292
# define TOK_UNI 293
# define TOK_SO 294
# define TOK_ARRO 295
# define TOK_ARRC 296
# define TOK_IF 297
# define TOK_THEN 298
# define TOK_ELSE 299
# define TOK_GOTO 300
# define TOK_RET 301
# define TOK_LABLE 302
# define TOK_DELIM 303
# define TOK_BEGIN 304
# define TOK_END 305
# define TOK_PARENO 306
# define TOK_PARENC 307
# define TOK_COMMA 308
# define TOK_MBR 309
# define TOK_PMBR 310
# define TOK_INC 311
# define TOK_DEC 312
# define TOK_MULEQ 313
# define TOK_MULDEREF 314
# define TOK_DIVEQ 315
# define TOK_DIV 316
# define TOK_ADDEQ 317
# define TOK_ADD 318
# define TOK_MINEQ 319
# define TOK_MIN 320
# define TOK_MODEQ 321
# define TOK_MOD 322
# define TOK_TRTHEN 323
# define TOK_TRELSE 324
# define TOK_LAND 325
# define TOK_LOR 326
# define TOK_LEQ 327
# define TOK_NEQ 328
# define TOK_GTE 329
# define TOK_GT 330
# define TOK_LTE 331
# define TOK_LT 332
# define TOK_BAND 333
# define TOK_BOR 334
# define TOK_XOR 335
# define TOK_NOT 336
# define TOK_RSH 337
# define TOK_LSH 338
# define TOK_EQ 339
# define TOK_ONECMP 340
# define TOK_NAME 341
# define TOK_WORD 342
# define TOK_CHR 343
# define TOK_STR 344

Now we will do a very similar trace, except we will also print timestamps in
the rightmost column for particular events like matches, fallbacks, etc.,
using the following script, trace_stamp.d:

lex$target:::match
{
    trace(vtimestamp);
}

lex$target:::reject
{
    trace(vtimestamp);
}

lex$target:::fallback_st
{
    trace(arg0);
    trace(vtimestamp);
}

lex$target:::state
{
    trace(arg0);
    trace(vtimestamp);
}

lex$target:::begin
{
    trace(arg0);
    trace(vtimestamp);
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 != '\t' && arg0 != '\n' && arg0 != '\0'/
{
    printf(": '%c'\n", (char) arg0);
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\t'/
{
    printf(": '\\t'\n");
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\n'/
{
    printf(": '\\n'\n");
}

lex$target:::fallback_ch,
lex$target:::ch
/arg0 == '\0'/
{
    printf(": '\\0'\n");
}

pid$target::yylex:entry
{
    trace(vtimestamp);
}

pid$target::yylex:return
{
    trace(arg1);
    trace(vtimestamp);
}

The note about commenting out the BEGIN or REJECT clauses applies to this
script as well.

And now, the trace:

CPU     ID                    FUNCTION:NAME
  2  72077                       main:begin                 2      833713
  2  72136                      yylex:entry            835272
  2  72084                     yylook:state                 3      849428
  2  72079                        yylook:ch : '#'

  2  72082               yylook:fallback_st                 2      983644
  2  72084                     yylook:state                12      984585
  2  72079                        yylook:ch : 'i'

  2  72084                     yylook:state                65      985769
  2  72079                        yylook:ch : 'n'

  2  72084                     yylook:state               122      986795
  2  72079                        yylook:ch : 'c'

  2  72084                     yylook:state               156      987847
  2  72079                        yylook:ch : 'l'

  2  72084                     yylook:state               186      988972
  2  72079                        yylook:ch : 'u'

  2  72084                     yylook:state               208      990097
  2  72079                        yylook:ch : 'd'

  2  72084                     yylook:state               223      991058
  2  72079                        yylook:ch : 'e'

  2  72084                     yylook:state               228      992023
  2  72083                     yylook:match            993829
  2  72138                     yylex:return               260      995786
  2  72136                      yylex:entry            996406
  2  72084                     yylook:state                 2      996916
  2  72079                        yylook:ch : ' '

  2  72084                     yylook:state                 8      997841
  2  72079                        yylook:ch : '<'

  2  72083                     yylook:match            999122
  2  72084                     yylook:state                 2      999588
  2  72079                        yylook:ch : '<'

  2  72084                     yylook:state                28     1000575
  2  72079                        yylook:ch : 's'
...

So the last two outputs are very detailed, and can be very useful in debugging
the lexer. However, examining a complete trace from top to bottom is a
tedious, time-consuming task (the full output for each of the above traces is
195,352 lines long).

As an alternative, one can use DTrace to ask a very specific question about the
lexing process as a whole, without having to manually sift through reams of
output.

For instance, we may wish to see how many times a particular token is matched,
using tok_count.d:

pid$target::yylex:return
{
    @[arg1] = count();
}
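
Run it like the other scripts: pfexec dtrace -c './clexer lex.yy.c' -s tok_count.d.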

The output:

0                  1
269                1
287                1
289                1
294                1
317                1
332                1
329                2
274                3
328                3
331                3
324                4
336                4
260                5
299                5
333                5
300                6
323                6
326                6
288                7
263                8
268                9
325               10
264               12
311               12
312               12
280               13
330               13
291               15
310               16
267               18
327               19
261               24
295               25
296               25
320               32
266               35
297               41
281               44
314               60
339               71
301               93
302              102
272              104
275              109
304              151
305              151
306              307
307              307
303              371
318              531
276             1201
341             1433
277             2186
308             3272

We may also wish to see what the average lex time per token type is, using
tok_time_avg.d:

pid$target::yylex:entry
{
    self->ts = vtimestamp;
}

pid$target::yylex:return
{
    @[arg1] = avg(vtimestamp - self->ts);
}

Output of pfexec dtrace -c './clexer lex.yy.c' -s tok_time_avg.d:

308              509
323              512
295              518
296              524
324              534
303              552
320              554
318              559
310              566
312              569
306              570
336              595
311              606
333              631
330              635
327              637
339              638
314              639
276              644
332              660
325              665
328              673
317              678
329              688
277              694
331              699
326              700
305              718
304              767
280              802
302              831
264              836
268              859
266              861
341              879
294              884
287              894
288              906
281              915
297              941
291              946
299              961
300              966
301              973
272             1050
267             1124
263             1257
275             1433
274             1550
307             1641
261             1879
289             5318
0              18105
260            26437
269            66109

So, on average, token 269 takes the longest to lex. Looking at y.tab.h, 269 is
TOK_PRAGMA. 260, the runner-up, is TOK_INCL, and so forth.

To see what is so slow about lexing a pragma, we’ll use trace_pragma.d, a
modified version of trace_stamp.d. trace_pragma.d uses DTrace’s speculative
tracing facility to display the trace data for pragma matches and nothing else.
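
I won’t reproduce all of trace_pragma.d here, but its speculative core looks
something like this (a sketch: buffer everything traced during a yylex call in
a speculation, and commit the buffer only when yylex returns TOK_PRAGMA, 269;
the fallback and begin clauses get the same treatment):

pid$target::yylex:entry
{
    self->spec = speculation();
    speculate(self->spec);
    trace(vtimestamp);
}

lex$target:::state
/self->spec/
{
    speculate(self->spec);
    trace(arg0);
    trace(vtimestamp);
}

lex$target:::ch
/self->spec/
{
    speculate(self->spec);
    printf(": '%c'\n", (char) arg0);
}

pid$target::yylex:return
/self->spec/
{
    speculate(self->spec);
    trace(arg1);
    trace(vtimestamp);
}

/* a pragma: commit the speculatively buffered output */
pid$target::yylex:return
/self->spec && arg1 == 269/
{
    commit(self->spec);
    self->spec = 0;
}

/* anything else: throw the buffer away */
pid$target::yylex:return
/self->spec/
{
    discard(self->spec);
    self->spec = 0;
}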

The output of that trace is 6312 lines long. And it encompasses only a single
pragma match.

Here’s the initial screen of text:

CPU     ID                    FUNCTION:NAME
  2  72076                      yylex:entry          61977649
  2  72086                     yylook:state                 2     61978154
  2  72081                        yylook:ch : '\n'

  2  72086                     yylook:state                 9     61979102
  2  72081                        yylook:ch : '/'

  2  72085                     yylook:match          61980055
  2  72086                     yylook:state                 3     61980550
  2  72081                        yylook:ch : '/'

  2  72084               yylook:fallback_st                 2     61981495
  2  72086                     yylook:state                23     61981973
  2  72081                        yylook:ch : '*'

  2  72086                     yylook:state                80     61982925
  2  72085                     yylook:match          61983512
  2  72080                      yylex:begin                 4     61985065
  2  72086                     yylook:state                 4     61985521
  2  72081                        yylook:ch : '\n'

  2  72086                     yylook:state                53     61986638
  2  72085                     yylook:match          61987130
  2  72086                     yylook:state                 5     61987618
  2  72081                        yylook:ch : ' '

  2  72083               yylook:fallback_ch : '\t'

  2  72084               yylook:fallback_st                 4     61990368
  2  72083               yylook:fallback_ch : '\t'

  2  72086                     yylook:state                53     61991362
  2  72085                     yylook:match          61991838
  2  72086                     yylook:state                 4     61992306
  2  72081                        yylook:ch : '*'

  2  72086                     yylook:state                54     61993324
  2  72081                        yylook:ch : ' '

  2  72085                     yylook:match          61994266
  2  72086                     yylook:state                 4     61994728
  2  72081                        yylook:ch : ' '
...

As you can see, we start off by lexing a block comment. We don’t finish
lexing the block comment until line 5879 of the output.

The reason yylex takes so long to return on a pragma is that it never
returns while matching a comment; it just keeps lexing and matching until it
matches something that it has to return, in this case a pragma token.

If that comment weren’t there, right above the pragma, yylex might have
returned much faster.

Looking at the bottom of the output, we see that matching a pragma token takes
more than 8 microseconds. However, because of the multitude of probe enablings
in this script, we have to take overhead into account.

Bottom of output:

...
  2  72085                     yylook:match          64343356
  2  72086                     yylook:state                 3     64343869
  2  72081                        yylook:ch : '#'

  2  72084               yylook:fallback_st                 2     64344798
  2  72086                     yylook:state                12     64345273
  2  72081                        yylook:ch : 'p'

  2  72086                     yylook:state                67     64346255
  2  72081                        yylook:ch : 'r'

  2  72086                     yylook:state               124     64347347
  2  72081                        yylook:ch : 'a'

  2  72086                     yylook:state               158     64348283
  2  72081                        yylook:ch : 'g'

  2  72086                     yylook:state               188     64349246
  2  72081                        yylook:ch : 'm'

  2  72086                     yylook:state               209     64350288
  2  72081                        yylook:ch : 'a'

  2  72086                     yylook:state               224     64351204
  2  72085                     yylook:match          64351727
  2  72138                     yylex:return               269     64352499

A better way to see how long lex takes to match a pragma is to trace the time
between matches. We will use the final state that occurs before a pragma is
matched (224) to determine that this is the match that we care about.

The script used, match_fst_time_avg.d:

/*
 * prevts is a global: thread-local variables assigned in dtrace:::BEGIN
 * would belong to the dtrace consumer's thread, not the lexer's. The
 * target is single-threaded, so a global is safe, and the first sample
 * simply absorbs process startup.
 */
uint64_t prevts;

lex$target:::state
{
    self->st = arg0;
}

lex$target:::match
{
    this->ts = vtimestamp;
    @[self->st] = avg(this->ts - prevts);
    prevts = this->ts;
}

The output:

 14             1397
 53             1401
 28             1413
 51             1433
 54             1434
 49             1464
 25             1486
 29             1509
 30             1558
 18             1596
 24             1606
 16             1660
 38             1797
 10             1819
 17             1828
 31             1838
 35             1838
 27             1842
 71             1842
 58             1845
 20             1848
 26             1852
  9             1854
110             1854
 33             1856
 80             1862
 89             1871
 21             1881
 19             1886
 91             1887
 76             1895
111             1897
100             1939
  8             2015
 90             2026
 77             2264
 75             2286
 79             2303
 84             2309
121             2382
 85             2395
137             2505
163             2840
167             2873
161             3090
 72             3318
181             3319
187             3780
 32             3857
213             3860
216             3864
217             3868
218             3883
214             3895
204             3931
206             4094
222             4258
221             4717
224             4898
230             4973
 37             5461
129             6119
116             7140
 60            11786
228           176927

As you can see, matching pragmas is actually fast (about 5 microseconds, as
opposed to 66).

I rather prefer this script to tok_time_avg.d, because this script’s
measurements reflect the average time between matches, regardless of whether or
not a particular match causes yylex to return.

And the following script associates final states with token types, discarding
the states that have no token type.

tok_fstate.d:

lex$target:::state
{
    self->st = arg0;
}

pid$target::yylex:return
{
    @[arg1, self->st] = count();
}

So, here is the table of tokens and their associated final states:

  0                3                1
269              224                1
287              230                1
289              217                1
294              216                1
317               76                1
332               28                1
329               91                2
274              129                3
328               58                3
331               89                3
324               26                4
336               10                4
260              228                5
299              163                5
333               14                5
300              167                6
323               31                6
326              110                6
288              213                7
263              121                8
268              222                9
325               71               10
264              181               12
311               75               12
312               77               12
280              161               13
330               30               13
291              218               15
341               38               15
310               79               16
267              206               18
327               90               19
261              221               24
295               33               25
296               35               25
320               21               32
266              204               35
297              100               41
281              137               44
314               18               60
339               29               71
301              214               93

So, as you can see, the most expensive match is the token that has final
state 228, which is TOK_INCL (a “#include”).

The cheapest is a bitwise ‘and’ (“&”).

What surprises the hell out of me is that TOK_NAME (state 38; variable names,
function names, anything that’s not a keyword or operator) gets matched much
faster, on average, than TOK_INCL does. TOK_NAME gets matched in almost 2
microseconds; TOK_INCL takes 177 microseconds.

Perhaps this is because TOK_INCLs get matched at the beginning of the file,
when our clexer binary is just starting execution and has yet to have its
instructions and memory cached. But that’s just a hypothesis.

Of course, averages get boring after a while, at which point you’d probably
like to move on to scripts that employ DTrace’s quantize aggregating function,
like tok_fst_time_qtz.d:

/* same single-threaded global as in match_fst_time_avg.d */
uint64_t prevts;

lex$target:::state
{
    self->st = arg0;
}

lex$target:::match
{
    this->ts = vtimestamp;
    @[self->st] = quantize(this->ts - prevts);
    prevts = this->ts;
}

Tail of output:

...
           84
       value  ------------- Distribution ------------- count    
         512 |                                         0        
        1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              1461     
        2048 |@@@@@@@@@@@@@                            723      
        4096 |                                         0        
        8192 |                                         1        
       16384 |                                         0        
       32768 |                                         0        
       65536 |                                         0        
      131072 |                                         1        
      262144 |                                         0        

           20
       value  ------------- Distribution ------------- count    
         512 |                                         0        
        1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3270     
        2048 |                                         1        
        4096 |                                         1        
        8192 |                                         0        

            8
       value  ------------- Distribution ------------- count    
         512 |                                         0        
        1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     2888     
        2048 |@@@@                                     299      
        4096 |                                         11       
        8192 |                                         2        
       16384 |                                         0        
       32768 |                                         0        
       65536 |                                         0        
      131072 |                                         1        
      262144 |                                         0        

           32
       value  ------------- Distribution ------------- count    
         512 |                                         0        
        1024 |@                                        35       
        2048 |@@@@@@@@@@@@@@@@@@@                      624      
        4096 |@@@@@@@@@@@@@@@@@@@                      619      
        8192 |                                         2        
       16384 |                                         0        
       32768 |                                         0        
       65536 |                                         0        
      131072 |                                         1        
      262144 |                                         0

As far as I know, that’s a bit of dynamic introspection that the lexing process
never had! Cool. Now compiler writers on Illumos can shine a much brighter
flashlight at their lexers, illuminating some of the depths of this process.
And this is just the beginning; I hope to add more instrumentation in the
future. For example, instead of having to use final states and returned token
IDs, I’d like to be able to aggregate on the regular expressions defined in
the lex source file.

Though, it just occurred to me that I could use DTrace’s associative arrays to
associate final states with token types. A rough sketch follows; I’ll leave
polishing it as an exercise for the reader.
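
Here’s the idea, anyway, with the same single-threaded assumption as before:
learn the final-state-to-token mapping from yylex’s return values, then key the
between-match averages by token instead of by final state (states that haven’t
produced a return yet land under key 0):

uint64_t prevts;

lex$target:::state
{
    self->st = arg0;
}

/* remember which token each final state produced */
pid$target::yylex:return
{
    tok[self->st] = arg1;
}

lex$target:::match
{
    this->ts = vtimestamp;
    @[tok[self->st]] = avg(this->ts - prevts);
    prevts = this->ts;
}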

As for how lex works: it’s quite simple, but I’m not into hand-holding, as
mentioned earlier, so if lexing turns you on, check out the code. It’s
relatively straightforward, and could even use a few enhancements, such as
generating code that is prettier to read, removing some of the deficiencies in
the pattern matcher (there are always deficiencies 🙂 ), and so forth.

Also, there’s a repository with the DTrace scripts I used, available here.

So probing lex was a nice little warm-up. An appetizer. Foreplay, even. Now
for the main course: yacc.

Stay tuned, for part 1.5.

INSTALL NOTES:
* use bldenv to set up some environment variables
* run make
* copy the generated lex binary into wherever you keep your custom bins
* copy lex_probes.d and dtrace_ncform from common to /usr/share/lib/ccs/
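
For what it’s worth, once the new lex is installed, I’d expect building an
instrumented lexer to look roughly like this (file names are illustrative;
dtrace -G is the standard USDT step that resolves the probe sites into an
object you link in):

lex -D clexer.l
cc -c lex.yy.c
dtrace -G -s /usr/share/lib/ccs/lex_probes.d lex.yy.o
cc -o clexer lex.yy.o lex_probes.o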
