Visualizing Slab List Performance Using DTrace and Gnuplot

NOTE: To view the visualizations in this post at a higher resolution, use the ‘view image’ option in your browser. Sadly the images have been scaled down from their original resolution.

As has been pointed out by the DTrace engineers, the single greatest
shortcoming of the traditional DTrace command line interface is its limited
ability to engage the visual cortex. Many patterns simply aren’t visible in
large amounts of data. The numbers are there, of course, but it is difficult
for the human brain to process tabular data, even if it is neatly quantized.

After failing to get new insights into slab list performance from many
elaborate D scripts, I took a long step back and asked myself: “What would the
DTrace engineers do?”

At that moment, I imagined people in white lab coats walking around a Joyent
data center, in between rows of towering computers, much in the same way that
people would walk around the engineering section of the Starship Enterprise.
I’d see a Joyent engineer swiping and dragging something on a tiny palm-size
touch screen interface that was integrated into the case of one of these
servers. These servers, by the way, would be extremely quiet, almost silent,
with soothing classical music playing from every direction.

A final swipe on the touch screen, and an elaborate, rainbow-colored heat map
would appear on a very large central display. Some of the engineers would stop
what they were doing, take note of the visualization, and then proceed to write
some code to fix a performance pathology that would not have been visible
without Cloud Analytics.

And then I snapped out of it. I didn’t have Analytics or Fishworks available to
me. I just had DTrace. But DTrace really is the magic sauce. All I needed was a
way to turn a large quantity of tabular data into a png that I could load in a
browser.

A quick Google search revealed that Gnuplot was a decent tool for the job, so I
went with it.

The first step is to generate the data. I wanted to examine insertion
performance, so I used the following script to record how long each insertion
took. The first column of its output is the number of elements in the slab list
at the time of insertion, and the second column is the elapsed time in
nanoseconds.

/* add_trace.d */
#pragma D option quiet

slablist$target:::add_begin
{
    /* record the start time and the list size at the time of insertion */
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

slablist$target:::add_end
{
    /* print one line per insertion: element count, elapsed time in ns */
    self->ets = vtimestamp - self->ts;
    printf("%d %d\n", self->elems, self->ets);
}

We use vtimestamp instead of timestamp so that DTrace doesn’t factor in the
time it takes to execute actions like self->elems = args[0]->sli_elems.

So we run the following on the command line:

pfexec dtrace -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out

It inserts more than 1.4 million random values into a slab list.

We are collecting 1.4 million data points. You might ask: why don’t we just
average everything? The simple reason is that an average insertion time is
really just an attempt to summarize a very complex and transient phenomenon as
a single number, and that single number doesn’t give us much detail. You might
then ask: well, why don’t we just quantize the data? The reason is similar.
Quantizing the data with DTrace gives a more detailed summary of the complex
and transient phenomenon; however, it is still a summary, and a large amount of
fine detail is lost. The most detailed data would be a full trace of every
single insertion. Unfortunately, it is difficult for humans to process 1.4
million data points in a table. But if we plot that data and look at the graph,
we can take in a large amount of fine detail at a glance.
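
For contrast, the quantize-based summary is only a few lines of D. Here is a
minimal sketch using the same probes; it produces a power-of-two histogram of
insertion times rather than the per-insertion trace we actually want to plot:

/* a quantize()-based summary, for comparison only */
#pragma D option quiet

slablist$target:::add_begin
{
    self->ts = vtimestamp;
}

slablist$target:::add_end
/self->ts/
{
    /* aggregate insertion times into a power-of-two histogram */
    @["insertion time, ns"] = quantize(vtimestamp - self->ts);
    self->ts = 0;
}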

Once we have generated the data, we can proceed to graph it with gnuplot, using
the following gnuplot script:

# gp0.p
set terminal png size 2160,1920
set output "pngfile"
plot "out" using 1:2 title 'insertion time, ns'

This generates a 2160×1920 png image that is viewable in Firefox.

This is an interesting image. First, we see a solid red line at the bottom,
indicating some really fast insertions. We also see a line of outliers that is
present at pretty much any slab list size. And, most alarmingly, we see that
after about 800,000 elements have been inserted there is very erratic outlier
activity. These outliers are not reflected in the averages, but they are very
apparent on the graph.

The questions now are: what is causing these erratic outliers, and what is
causing the steady stream of outliers just above the bottom-most solid line?

These erratic outliers aren’t trivial. They are huge spikes of more than 100
microseconds. What could be causing them? I suspect it is either memory
allocation latency or reaping latency. Reaping is slab list terminology for
collecting slabs that are not strictly needed. For example, if we have the
following slabs:

    [1|2|3| | | ] <--> [4|5| | | | ] <--> [6|7|8|9| | ]

Of the above three slabs, we only need two. We can “reap” the extra slab by
first shifting all of the elements to the left:

    [1|2|3|4|5|6] <--> [7|8|9| | | ] <--> [ | | | | | ]

We can then reap the last slab.

We reap slab lists depending on a minimum capacity ratio that the user
provides. For example, if the user provides a ratio of 70/100, we start reaping
as soon as the list falls below this ratio. For illustration, the list above had
a capacity ratio of 50/100: 9 elements spread across 3 slabs that can hold 18.

We modify the previous DTrace script to look like this:

#pragma D option quiet

dtrace:::BEGIN
{
    self->rets = 0;
}

slablist$target:::add_begin
{
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

pid$target::try_reap_all:entry
{
    self->rts = vtimestamp;
}

pid$target::try_reap_all:return
{
    self->rets = vtimestamp - self->rts;
}

slablist$target:::add_end
{
    self->ets = vtimestamp - self->ts;
    printf("%d %d %d\n", self->elems, self->ets, self->rets);
    self->rets = 0;
}

This adds reap durations as a third column. We also bump the principal buffer
size to 128 megabytes, so that the extra printf output is less likely to be
dropped:

pfexec dtrace -b 128m -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out2

We modify the gnuplot script to look like this:

# gp1.p
set terminal png size 2160,1920
set output "pngfile2"
plot "out2" using 1:2 title 'insertion time, ns', "out2" using 1:3 title 'reap time, ns'

We run gnuplot:

gnuplot gp1.p

And we get the following graph:

As you can see, the erratic outliers are correlated with reap activity.

What about the non-erratic outliers? I’ve already suggested that they might be
caused by memory allocation activity. Let’s test this theory.

The modified D script adds a fourth column:

#pragma D option quiet

dtrace:::BEGIN
{
    self->rets = 0;
    self->mets = 0;
}

slablist$target:::add_begin
{
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

pid$target::mk_slab:entry
{
    self->mts = vtimestamp;
}

pid$target::mk_slab:return
{
    self->mets = vtimestamp - self->mts;
}

pid$target::try_reap_all:entry
{
    self->rts = vtimestamp;
}

pid$target::try_reap_all:return
{
    self->rets = vtimestamp - self->rts;
}

slablist$target:::add_end
{
    self->ets = vtimestamp - self->ts;
    printf("%d %d %d %d\n", self->elems, self->ets, self->rets, self->mets);
    self->rets = 0;
    self->mets = 0;
}
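
We run it the same way as before (assuming the script is still saved as
add_trace.d), redirecting the output to a new file:

pfexec dtrace -b 128m -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out3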

The modified gnuplot script:

# gp2.p
set terminal png size 2160,1920
set output "pngfile3"
plot "out3" using 1:2 title 'insertion time, ns', "out3" using 1:3 title 'reap time, ns', \
"out3" using 1:4 title 'alloc time, ns'

And the resulting graph:

As you can see, the outliers are positively correlated with memory allocation
activity.

Further tracing revealed that we did a lot of reaping to save only a handful
of slabs. In other words, we were sacrificing a lot of CPU time to save meager
amounts of memory. The key problem is that the user was only able to specify
the minimum capacity ratio of a slab list. The user would have been better
served by being able to specify two factors:

minimum number of slabs to collect per reap
minimum percentage of slabs to collect per reap

For example, the user can specify a minimum number of 40 and a minimum
percentage of 30: each reap then recycles at least 40 kilobytes, and at least
30% of the slabs in the list. The code implementing this change was introduced
a few days ago.
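
As a quick sanity check on the new policy, we can reuse the try_reap_all pid
probes from earlier and let DTrace summarize the reaps for us. A minimal
sketch, run against the same drv_rand_ops workload, that counts reaps and
quantizes their durations:

/* a sketch for checking reap frequency and duration */
#pragma D option quiet

pid$target::try_reap_all:entry
{
    self->rts = vtimestamp;
    @reaps["reaps"] = count();
}

pid$target::try_reap_all:return
/self->rts/
{
    @times["reap time, ns"] = quantize(vtimestamp - self->rts);
    self->rts = 0;
}

DTrace prints both aggregations when the command exits.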

Here is a graph of what the insertion pattern looks like now (allocation and
reap times are also shown).

There are no more erratic outliers resulting from reaps, though there are
still outliers resulting from memory allocations (a problem for another day).
