One Year of SmartOS

NOTICE: This post has moved to a new blog that’s exclusively about SmartOS Administration.


Yet Another Draft

This is an update to my visual data structure comparison using DTrace.

I’ve replaced the original crusty Gnuplot based plotting system with an R based one. I’ve condensed the information to a handful of graphs to make it more digestible. I’ve also made the document itself look nicer, gave it more structure than before, added a table of contents, and refined some of the ideas in it.

There still a long way to go, like measuring fold performance, adding sequential and filtering structures and finding a way to measure stack growth.

This is a work in progress, and is not yet conclusive. Click on the link below to read the pdf.


Visualizing Slab List Performance Using DTrace and Gnuplot

NOTE: To view the visualizations in this post at a higher resolution, use the ‘view image’ option in your browser. Sadly the images have been scaled down from their original resolution.

As has been pointed out by the DTrace engineers, the single greatest short
coming of the traditional DTrace command line interface is its limited ability
to engage the visual cortex. Many patterns simple aren’t visible for large
amounts of data. The numbers are there, of course, but it is difficult for the
human brain to process tabular data, even if it is neatly quantized.

After failing to get new insights into slab list performance by using many
elaborate D scripts, I took a long step back and asked myself: “What would the
DTrace engineers do”?

At that moment, I imagined people in white lab coats walking around a Joyent
data center, in between rows of towering computers, much in the same way that
people would walk around the engineering section of the Starship Enterprise.
I’d see a Joyent engineer swiping and dragging something on a tiny palm-size
touch screen interface that was integrated into the case of one of these
servers. These servers, by the way, would be extremely quiet, almost silent, as
soothing classical music plays from every direction.

A final swipe on the touch screen, and an elaborate, rainbow-colored, heat map
would appear on a very large central display. Some of the engineers would stop
what they are doing, take note of the visualization, and then proceed to write
some code that would fix some performance pathology that would not have been
visible without Cloud Analytics.

And then I snapped out of it. I didn’t have Analytics or Fishworks available to
me. I just had DTrace. But DTrace really is the magic sauce. All I needed was
code that could output a large quantity of tabular data into a png that I could
load in a browser.

A fast google search revealed that Gnuplot was a decent tool for the job, so I
went with that.

The first step is to generate the data. I wanted to examine insertion
performance, so I used this script to record how long each insertion took. The
first column is the number of elements in the slab list at time of insertion,
and the second column is the time elapsed in nanoseconds.

/* add_trace.d */
#pragma D option quiet
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;

    self->ets = vtimestamp - self->ts;
    printf("%d %d\n", self->elems, self->ets);

We use vtimestamp instead of timestamp so that DTrace doesn’t factor in the
time it takes to execute actions like self->elems = args[0]->sli_elems.

So we run the following on the command line:

pfexec dtrace -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out

It inserts more than 1.4 million random values into a slab list.

We are collecting 1.4 million data points. You might ask, why don’t we just
average everything? The simple reason is that an average insertion time is
really just an attempt to summarize a very complex and transient phenomenon as
a single number. This single number doesn’t give us much detail. You might ask,
well, why don’t we just quantize the data? The reason is similar. Quantizing
the data with DTrace, gives a more detailed summary of the complex and
transient phenomenon — however, it is still a summary, and large amount of
fine detail is lost. The most detailed data would be a full trace of every
single insertion. Unfortunately, it is difficult for humans to process 1.4
million data points in a table. But if we plot that data, and observe the
graph, we would be able to see large amount of fine detail at a glance.

Once we have generated the data, we can proceed to graph it with gnuplot, using
the following gnuplot script:

# gp0.p
set terminal png size 2160,1920
set output "pngfile"
plot "out" using 1:2 title 'insertion time, ns'

This generates a 2160×1920 png image that is viewable in firefox.

This is an interesting image. First we see that there is a solid red line at
the bottom indicating some really fast insertions. We also see that there is a
line of outliers that is there for pretty much any slab list size. And most
alarmingly, we see that after 800000 elements have been inserted there is very
erratic outlier activity. These outliers are not reflected in the averages, but
they are very apparent on the graph.

The questions now are: what is causing these erratic outliers, and what is
causing the steady stream of outliers right above the bottom-most solid line.

These erratic outliers aren’t trivial. They are huge spikes of more than 100
microseconds. What could be causing them? I suspect it’s either memory
allocation latency or reaping latency. Reaping is slab list terminology for
collecting slabs that are not strictly needed. For example if have the
following slabs:

    [1|2|3| | | ] <--> [4|5| | | | ] <--> [6|7|8|9| | ]

Of the above three, we only need two. We can “reap” the extra slab by shifting
all of the elements to the left.

    [1|2|3|4|5|6] <--> [7|8|9| | | ] <--> [ | | | | | ]

We can then reap the last slab.

We reap slab lists, depending on a minimum capacity ratio that the user
provides. For example, if the user provides a ratio of 70/100, we start reaping
as soon as we go below this ratio. For illustration, the above list had a
capacity ratio of 50/100.

We modify the previous DTrace script to look like this:

#pragma D option quiet

    self->rets = 0;

    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;

    self->rts = vtimestamp;

    self->rets = vtimestamp - self->rts;

    self->ets = vtimestamp - self->ts;
    printf("%d %d %d\n", self->elems, self->ets, self->rets);
    self->rets = 0;

This adds reap durations as a third column.

pfexec dtrace -b 128m -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out2

We modify the gnuplot script to look like this:

# gp1.p
set terminal png size 2160,1920
set output "pngfile2"
plot "out2" using 1:2 title 'insertion time, ns', "out2" using 1:3 title 'reap time, ns'

We run gnuplot:

gnuplot gp1.p

And we get the following graph:

As you can see, the erratic outliers are correlated with reap activity.

What about the non-erratic outliers? I’ve already suggested that it might be
memory allocation activity. Lets test this theory.

The modified D script adds a fourth column:

#pragma D option quiet

    self->rets = 0;
    self->mets = 0;

    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;

    self->mts = vtimestamp;

    self->mets = vtimestamp - self->mts;

    self->rts = vtimestamp;

    self->rets = vtimestamp - self->rts;

    self->ets = vtimestamp - self->ts;
    printf("%d %d %d %d\n", self->elems, self->ets, self->rets, self->mets);
    self->rets = 0;
    self->mets = 0;

The modified gnuplot script:

# gp2.p
set terminal png size 2160,1920
set output "pngfile3"
plot "out3" using 1:2 title 'insertion time, ns', "out3" using 1:3 title 'reap time, ns', \
"out3" using 1:4 title 'alloc time, ns'

And the resulting graph:

As you can see, the outliers are positively correlated with memory allocation

Further tracing revealed that we did a lot of reaping, to save only a handful
of slabs. In other words we were sacrificing a lot of cpu resources to save
meager memory resources. The key problem, is that the user was allowed to
specify the minimum capacity of a slab list. The user would have been better
served if he could specify two factors:

minimum number of slabs to collect per reap
minimum percentage of slabs to collect per reap

For example, the user can specify a minimum number of 40, and a minimum
percentage of 30. Each reap recycles at least 40 kilobytes, and at least 30% of
the slabs in the list. The code implementing this change was introduced a few
days ago.

Here is a graph of what the insertion pattern looks like now (allocation and
reap times are also shown).

There are no more erratic outliers, resulting from reaps, though there are
still outliers resulting from memory allocations (a problem for another day).

Introducing Slab Lists

UPDATE: Check out the new paper on Slab Lists. It’s still in the works, but is more detailed and more up to date than this post. It also has detailed performance comparisons with AVL Trees, Skip Lists, and B-Trees.

Data Structures Overview

Since the dawn of the computer age, computers have served one unchanging
function: to store raw data, carry out logical and arithmetic operations on
this data, and to present this data as information to user. This is the central
problem that has created billion-dollar industries and companies that try to
solve it with advanced file-systems, memory allocators, databases, compression
algorithms, and so forth. All of these domains share one thing in common and
that is that they use a variety of containers known as data structures to
assist them in solving their problems. Here I present a new, memory and
cpu-efficient data structure: the slab list.

The slab list is a parallel list data structure that uses the notion of
slabs introduced in the slab allocator, to efficiently store and do
operations on data. The slab list’s closest relative (that I was able to find),
is likely the unrolled linked list, which tries to settle for a middle ground
between linked lists and arrays.

The slab list is designed with systems programming in mind, and as such its
space and time performance is dependant on the quality of the memory allocator
and thread library used.

This is an introductory post to the data structure. It describes its
implementation in broad strokes. Future posts will drill into some of the
gritty, grimey, and critical implementation details, that make slab lists
practical to use. When you come back, be ready to read some code.

At the moment libslablist is still being tested and developed. It is not yet
production quality. But feel free to use it, hack it, and check it out.


In 1993 Jeff Bonwick of Sun Microsystems introduced his innovative slab
in a USENIX paper. The slab allocator made memory allocation more
efficient while decreasing fragmentation. The idea was soon adopted by almost
every modern operating system.

The slab allocator had the slab as the central data structure. It was a 4K
region of memory (at least), that would hold objects of equal size. The slabs
would be stored in a doubly linked list. The 1993 paper (and the 2001 sequel to
that paper) refered to this doubly linked list of slabs as a slab list.

I read these papers in 2006 and the slab list was vaguely interesting, and
certainly practical for use in the allocator, but it did not warrant deeper
inspection. It was, after all, a mere linked list.

Fast forward to 2012. I was working on a project that had to store some of its
data in a doubly linked list, and some of its data in an AVL tree. The program
then had to store 200 million nodes. This would have been trivial on a high-end
server. However, the target platform was a consumer laptop with 4G of memory.
The system was unable to stay responsive under heavy load due to swapping,
and was unable to hold all of the data in RAM anyway.

At first, I thought, some other process must have been consuming a lot of RAM,
or that my code had a memory leak. The reason I thought so is that the program
was storing 64-bit integers and doubles. That’s 8 bytes of data. Two hundred
million nodes, is 1.6 billion bytes. Which is around 1.5 gigabytes. This should
have fit in 4GB RAM. The reason it didn’t is that a doubly linked list has 16
bytes of overhead, due to the pointers. Making a single node worth 24 bytes,
instead of 8 bytes. And an AVL tree has 24 bytes of overhead, making a single
node worth 32 bytes instead of 8.

After I identified the source of my anguish, I sent a probe down into my brain,
hoping it would find a solution. I was surprised by what was sent back.

The slab list. It was not an obvious choice. It didn’t have a general purpose
implementation (the slab allocator’s code is highly specialized and low level,
even for C code). It wasn’t tested in any context outside of slab allocation.
There were many unknowns. What was the ideal slab size? Was it possible to
access data in logarithmic time? How would it interact with the memory
allocator? How would the performance compare with an AVL tree? Would shifting
elements in a slab on every addition or removal have negligible impact on
performance? Or would it make slab lists unusable? How long would it take to

The most obvious data structure to select from my mental catalog, would have
been the B Tree. It has been in use for decades, and has multiple open source
implementations scattered about the net. The B Tree was the proven solution.

But the novelty of a non-tree data structure that was memory efficient, had
logarithmic time complexity, and could potentially (and easily) be parallelized
was far too exciting and compelling. Once I started thinking about it and
implementing it I couldn’t stop — didn’t want to stop. I put the initial project
on the back-burner, as I began to focus on a slab list implementation.

It took two and a half months to complete.

Basic Constraints

The basic constraints for slab lists were:

  • Had to run on Illumos (at least).

  • Potentially take advantage of multicore processors.

  • Store data much more efficiently than any linked list.

  • Insert and remove data quickly.

  • Keep data sorted (if necessary).

  • Preserve insertion order (if necessary).

  • Run the test suite in situ.

  • Potentially map functions to a list in parallel.

  • Potentially fold functions over a list in parallel.

Any of the potential constraints above, didn’t have to be implemented on first
release, or ever. They just had to be possible should the need for those
features arise.

Overview and Comparison

The slab list is designed for configurability and maximum memory efficiency,
while maintaining the semantics of inserting, removing, and retrieving
elements to/from a list or collection. We aim to leverage the advantages of
both doubly linked lists and arrays.

The main structure of the slab list, is the slab. It is a structure that can
hold 952 bytes of data plus some metadata. The metadata is 66 bytes in size
(80 bytes with compiler-inserted padding). This makes the slab exactly 1K in
size, meaning that we can fit 4 slabs on one page. The 952 bytes of data is
carved into 8-byte elements, meaning that each slab can hold 119 elements.

Elements can be any 8-byte sequence (64-bit integers, 64-bit double
precision floating point numbers, 64-bit pointers, and so on).

The data to total-data ratio per slab is 952/1024 which is 93% efficiency.
Increasing the number of elements per-slab increases the efficiency, but
decreases performance. Empirically, the 1K slab offers the best performance
improvement per percentage of memory efficiency sacrificed.

To get a more visual perspective of what the ratio between data and metadata
in a slab is look at the diagrams below:

In the slab list ASCII diagram below, the stars and arrows are pointers.
“MD” is meta data, that are not next and prev pointers.

[*|MD|////952b////|*] -><- [*|MD|////952b////|*]-><-[*|MD|////952b////|*]

If we zoom into one of the slabs:

 *              Legend:
 *                      nextslab: ptr to the next slab (8 bytes)
 *                      prevslab: ptr to the prev slab (8 bytes)
 *                      slistptr: ptr to slab list slab belongs to (8 bytes)
 *                      NE: number of elements in slab (2 bytes)
 *                              ^ the compiler may pad this to be 8 bytes
 *                      maxvalue: largest element in the slab (8 bytes)
 *                      minvalue: smallest element in the slab (8 bytes)
 *                      rest: user-given data (952 bytes)
 *   base of slab structure --->  +--------------------------+----------+
 *                                | mutex structure 24 bytes | nextslab |
 *                                +----------+----------+----+----------+
 *                                | prevslab | slistptr | NE | maxvalue |
 *                                +----------+----------+----+----------+
 *                                | minvalue | 952 bytes of user data   |
 *                                +----------+--------------------------+
 *                                |                ||                   |
 *                                |                ||                   |
 *                                |                \/                   |
 *                                :                                     :
 *                                .                                     .
 *                                .                                     .
 *                                .                                     .
 *                                :                                     :
 *                                |                                     |
 *     end of slab structure--->  +-------------------------------------+

Every slab has a mutex structure. This structure is needed to support locking
concurrency. Currently, slab lists are single-threaded, but support for
multiple threads is in the pipeline.

As can be seen by the diagram above, user data is an overwhelming portion of
the entire slab.

Compare slab lists to a doubly linked list. Each node has 2 pointers and an
8-byte element which is 24-bytes per node.

The data to total-data ratio per node is 8/24 which is 33% efficiency.

The following two diagrams illustrate why this is so:

[*|8b|*] -><- [*|8b|*]-><-[*|8b|*]

Zooming in on one of the doubly linked list nodes:

 *      base of node ------> +----------+
 *                           | prevnode | (8 bytes)
 *                           +----------+
 *                           | nextnode | (8 bytes)
 *                           +----------+
 *                           | userdata | (8 bytes)
 *      end of node ------>  +----------+

As you can see userdata is 1/3 of the whole node.

Compare slab lists to the Illumos kernel’s AVL Tree implementation, where
the metadata per node is 24 bytes in size, and we have 8/32 which is 25%

The following diagrams illustrate why this is so.

In the AVL tree diagram below, MD is meta data which takes up 16 bytes.

 *                        [*|MD..|8b|*]
 *                        /           \
 *                       /             \
 *           [*|MD..|8b|*]             [*|MD..|8b|*]

Zooming in on one of the AVL nodes:

 *                               +-----------+
 *                               | lf child  | (8 bytes)
 *                               +-----------+
 *                               | rt child  | (8 bytes)
 *                               +-----------+
 *                               |  avl_pcb  | (8 bytes)
 *                               +-----------+
 *                               |  userdata | (8 bytes)
 *                               +-----------+

The avl_pcb is 64 bits containing:

pointer to the parent (60 bits),

a bit indicating if the node is left or right of the parent,

and a balance value (2 bits).

Zooming in on avl_pcb:

 *     |63                               3|        2        |1          0 |
 *     |----------------------------------|-----------------|-------------|
 *     |   avl_parent hi order bits       | avl_child_index | avl_balance |
 *     |----------------------------------|-----------------|-------------|

As can be seen in the second AVL diagram, userdata is 1/4 of the node, which
is indeed 25%.

To summarize the memory-efficiency of each data strucure’s node, from best
to worst:

 *              Slab list:              93%
 *              Doubly linked list:     33%
 *              AVL Tree:               25%

This is all good and well, when a slab is full. However partially full
slabs could end up having worse memory efficiency per slab than both doubly
linked lists and AVL trees.

In order to guarantee a minimal memory efficiency of 50%, we transparently
use a singly-linked list until the number of elements in a slab list reaches
60 (1/2 of a slab). After we reach 1/2 of a slab we copy all of the
elements into a new slab, and continue using slabs thereafter.


Each slab is ranged. That is, it knows what its maximum and minimum elements

The usefulness of the range depend on the type of the slab list. A slab
list can be either sorted or ordered. A sorted slab list keeps all of the
elements sorted (according to a user-given comparison function), whereas an
ordered slab list merely inserts each new element to the end of the list.

Ranges on slabs are useful for sorted slab lists, as we can use it to
traverse the list 119 times more quickly than a linked list, when searching
for an element. When we get to a slab that may contain an element, we try
to find the index of the element via binary search.


If the slab list exceeds a certain number of slabs (user-defined), we create
a sublayer for that slab list. A sublayer is an internal slab list that
contains pointers to slabs as elements. All of the slab-pointers are sorted,
and the sublayer’s slabs are also ranged. A sublayer is only created for
sorted slab lists. Slabs in a sublayer are called sub-slabs. Superslabs are
subslabs that are refered to from a lower subslab. Slabs at the highest
superlayer (slabs that contain the user’s data) are simply known as slabs,
or superslabs from the perspective of the highest sublayer. The lowest
sublayer is called the baselayer.

 *    slabs from superlayer -------->  [ ][ ][ ][ ][ ][ ][ ][ ] .....
 *                                      ^  ^  ^  ^  ^  ^  ^  ^
 *                                      |  |  |  |  |  |  |  |
 *                                      |  |  |  |  |  |  |  |
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *    a slab from sublayer ----> |.....|* |* |* |* |* |* |* |* |......|
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *                           metadata   slab elems (slab ptrs)

This sublayering of slabs gives us a structure that looks like an inverted

 *        highest superlayer --->   _________
 *                                  \       /
 *                                   \     /
 *                                    \   /
 *        baselayer -------------->    \_/

If the maximum number of slabs the baselayer can have is N, then it
can have at most N*119 elems. If we have only 1 sublayer attached to the
slab list, then we have N*119 elems. If we have two we have N*(119^2)
user-inserted elems. If we have 3, N*(119^3) elems. And so on.

Slablists are limited to a maximum of 8 sublayers. With a completely full
slab list, of 8-byte integers, we would consume 0.365 zettabytes of RAM.
That exceeds the largest hard drives on the marked today (2012), let alone
any amount RAM one could cram into computer.

If we wanted to find a slab with element E, and our slab list has
sublayers, we 1) go to the baselayer, 2) we do a linear search on that
sublayer, until we find a slab with the range into which E could belong

The slab that has been found at the baselayer, contains pointers to slabs in
the superlayer. We use binary search find the slab in the superlayer into
which E could belong to. We keep doing this repeatedly until we get to the
highest superlayer. The highest superlayer is the layer that contains data
that the user gave it.

The use of sublayers of slab lists, and binary search gives a search time
comparable to balanced binary trees (that is, logarithmic time).

Information specific to individual slab lists is encoded in the slabist_t

The user can specify the minimum capacity ratio (number of elems in entire
slab list, to number of possible elems [119*nslabs]), name, slabs for
sublayer, object size, and comparison function.

If a slab is added to or removed from the superlayer, we have to ripple the
changes down to the sublayer by removing or adding a reference to the slab.
If that sublayer, gains or loses a slab as a result of the previous ripple,
we have to ripple that change down to the next sublayer, and so on.

DTrace Probes / Test Suite

This slab list implementation has a profusion of DTrace probes. There are
two kinds of probes: tracing probes and testing probes.

Tracing probes are ordinary DTrace probes that notify the user of events
within libslablist.

Testing probes are DTrace probes that, when enabled, trigger tests that
verify the correctness of the slab list data structure after each
modification, and report failure or success via dtrace. All testing probes
have a test_* prefix.

Tests are driven by driver programs that barrage slab lists with random data
(integers and variable length strings).

The probes and tests will be expanded upon in future posts.

Memory Allocation

Slab lists use the libumem slab allocator available on Illumos, but can be
compiled against any libc malloc implementation. This is acheived by
abstracting away the allocation code.

Instead of doing:

slab_t s = malloc(sizeof(slab_t));


slab_t s = umem_cache_alloc(cache_slab, UMEM_NOFAIL);

we do:

slab_t s = mk_slab();

Where mk_slab calls the appropriate allocation function, depending on the
compile time options.

Code Organization

See src/slablist_provider.d for all of the slablist probes (testing and

See src/slablist_add.c and src/slablist_rem.c for the addition and removal

See src/slablist_find.c for search algorithms used.

See src/slablist_umem.c and src/slablist_malloc.c for the specifics behind
memory allocation.

See src/slablist_test.c for the test functions. They are only called when a
corresponding dtrace test-probe is enabled. All telemetry is consumed via
dtrace probes.

See src/slablist_provider.d for comments regarding individual probes. See
slablist_test.c for test function implementation.

See src/slablist.d for translators used by the probes.

See build/build_lib for build script implementation. If you have primary
admin access on Illumos you can run ./build_install_all. If not, you’ll
figure something out.

See drv/ for various driver programs. drv/drv_rand_ops executes an arbitrary
number of random insertions and removals, on 4 kinds of slab lists (sorted
int and str list, and ordered int and str list).

See tests/ for various test D scripts.

See tools/ for various tool D scripts.

Illuminating Lucene with Illumos

Intuition is the clear conception of the whole at once.
– Johann Kaspar Lavater

Every once in a while one comes across a program of such monstrous complexity,
that understanding by mere code-reading is a privilege reserved only for the
most persistent and patient of people.

Lucene is one such program, that I have to use at school for various projects.
The sheer size of the code-base is a massive barrier to understanding, which is
only exponentiated by the fact that I program mostly in C, and am merely
competent (if that) in Java.

Continue reading

A Good Post About SmartOS

WordPress’s reblogging facility isn’t advanced enough to figure out that I reblogged something to the wrong blog, and would like to move it. So I have to resort to far more crude, but effective, method of sharing this excellent introductory post:

Discovering SmartOS

Introductory and informative blog posts like this one are needed to take SmartOS (and other Illumos-based tech) beyond the murky, mysterious realm of kernel engineering, and into the intensely pragmatic realm of common system administration.

Lex Provider Update

A while ago, I wrote about the DTrace provider I made for lex, the lexical
analyzer that ships with Illumos.

I added one new probe to it, that solves a simple but particularly vexing
issue: tracing and aggregating on the regular expression used to match a token
instead of the actual token or the returned token-id. This new probe is called

The returned token-id (value returned by yylex()) would only be adequate in
identifying the most commonly matched regular expressions, if yylex()
necessarily returned on every single match. It doesn’t.

Continue reading