One Year of SmartOS

NOTE: you can also download this post in PDF format if you’re into that.

1  Introduction

I’ve been using SmartOS for one year as my main development OS, and I want to
share my experiences because SmartOS is a very atypical system and as such it
requires an atypical style of usage. Well, atypical for me. Unlike all of the
generations of engineers that came before me, I’ve been using Linux,
OpenSolaris, and then OpenIndiana on stand-alone consumer hardware — desktops
and laptops. This translates into me being the sole user of a system
whose usefulness is more or less independent of network connectivity. Having
this kind of monopoly leads to very bad habits such as messing with global
configs, installing custom software directly into /bin, storing
personal files in /rpool/docs, not using Zones whenever possible, and
so on. All in all, a very messy situation that doesn't bother anyone but
should bother me. Naturally, I've been slow to change my heathen ways.
In fact I didn’t change these habits until I was forced to by deciding to use
SmartOS as my main development OS (yeah, it’s that compelling).


2  Hardware

First things first: you probably won't be able to run it on your desktop or
laptop. It's not that it can't theoretically run on those systems
— it's very lean — it just has some problems with firmware and drivers and
the like. As such it requires a very specific brand of motherboard:
Supermicro. Supermicro boards are server boards, which means you can
only install SmartOS on a server. As I said, I never had the need for a server,
so I had to part with some cash to get one from eBay. My current hardware is at
the bottom of this post. Another thing to keep in mind is that the usability of
your SmartOS server is directly correlated with its network connectivity. That's
kind of a downer, since the server won't always be reachable, but when it is, it's
reachable from potentially any device that is connected to the
internet — yes, I've used SmartOS from my phone.


3  SmartOS Basics

The other thing is that SmartOS is designed to be a hypervisor — a host for
virtual machines. Most OSes can be such hosts in addition to being “regular”
operating systems — like running VMWare or Parallels on top of Mac OS X.
SmartOS can only be used to host VMs and nothing more. It's like being able to
run VMWare on the Mac, but not being able to use the Mac for anything else —
everything you do happens in one virtual machine or another.

The SmartOS host (also known as the global zone) runs entirely in RAM
and is effectively read-only. Areas like /bin, /etc, and /usr
cannot be changed at runtime; the only way to alter them is to compile a custom
SmartOS live USB. The only areas that can be changed are /opt and
/usbkey. The latter holds the configuration for SmartOS while the
former can be used to install optional software. If you need anything else to
persist on the host, /opt is where it has to go. It's typically used to start
custom services during boot, though I've also seen people use it to install
Xorg.
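
For example, to have the global zone start a custom service at boot, the usual
approach (as far as I can tell) is to drop the service's binary and an SMF
manifest under /opt/custom, which SmartOS imports on startup. The file names
here are hypothetical:

mkdir -p /opt/custom/bin /opt/custom/smf
cp mydaemon /opt/custom/bin/mydaemon          # the (hypothetical) service binary
cp mydaemon.xml /opt/custom/smf/mydaemon.xml  # its SMF manifest, imported at boot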

The main consequence of this kind of design is that the host is completely
insulated from the VMs. The host can be upgraded without breaking anything in
the VM itself. It also means that there is no root ZFS pool. To
OpenSolaris and Solaris veterans this is huge, because it allows you to use
RAID-Z on your system, where before you would have been constrained to
mirroring — unless you were forward-looking enough to use separate boot and
data drives.

The main challenge with SmartOS is figuring out how to separate your data from
the host system, and how to make this data available to a multitude of virtual
machines — many of which will need to be deleted and re-created. Furthermore,
one has to figure out how to efficiently restore those re-created virtual
machines to the usable state they were in before they were destroyed —
this mostly refers to installed packages, config files, and the home directory.
Also, one has to get used to hopping between various purpose-specific virtual
machines, instead of just sticking to a single machine.


4  Separating Your Static Data

As I said before, I used to store my data — which consists of a vast
collection of (legally obtained) PDF books, music files, personal software
projects, and historical performance data — as a bunch of subdirectories in
/rpool. This simply will not do in the new world order. The default
ZFS pool in SmartOS is the zones pool. To start off, in the global
zone, I created a non-conflicting directory hierarchy to hold all of my
data:

zfs create zones/depot
zfs set mountpoint=/depot zones/depot
chmod a+rwx /depot
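
To sanity-check the result before copying anything in, the usual zfs
subcommands work from the global zone:

zfs list -r zones/depot
zfs get mountpoint zones/depot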

I broadly divide my software activities into two categories: analysis
and synthesis. The analysis category consists of various software
analysis data and notes that are persistent and allow me to resume a line of
inquiry after a long break or distraction. Most of my source repositories have
infrastructure for collecting and logging performance (and debug) data.
However, this data will be overwritten the next time the framework is run, or
the next time make clean is executed. In order to have historical
performance data, the logs had to be dated and stored somewhere and thus the
analysis data set was born. The synthesis data set contains
everything that I create which is mostly software, blog posts, and essays. My
collection of ebooks and music was located in /rpool/dlib and
/rpool/music. This is what the transfer looked like from an
OpenIndiana machine to a SmartOS machine:

zfs send -R rpool/synthesis@snap-67 | ssh root@$smartos zfs recv -d zones
zfs send -R rpool/analysis@snap-23 | ssh root@$smartos zfs recv -d zones
zfs send -R rpool/dlib@snap-12 | ssh root@$smartos zfs recv -d zones
zfs send -R rpool/music@snap-44 | ssh root@$smartos zfs recv -d zones
zfs set mountpoint=/depot/synthesis zones/synthesis
zfs set mountpoint=/depot/analysis zones/analysis
zfs set mountpoint=/depot/dlib zones/dlib
zfs set mountpoint=/depot/music zones/music
chmod -R a+rwx /depot
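
If the source machine keeps changing after the initial copy, the same pipeline
works incrementally. For example (the snapshot name snap-68 is made up, and this
assumes the copy on the SmartOS side hasn't been modified since snap-67):

zfs snapshot -r rpool/synthesis@snap-68
zfs send -R -i @snap-67 rpool/synthesis@snap-68 | ssh root@$smartos zfs recv -d zones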

I also needed to separate my home files from the system, so I created a new
data set just for this purpose:

zfs create zones/synthesis/home_files
mkdir /depot/synthesis/home_files/skel

The skel directory contains all of the dot-files from the OpenIndiana
install. Any changes to those dot-files have to happen in that directory. Then a
script is used to deliver the files to a home directory. Keeping your files in
a VM is a bad idea since (a) destroying that VM will destroy those files too
(as well as the user that owns those files — in my case the user
nickziv) — and (b) sharing those files between VMs becomes less
convenient.

You will want to automate the creation of your user for new VMs as well as the
delivery of those files. Thankfully the useradd utility seems to have
had this use-case in mind. Here is a script that creates my user, delivers the
files, and gives me the administrative privileges (local to the guest VM only).
Note that this script only works on SmartOS VMs — more on those later.

#!/bin/ksh

# This script creates my user and copies all of the skel files and folders to
# the new user's home directory. It ignores .audioctl and .mozilla.

useradd -s /bin/ksh -b /home -m -k /depot/synthesis/home_files/skel \
    -P "Service Management","Software Installation","Primary Administrator" nickziv

The -k switch copies the files in skel to $HOME for that
user. Most administrators like to use NFS for this sort of thing, but I think
that it is overkill for my needs. In later sections we will see how we can save
our installed packages between VMs.
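
Since /depot is mounted into my OS VMs (see the manifest in the next section),
I can kick this script off from the global zone without logging in
interactively, because zlogin accepts a command to run inside the zone. This
assumes I keep the script next to the skel directory:

zlogin <vm-uuid> /depot/synthesis/home_files/create_nickziv.ksh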


5  Virtual Machines

Virtual machines are the unit of currency in SmartOS. You can create two kinds
of VMs: OS and KVM. In short, OS VMs are faster in every respect than KVM VMs,
but are less flexible: they can only run applications that run on Illumos, and
their system directories can’t be permanently changed (specifically
/etc, /bin, /lib, and /usr) — this is
because they share these directories with the global zone, which means all of
your OS VMs get upgraded at no cost when you upgrade your release of
SmartOS. This makes them ideal as sandboxes for work you would have previously
done in an Illumos machine’s global zone. KVM VMs are slower, and use more RAM
and disk, but they make up for it by allowing you to run any OSes (like Linux,
Windows, Haiku, FreeBSD, and Plan9) that you may need. This makes them ideal
for running legacy applications and applications that aren’t packaged for
SmartOS (but are packaged for, say, Debian or FreeBSD).

Both OS and KVM VMs are created from images. Images are base templates.
The two commands SmartOS provides for managing images and VMs are,
respectively, imgadm and vmadm. Generally, you use
imgadm to get images, while you use vmadm to turn them into
working VMs.

By default images are provided by Joyent, but you can specify other publishers.
To see which images are available for your use, just do the following:

[root@78-e7-d1-8c-9b-9e ~]#  imgadm avail

e65e0c4c-dc4b-11e3-b843-93672a0b57d8  cassandra       14.1.0  smartos  2014-05-15T16:13:49Z
bb41287c-e02a-11e3-aa6b-db678b82f5fc  java            14.1.0  smartos  2014-05-20T14:26:27Z
4b348706-e122-11e3-9cb4-9b2c5062255c  nginx           14.1.0  smartos  2014-05-21T19:58:35Z
1daa8e14-e1ba-11e3-bd59-a3971b58fa36  elasticsearch   14.1.0  smartos  2014-05-22T14:05:21Z
3ff5a554-e4ed-11e3-820d-33b5e236480b  qemu-kvm-1.1.2  1.2.2   QEMU     2014-05-26T15:49:02Z
e9b67338-e4ee-11e3-8668-c35a00c0e8dd  qemu-kvm-1.1.2  1.3.2   QEMU     2014-05-26T16:00:56Z
4b46e5e4-f17c-11e3-9da1-17ebb4120b97  standard64      14.1.1  smartos  2014-06-11T15:23:07Z
39a95440-f193-11e3-9ce4-332a0d637d3a  standard        14.1.1  smartos  2014-06-11T18:07:16Z
1eed39b6-f7cc-11e3-aacb-7bffef5bf8b6  mysql-cluster   14.1.0  smartos  2014-06-19T16:09:40Z
4db4df60-f880-11e3-a1c2-873cdb86240f  nodejs          14.1.1  smartos  2014-06-20T13:39:27Z

The list shows available images, which are uniquely identified by a UUID. You
can grab an image by executing the following:

imgadm import <uuid>
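
For example, grabbing the standard64 image from the listing above and then
confirming that it landed in the local image repository:

imgadm import 4b46e5e4-f17c-11e3-9da1-17ebb4120b97
imgadm list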

You can create a VM based off of an image using vmadm, which takes a
JSON manifest describing the VM you want. Below is one such manifest that I use to
create my development VM. You’ll notice that at the end of the file are
properties which tell vmadm to create a LOFS mount of the
/depot directory (which, keep in mind, is the root of ALL of my other
personal ZFS data sets). Using this config, you have to manage these datasets
using the zfs command from the global zone — trust me, this is the
simplest way.

If you’re like me and like having lots of VMs, you’ll want to have a place
where you can store all of your manifests. I created a ZFS dataset for this
purpose under zones/depot/manifests. I’m not sure why I didn’t put
this under the synthesis dataset, but whatever, this is what I ended
up with.

Also, if you use vim, you’ll want to save the manifests using a .js
extension instead of a .json extension, because vim only uses
highlighting for the former.

{
  "brand": "joyent",
  "image_uuid": "dc0688b2-c677-11e3-90ac-13373101c543",
  "hostname": "dev",
  "alias": "dev",
  "max_physical_memory": 8192,
  "quota": 5,
  "resolvers":
  [
    "8.8.8.8",
    "8.8.4.4"
  ],
  "nics":
  [
    {
      "nic_tag": "stub0",
      "ip": "10.0.2.0",
      "netmask": "255.0.0.0"
    },
    {
      "nic_tag": "admin",
      "ip": "172.16.60.10",
      "netmask": "255.240.0.0",
      "gateway": "172.16.0.1",
      "primary": "1"
    }
  ],
  "filesystems":
  [
    {
      "options": "rw",
      "type": "lofs",
      "source": "/depot",
      "target": "/depot"
    }
  ]
}
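
Once the manifest is saved (say as /depot/manifests/dev.js, to get the
vim-friendly extension mentioned above), feeding it to vmadm creates the VM,
and a filtered list confirms that the alias resolves to a UUID:

vmadm create < /depot/manifests/dev.js
vmadm list -o uuid,alias,state alias=dev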

You'll notice that you can give your VMs names (aliases). All of the
tools in SmartOS identify VMs using UUIDs, which are a pain to type.
So, if you choose to use unique aliases for each VM, you can write
wrapper-scripts that expand the aliases into UUIDs. I use the following script
to log in to OS VMs, instead of calling zlogin directly.

#!/bin/ksh

# This command logs into vm with alias $2 using user $1.
# zvmlogin nickziv dev

vmuuid=$(vmadm list -H -o uuid alias=$2)

zlogin -l $1 $vmuuid

It is stored in /opt/local/bin in the global zone, and is run like so:

zvmlogin nickziv dev

You’ll probably want to make such scripts for common VM-related things you do
in the global zone. The bash shell in the global zone can do tab-completion for
those UUIDs if you start typing them, but I find it unsatisfactory — it requires
listing the VMs and eyeballing the first 4 to 6 characters of the UUID.
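
For example, a small companion script (call it zvmboot; the name is mine) that
boots a VM by alias follows the same pattern as zvmlogin:

#!/bin/ksh

# This command boots the vm with alias $1.
# zvmboot dev

vmuuid=$(vmadm list -H -o uuid alias=$1)

vmadm boot $vmuuid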


6  Package Management In OS VMs

In an OS VM you install packages through the pkgin utility, which is a
wrapper over the cross-platform pkgsrc packaging infrastructure from
NetBSD. You search for a package using the search sub-command:

] pkgin search cairo
ruby200-rcairo-1.12.6nb1 Ruby bindings for cairo
ruby200-gnome2-cairo-gobject-2.0.2nb1 Ruby binding of cairo-gobject
ruby193-rcairo-1.12.6nb1 Ruby bindings for cairo
ruby193-gnome2-cairo-gobject-2.0.2nb1 Ruby binding of cairo-gobject
ruby18-rcairo-1.12.6nb1 Ruby bindings for cairo
ruby18-gnome2-cairo-gobject-2.0.2nb1 Ruby binding of cairo-gobject
py27-cairo-1.10.0nb1 = Python bindings for cairo
py26-cairo-1.10.0nb1 Python bindings for cairo
p5-cairo-1.103nb1 Perl bindings to the cairo graphics library
guile-cairo-1.4.0nb12 Guile wrapper for cairo
gst-plugins1-cairo-1.0.10 Open source multimedia framework – cairo plugin
gst-plugins0.10-cairo-0.10.31nb6 Open source multimedia framework – cairo plugin
goocanvas2-2.0.1nb2 Cairo-based canvas widget for GTK+3.0
goocanvas-1.0.0nb15 Cairo-based canvas widget for GTK+
glitz-0.5.6nb2 OpenGL 2D graphics library and a backend for gl output in cairo
cairomm-1.10.0nb9 C++ API for cairo
cairo-gobject-1.12.16 = Vector graphics library with cross-device output support
cairo-clock-0.3.3nb27 Analog clock drawn with vector-graphics
cairo-1.12.16 = Vector graphics library with cross-device output support

=: package is installed and up-to-date
<: package is installed but newer version is available
>: installed package has a greater version than available package
]

You install a package using the install subcommand, which can be
abbreviated to in.

] pfexec pkgin -y in p5-cairo
calculating dependencies… done
nothing to upgrade.
4 packages to be installed: pkg-config-0.28 p5-ExtUtils-PkgConfig-1.13nb2 p5-ExtUtils-Depends-0.304nb2 p5-cairo-1.103nb1 (377K to download, 1096K to install)

downloading packages…
downloading pkg-config-0.28.tgz: 0%
downloading p5-ExtUtils-PkgConfig-1.13nb2.tgz: 0%
downloading p5-ExtUtils-Depends-0.304nb2.tgz: 0%
downloading p5-cairo-1.103nb1.tgz: 0%
installing packages…
installing pkg-config-0.28…
installing p5-ExtUtils-PkgConfig-1.13nb2…
installing p5-ExtUtils-Depends-0.304nb2…
installing p5-cairo-1.103nb1…
pkg_install warnings: 0, errors: 0
reading local summary…
processing local summary…
updating database: 100%
marking p5-cairo-1.103nb1 as non auto-removable

The -y flag tells the utility to assume “yes” for all questions.
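
Two other pkgin subcommands worth knowing about are update and full-upgrade
(these are stock pkgin, not anything SmartOS-specific):

pfexec pkgin update           # refresh the list of available packages
pfexec pkgin -y full-upgrade  # upgrade every installed package that has a newer version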

I maintain a KSH script that installs all of the packages that I need for
development, in the home_files directory I mentioned above. It looks
like this:

#!/bin/ksh

# devpkgs.ksh
# This script installs the neccessary dev packages.

pfexec pkgin -y in gcc47

pfexec pkgin -y in postgresql93

pfexec pkgin -y in gnuplot

pfexec pkgin -y in R

pfexec pkgin -y in graphviz

pfexec pkgin -y in git

pfexec pkgin -y in python33

pfexec pkgin -y in py27-cairo

pfexec pkgin -y in xpdf


Whenever I create an OS VM for development, usually to replace an old one, I
run the create_nickziv.ksh script to create my user, and then I run
devpkgs.ksh to install all packages. Sometimes packages for certain
environments are managed outside of pkgsrc in an environment-specific
packaging system (like that used by R). You'll have to maintain separate
scripts for these, as I do for R (called install.R), which can be run
like so: pfexec Rscript install.R.

# install.R
install.packages("ggplot2")
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")

This is a far cry from a full-blown Chef or SaltStack deployment, but it's
worked for me so far, and I see no need at the moment to go with something more
complicated.


7  End

I hope that my experience is helpful to you. There are other SmartOS topics
that I’d like to discuss, but time is limited. Good night, and good luck.







Yet Another Draft

This is an update to my visual data structure comparison using DTrace.

I’ve replaced the original crusty Gnuplot based plotting system with an R based one. I’ve condensed the information to a handful of graphs to make it more digestible. I’ve also made the document itself look nicer, gave it more structure than before, added a table of contents, and refined some of the ideas in it.

There is still a long way to go: measuring fold performance, adding sequential
and filtering structures, and finding a way to measure stack growth.

This is a work in progress, and is not yet conclusive. Click on the link below to read the pdf.

vis_ds.pdf

Visualizing Slab List Performance Using DTrace and Gnuplot

NOTE: To view the visualizations in this post at a higher resolution, use the ‘view image’ option in your browser. Sadly the images have been scaled down from their original resolution.

As has been pointed out by the DTrace engineers, the single greatest
shortcoming of the traditional DTrace command line interface is its limited
ability to engage the visual cortex. Many patterns simply aren't visible for
large amounts of data. The numbers are there, of course, but it is difficult
for the human brain to process tabular data, even if it is neatly quantized.

After failing to get new insights into slab list performance by using many
elaborate D scripts, I took a long step back and asked myself: “What would the
DTrace engineers do?”

At that moment, I imagined people in white lab coats walking around a Joyent
data center, in between rows of towering computers, much in the same way that
people would walk around the engineering section of the Starship Enterprise.
I’d see a Joyent engineer swiping and dragging something on a tiny palm-size
touch screen interface that was integrated into the case of one of these
servers. These servers, by the way, would be extremely quiet, almost silent, as
soothing classical music plays from every direction.

A final swipe on the touch screen, and an elaborate, rainbow-colored, heat map
would appear on a very large central display. Some of the engineers would stop
what they are doing, take note of the visualization, and then proceed to write
some code that would fix some performance pathology that would not have been
visible without Cloud Analytics.

And then I snapped out of it. I didn’t have Analytics or Fishworks available to
me. I just had DTrace. But DTrace really is the magic sauce. All I needed was
code that could output a large quantity of tabular data into a png that I could
load in a browser.

A quick Google search revealed that Gnuplot was a decent tool for the job, so I
went with that.

The first step is to generate the data. I wanted to examine insertion
performance, so I used this script to record how long each insertion took. The
first column is the number of elements in the slab list at time of insertion,
and the second column is the time elapsed in nanoseconds.

/* add_trace.d */
#pragma D option quiet
slablist$target:::add_begin
{
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

slablist$target:::add_end
{
    self->ets = vtimestamp - self->ts;
    printf("%d %d\n", self->elems, self->ets);
}

We use vtimestamp instead of timestamp so that DTrace doesn’t factor in the
time it takes to execute actions like self->elems = args[0]->sli_elems.

So we run the following on the command line:

pfexec dtrace -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out

It inserts more than 1.4 million random values into a slab list.

We are collecting 1.4 million data points. You might ask, why don't we just
average everything? The simple reason is that an average insertion time is
really just an attempt to summarize a very complex and transient phenomenon as
a single number. This single number doesn't give us much detail. You might ask,
well, why don't we just quantize the data? The reason is similar. Quantizing
the data with DTrace gives a more detailed summary of the complex and
transient phenomenon — however, it is still a summary, and a large amount of
fine detail is lost. The most detailed data would be a full trace of every
single insertion. Unfortunately, it is difficult for humans to process 1.4
million data points in a table. But if we plot that data and observe the
graph, we can see a large amount of fine detail at a glance.
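
For contrast, here is roughly what the summary-only approach looks like (a
sketch using the same probes as add_trace.d, but with a quantize() aggregation
instead of per-insertion output):

pfexec dtrace -c './drv_rand_ops 1416100 intsrt' \
    -n 'slablist$target:::add_begin { self->ts = vtimestamp; }' \
    -n 'slablist$target:::add_end { @["insert time, ns"] = quantize(vtimestamp - self->ts); }'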

Once we have generated the data, we can proceed to graph it with gnuplot, using
the following gnuplot script:

# gp0.p
set terminal png size 2160,1920
set output "pngfile"
plot "out" using 1:2 title 'insertion time, ns'

This generates a 2160×1920 png image that is viewable in firefox.

This is an interesting image. First we see that there is a solid red line at
the bottom indicating some really fast insertions. We also see that there is a
line of outliers that is there for pretty much any slab list size. And most
alarmingly, we see that after 800000 elements have been inserted there is very
erratic outlier activity. These outliers are not reflected in the averages, but
they are very apparent on the graph.

The questions now are: what is causing these erratic outliers, and what is
causing the steady stream of outliers right above the bottom-most solid line?

These erratic outliers aren’t trivial. They are huge spikes of more than 100
microseconds. What could be causing them? I suspect it’s either memory
allocation latency or reaping latency. Reaping is slab list terminology for
collecting slabs that are not strictly needed. For example, if we have the
following slabs:

    [1|2|3| | | ] <--> [4|5| | | | ] <--> [6|7|8|9| | ]

Of the above three, we only need two. We can “reap” the extra slab by shifting
all of the elements to the left.

    [1|2|3|4|5|6] <--> [7|8|9| | | ] <--> [ | | | | | ]

We can then reap the last slab.

We reap slab lists based on a minimum capacity ratio that the user
provides. For example, if the user provides a ratio of 70/100, we start reaping
as soon as we go below this ratio. For illustration, the above list had a
capacity ratio of 50/100.

We modify the previous DTrace script to look like this:

#pragma D option quiet

dtrace:::BEGIN
{
    self->rets = 0;
}

slablist$target:::add_begin
{
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

pid$target::try_reap_all:entry
{
    self->rts = vtimestamp;
}

pid$target::try_reap_all:return
{
    self->rets = vtimestamp - self->rts;
}

slablist$target:::add_end
{
    self->ets = vtimestamp - self->ts;
    printf("%d %d %d\n", self->elems, self->ets, self->rets);
    self->rets = 0;
}

This adds reap durations as a third column. (The -b 128m flag in the command
below bumps DTrace's principal buffer size, presumably so that the extra
printf() output doesn't get dropped.)

pfexec dtrace -b 128m -c './drv_rand_ops 1416100 intsrt' -s add_trace.d > out2

We modify the gnuplot script to look like this:

# gp1.p
set terminal png size 2160,1920
set output "pngfile2"
plot "out2" using 1:2 title 'insertion time, ns', "out2" using 1:3 title 'reap time, ns'

We run gnuplot:

gnuplot gp1.p

And we get the following graph:

As you can see, the erratic outliers are correlated with reap activity.

What about the non-erratic outliers? I've already suggested that it might be
memory allocation activity. Let's test this theory.

The modified D script adds a fourth column:

#pragma D option quiet

dtrace:::BEGIN
{
    self->rets = 0;
    self->mets = 0;
}

slablist$target:::add_begin
{
    self->ts = vtimestamp;
    self->elems = args[0]->sli_elems;
}

pid$target::mk_slab:entry
{
    self->mts = vtimestamp;
}

pid$target::mk_slab:return
{
    self->mets = vtimestamp - self->mts;
}

pid$target::try_reap_all:entry
{
    self->rts = vtimestamp;
}

pid$target::try_reap_all:return
{
    self->rets = vtimestamp - self->rts;
}

slablist$target:::add_end
{
    self->ets = vtimestamp - self->ts;
    printf("%d %d %d %d\n", self->elems, self->ets, self->rets, self->mets);
    self->rets = 0;
    self->mets = 0;
}

The modified gnuplot script:

# gp2.p
set terminal png size 2160,1920
set output "pngfile3"
plot "out3" using 1:2 title 'insertion time, ns', "out3" using 1:3 title 'reap time, ns', \
"out3" using 1:4 title 'alloc time, ns'

And the resulting graph:

As you can see, the outliers are positively correlated with memory allocation
activity.

Further tracing revealed that we did a lot of reaping to save only a handful
of slabs. In other words, we were sacrificing a lot of CPU resources to save
meager memory resources. The key problem is that the user was only allowed to
specify the minimum capacity of a slab list. The user would have been better
served if he could specify two factors:

minimum number of slabs to collect per reap
minimum percentage of slabs to collect per reap

For example, the user can specify a minimum number of 40, and a minimum
percentage of 30. Each reap then recycles at least 40 slabs (40 kilobytes, at
1K per slab) and at least 30% of the slabs in the list. The code implementing
this change was introduced a few days ago.

Here is a graph of what the insertion pattern looks like now (allocation and
reap times are also shown).

There are no more erratic outliers, resulting from reaps, though there are
still outliers resulting from memory allocations (a problem for another day).

Introducing Slab Lists

UPDATE: Check out the new paper on Slab Lists. It’s still in the works, but is more detailed and more up to date than this post. It also has detailed performance comparisons with AVL Trees, Skip Lists, and B-Trees.

Data Structures Overview

Since the dawn of the computer age, computers have served one unchanging
function: to store raw data, carry out logical and arithmetic operations on
this data, and to present this data as information to the user. This is the
central problem that has created billion-dollar industries and companies that
try to solve it with advanced file-systems, memory allocators, databases,
compression algorithms, and so forth. All of these domains have one thing in
common: they use a variety of containers known as data structures to assist
them in solving their problems. Here I present a new, memory- and
CPU-efficient data structure: the slab list.

The slab list is a parallel list data structure that uses the notion of
slabs, introduced in the slab allocator, to efficiently store and do
operations on data. The slab list's closest relative (that I was able to find)
is likely the unrolled linked list, which tries to settle for a middle ground
between linked lists and arrays.

The slab list is designed with systems programming in mind, and as such its
space and time performance is dependent on the quality of the memory allocator
and thread library used.

This is an introductory post to the data structure. It describes its
implementation in broad strokes. Future posts will drill into some of the
gritty, grimy, and critical implementation details that make slab lists
practical to use. When you come back, be ready to read some code.

At the moment libslablist is still being tested and developed. It is not yet
production quality. But feel free to use it, hack it, and check it out.

History

In 1993 Jeff Bonwick of Sun Microsystems introduced his innovative slab
allocator
in a USENIX paper. The slab allocator made memory allocation more
efficient while decreasing fragmentation. The idea was soon adopted by almost
every modern operating system.

The slab allocator had the slab as the central data structure. It was a 4K
region of memory (at least) that would hold objects of equal size. The slabs
would be stored in a doubly linked list. The 1993 paper (and the 2001 sequel to
that paper) referred to this doubly linked list of slabs as a slab list.

I read these papers in 2006 and the slab list was vaguely interesting, and
certainly practical for use in the allocator, but it did not warrant deeper
inspection. It was, after all, a mere linked list.

Fast forward to 2012. I was working on a project that had to store some of its
data in a doubly linked list, and some of its data in an AVL tree. The program
then had to store 200 million nodes. This would have been trivial on a high-end
server. However, the target platform was a consumer laptop with 4G of memory.
The system was unable to stay responsive under heavy load due to swapping,
and was unable to hold all of the data in RAM anyway.

At first I thought that some other process must have been consuming a lot of
RAM, or that my code had a memory leak. The reason I thought so is that the
program was storing 64-bit integers and doubles. That's 8 bytes of data. Two
hundred million nodes is 1.6 billion bytes, which is around 1.5 gigabytes. This
should have fit in 4GB of RAM. The reason it didn't is that a doubly linked
list node has 16 bytes of overhead due to the pointers, making a single node
worth 24 bytes instead of 8. An AVL tree node has 24 bytes of overhead, making
a single node worth 32 bytes instead of 8.

After I identified the source of my anguish, I sent a probe down into my brain,
hoping it would find a solution. I was surprised by what was sent back.

The slab list. It was not an obvious choice. It didn’t have a general purpose
implementation (the slab allocator’s code is highly specialized and low level,
even for C code). It wasn’t tested in any context outside of slab allocation.
There were many unknowns. What was the ideal slab size? Was it possible to
access data in logarithmic time? How would it interact with the memory
allocator? How would the performance compare with an AVL tree? Would shifting
elements in a slab on every addition or removal have negligible impact on
performance? Or would it make slab lists unusable? How long would it take to
implement?

The most obvious data structure to select from my mental catalog would have
been the B Tree. It has been in use for decades, and has multiple open source
implementations scattered about the net. The B Tree was the proven solution.

But the novelty of a non-tree data structure that was memory efficient, had
logarithmic time complexity, and could potentially (and easily) be parallelized
was far too exciting and compelling. Once I started thinking about it and
implementing it I couldn’t stop — didn’t want to stop. I put the initial project
on the back-burner, as I began to focus on a slab list implementation.

It took two and a half months to complete.

Basic Constraints

The basic constraints for slab lists were:

  • Had to run on Illumos (at least).

  • Potentially take advantage of multicore processors.

  • Store data much more efficiently than any linked list.

  • Insert and remove data quickly.

  • Keep data sorted (if necessary).

  • Preserve insertion order (if necessary).

  • Run the test suite in situ.

  • Potentially map functions to a list in parallel.

  • Potentially fold functions over a list in parallel.

The constraints marked "potentially" didn't have to be implemented in the first
release, or ever. They just had to be possible should the need for those
features arise.

Overview and Comparison

The slab list is designed for configurability and maximum memory efficiency,
while maintaining the semantics of inserting, removing, and retrieving
elements to/from a list or collection. We aim to leverage the advantages of
both doubly linked lists and arrays.

The main structure of the slab list is the slab. It is a structure that can
hold 952 bytes of data plus some metadata. The metadata is 66 bytes in size
(72 bytes with compiler-inserted padding). This makes the slab exactly 1K in
size, meaning that we can fit 4 slabs on one page. The 952 bytes of data is
carved into 8-byte elements, meaning that each slab can hold 119 elements.

Elements can be any 8-byte sequence (64-bit integers, 64-bit double
precision floating point numbers, 64-bit pointers, and so on).

The data to total-data ratio per slab is 952/1024 which is 93% efficiency.
Increasing the number of elements per-slab increases the efficiency, but
decreases performance. Empirically, the 1K slab offers the best performance
improvement per percentage of memory efficiency sacrificed.

To get a more visual perspective of the ratio between data and metadata in a
slab, look at the diagrams below:

In the slab list ASCII diagram below, the stars and arrows are pointers.
“MD” is metadata other than the next and prev pointers.

[*|MD|////952b////|*] -><- [*|MD|////952b////|*]-><-[*|MD|////952b////|*]

If we zoom into one of the slabs:

 *              Legend:
 *                      nextslab: ptr to the next slab (8 bytes)
 *                      prevslab: ptr to the prev slab (8 bytes)
 *                      slistptr: ptr to slab list slab belongs to (8 bytes)
 *                      NE: number of elements in slab (2 bytes)
 *                              ^ the compiler may pad this to be 8 bytes
 *                      maxvalue: largest element in the slab (8 bytes)
 *                      minvalue: smallest element in the slab (8 bytes)
 *                      rest: user-given data (952 bytes)
 *
 *   base of slab structure --->  +--------------------------+----------+
 *                                | mutex structure 24 bytes | nextslab |
 *                                +----------+----------+----+----------+
 *                                | prevslab | slistptr | NE | maxvalue |
 *                                +----------+----------+----+----------+
 *                                | minvalue | 952 bytes of user data   |
 *                                +----------+--------------------------+
 *                                |                ||                   |
 *                                |                ||                   |
 *                                |                \/                   |
 *                                :                                     :
 *                                .                                     .
 *                                .                                     .
 *                                .                                     .
 *                                :                                     :
 *                                |                                     |
 *     end of slab structure--->  +-------------------------------------+

Every slab has a mutex structure. This structure is needed to support
concurrency via locking. Currently, slab lists are single-threaded, but support
for multiple threads is in the pipeline.

As can be seen by the diagram above, user data is an overwhelming portion of
the entire slab.

Compare slab lists to a doubly linked list. Each node has 2 pointers and an
8-byte element, which comes to 24 bytes per node.

The data to total-data ratio per node is 8/24 which is 33% efficiency.

The following two diagrams illustrate why this is so:

[*|8b|*] -><- [*|8b|*]-><-[*|8b|*]

Zooming in on one of the doubly linked list nodes:

 *      base of node ------> +----------+
 *                           | prevnode | (8 bytes)
 *                           +----------+
 *                           | nextnode | (8 bytes)
 *                           +----------+
 *                           | userdata | (8 bytes)
 *      end of node ------>  +----------+

As you can see userdata is 1/3 of the whole node.

Compare slab lists to the Illumos kernel’s AVL Tree implementation, where
the metadata per node is 24 bytes in size, and we have 8/32 which is 25%
efficiency.

The following diagrams illustrate why this is so.

In the AVL tree diagram below, MD is meta data which takes up 16 bytes.

 *                        [*|MD..|8b|*]
 *                        /           \
 *                       /             \
 *           [*|MD..|8b|*]             [*|MD..|8b|*]

Zooming in on one of the AVL nodes:

 *                               +-----------+
 *                               | lf child  | (8 bytes)
 *                               +-----------+
 *                               | rt child  | (8 bytes)
 *                               +-----------+
 *                               |  avl_pcb  | (8 bytes)
 *                               +-----------+
 *                               |  userdata | (8 bytes)
 *                               +-----------+

The avl_pcb is 64 bits containing:

a pointer to the parent (the 61 high-order bits),

a bit indicating if the node is left or right of the parent,

and a balance value (2 bits).

Zooming in on avl_pcb:

 *     |63                               3|        2        |1          0 |
 *     |----------------------------------|-----------------|-------------|
 *     |   avl_parent hi order bits       | avl_child_index | avl_balance |
 *     |----------------------------------|-----------------|-------------|

As can be seen in the second AVL diagram, userdata is 1/4 of the node, which
is indeed 25%.

To summarize the memory-efficiency of each data structure's node, from best
to worst:

 *              Slab list:              93%
 *              Doubly linked list:     33%
 *              AVL Tree:               25%

This is all well and good when a slab is full. However, partially full
slabs could end up having worse memory efficiency per slab than both doubly
linked lists and AVL trees.

In order to guarantee a minimal memory efficiency of 50%, we transparently
use a singly-linked list until the number of elements in a slab list reaches
60 (1/2 of a slab). After we reach 1/2 of a slab we copy all of the
elements into a new slab, and continue using slabs thereafter.

Ranges

Each slab is ranged. That is, it knows what its maximum and minimum elements
are.

The usefulness of the range depends on the type of the slab list. A slab
list can be either sorted or ordered. A sorted slab list keeps all of the
elements sorted (according to a user-given comparison function), whereas an
ordered slab list merely inserts each new element to the end of the list.

Ranges on slabs are useful for sorted slab lists, as we can use them to
traverse the list 119 times more quickly than a linked list when searching
for an element. When we get to a slab that may contain an element, we try
to find the index of the element via binary search.

Sublayers

If the slab list exceeds a certain number of slabs (user-defined), we create
a sublayer for that slab list. A sublayer is an internal slab list that
contains pointers to slabs as elements. All of the slab-pointers are sorted,
and the sublayer’s slabs are also ranged. A sublayer is only created for
sorted slab lists. Slabs in a sublayer are called subslabs. Superslabs are
subslabs that are referred to from a lower subslab. Slabs at the highest
superlayer (slabs that contain the user’s data) are simply known as slabs,
or superslabs from the perspective of the highest sublayer. The lowest
sublayer is called the baselayer.

 *    slabs from superlayer -------->  [ ][ ][ ][ ][ ][ ][ ][ ] .....
 *                                      ^  ^  ^  ^  ^  ^  ^  ^
 *                                      |  |  |  |  |  |  |  |
 *                                      |  |  |  |  |  |  |  |
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *    a slab from sublayer ----> |.....|* |* |* |* |* |* |* |* |......|
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *                           metadata   slab elems (slab ptrs)

This sublayering of slabs gives us a structure that looks like an inverted
pyramid.

 *        highest superlayer --->   _________
 *                                  \       /
 *                                   \     /
 *                                    \   /
 *        baselayer -------------->    \_/

If the maximum number of slabs the baselayer can have is N, then it
can have at most N*119 elems. If we have only 1 sublayer attached to the
slab list, then we have N*119 elems. If we have two we have N*(119^2)
user-inserted elems. If we have 3, N*(119^3) elems. And so on.

Slab lists are limited to a maximum of 8 sublayers. With a completely full
slab list of 8-byte integers, we would consume 0.365 zettabytes of RAM.
That exceeds the largest hard drives on the market today (2012), let alone
any amount of RAM one could cram into a computer.

If we wanted to find a slab with element E, and our slab list has
sublayers, we 1) go to the baselayer, and 2) do a linear search on that
sublayer until we find a slab with a range into which E could fall.

The slab that has been found at the baselayer contains pointers to slabs in
the superlayer. We use binary search to find the slab in the superlayer to
which E could belong. We keep doing this repeatedly until we get to the
highest superlayer. The highest superlayer is the layer that contains data
that the user gave it.

The use of sublayers and binary search gives a search time comparable to
that of balanced binary trees (that is, logarithmic time).

Information specific to individual slab lists is encoded in the slablist_t
struct.

The user can specify the minimum capacity ratio (the number of elems in the
entire slab list, to the number of possible elems [119*nslabs]), the name, the
number of slabs that triggers sublayer creation, the object size, and the
comparison function.

If a slab is added to or removed from the superlayer, we have to ripple the
changes down to the sublayer by removing or adding a reference to the slab.
If that sublayer gains or loses a slab as a result of the previous ripple,
we have to ripple that change down to the next sublayer, and so on.

DTrace Probes / Test Suite

This slab list implementation has a profusion of DTrace probes. There are
two kinds of probes: tracing probes and testing probes.

Tracing probes are ordinary DTrace probes that notify the user of events
within libslablist.

Testing probes are DTrace probes that, when enabled, trigger tests that
verify the correctness of the slab list data structure after each
modification, and report failure or success via dtrace. All testing probes
have a test_* prefix.

Tests are driven by driver programs that barrage slab lists with random data
(integers and variable length strings).

The probes and tests will be expanded upon in future posts.

Memory Allocation

Slab lists use the libumem slab allocator available on Illumos, but can be
compiled against any libc malloc implementation. This is achieved by
abstracting away the allocation code.

Instead of doing:

slab_t *s = malloc(sizeof(slab_t));

or

slab_t *s = umem_cache_alloc(cache_slab, UMEM_NOFAIL);

we do:

slab_t *s = mk_slab();

Where mk_slab calls the appropriate allocation function, depending on the
compile time options.

Code Organization

See src/slablist_provider.d for all of the slablist probes (testing and
tracing).

See src/slablist_add.c and src/slablist_rem.c for the addition and removal
algorithms.

See src/slablist_find.c for search algorithms used.

See src/slablist_umem.c and src/slablist_malloc.c for the specifics behind
memory allocation.

See src/slablist_test.c for the test functions. They are only called when a
corresponding dtrace test-probe is enabled. All telemetry is consumed via
dtrace probes.

See src/slablist_provider.d for comments regarding individual probes. See
slablist_test.c for test function implementation.

See src/slablist.d for translators used by the probes.

See build/build_lib for build script implementation. If you have primary
admin access on Illumos you can run ./build_install_all. If not, you’ll
figure something out.

See drv/ for various driver programs. drv/drv_rand_ops executes an arbitrary
number of random insertions and removals, on 4 kinds of slab lists (sorted
int and str list, and ordered int and str list).
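
For instance, the driver invocation used throughout the performance post above
(which I read as 1,416,100 operations on the sorted-int list) is:

./drv_rand_ops 1416100 intsrt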

See tests/ for various test D scripts.

See tools/ for various tool D scripts.

Illuminating Lucene with Illumos

Intuition is the clear conception of the whole at once.
– Johann Kaspar Lavater

Every once in a while one comes across a program of such monstrous complexity
that understanding it by mere code-reading is a privilege reserved only for the
most persistent and patient of people.

Lucene is one such program, which I have to use at school for various projects.
The sheer size of the code-base is a massive barrier to understanding, which is
only compounded by the fact that I program mostly in C, and am merely
competent (if that) in Java.


A Good Post About SmartOS

WordPress's reblogging facility isn't advanced enough to figure out that I reblogged something to the wrong blog and would like to move it. So I have to resort to a far more crude, but effective, method of sharing this excellent introductory post:

Discovering SmartOS

Introductory and informative blog posts like this one are needed to take SmartOS (and other Illumos-based tech) beyond the murky, mysterious realm of kernel engineering, and into the intensely pragmatic realm of common system administration.

Lex Provider Update

A while ago, I wrote about the DTrace provider I made for lex, the lexical
analyzer that ships with Illumos.

I added one new probe to it that solves a simple but particularly vexing
issue: tracing and aggregating on the regular expression used to match a token,
instead of the actual token or the returned token-id. This new probe is called
match_regex.

The returned token-id (the value returned by yylex()) would only be adequate
for identifying the most commonly matched regular expressions if yylex()
necessarily returned on every single match. It doesn't.
