Introducing Slab Lists

UPDATE: Check out the new paper on Slab Lists. It’s still in the works, but is more detailed and more up to date than this post. It also has detailed performance comparisons with AVL Trees, Skip Lists, and B-Trees.

Data Structures Overview

Since the dawn of the computer age, computers have served one unchanging
function: to store raw data, carry out logical and arithmetic operations on
that data, and present the result as information to the user. This is the
central problem that has created billion-dollar industries and companies
that try to solve it with advanced file systems, memory allocators,
databases, compression algorithms, and so forth. All of these domains have
one thing in common: they all use containers known as data structures to
help solve their problems. Here I present a new, memory- and CPU-efficient
data structure: the slab list.

The slab list is a parallel list data structure that uses the notion of
slabs, introduced in the slab allocator, to store and operate on data
efficiently. The slab list’s closest relative (that I was able to find) is
likely the unrolled linked list, which tries to settle for a middle ground
between linked lists and arrays.

The slab list is designed with systems programming in mind, and as such its
space and time performance is dependent on the quality of the memory
allocator and thread library used.

This is an introductory post to the data structure. It describes its
implementation in broad strokes. Future posts will drill into some of the
gritty, grimy, and critical implementation details that make slab lists
practical to use. When you come back, be ready to read some code.

At the moment libslablist is still being tested and developed. It is not yet
production quality. But feel free to use it, hack it, and check it out.

History

In 1993 Jeff Bonwick of Sun Microsystems introduced his innovative slab
allocator
in a USENIX paper. The slab allocator made memory allocation more
efficient while decreasing fragmentation. The idea was soon adopted by almost
every modern operating system.

The slab allocator had the slab as its central data structure. It was a 4K
region of memory (at least) that would hold objects of equal size. The slabs
would be stored in a doubly linked list. The 1993 paper (and the 2001 sequel
to that paper) referred to this doubly linked list of slabs as a slab list.

I read these papers in 2006. The slab list was vaguely interesting, and
certainly practical for use in the allocator, but it did not warrant deeper
inspection. It was, after all, a mere linked list.

Fast forward to 2012. I was working on a project that had to store some of
its data in a doubly linked list, and some of its data in an AVL tree. The
program had to store 200 million nodes. This would have been trivial on a
high-end server. However, the target platform was a consumer laptop with
4 GB of memory. The system was unable to stay responsive under heavy load
due to swapping, and was unable to hold all of the data in RAM anyway.

At first, I thought some other process must have been consuming a lot of
RAM, or that my code had a memory leak. The reason I thought so is that the
program was storing 64-bit integers and doubles: 8 bytes of data each. Two
hundred million nodes is 1.6 billion bytes, which is around 1.5 gigabytes.
This should have fit in 4 GB of RAM. The reason it didn’t is that a doubly
linked list has 16 bytes of overhead per node, due to the pointers, making
a single node worth 24 bytes instead of 8 (4.8 GB in total). And an AVL
tree has 24 bytes of overhead, making a single node worth 32 bytes instead
of 8 (6.4 GB in total).

After I identified the source of my anguish, I sent a probe down into my brain,
hoping it would find a solution. I was surprised by what was sent back.

The slab list. It was not an obvious choice. It didn’t have a general purpose
implementation (the slab allocator’s code is highly specialized and low level,
even for C code). It wasn’t tested in any context outside of slab allocation.
There were many unknowns. What was the ideal slab size? Was it possible to
access data in logarithmic time? How would it interact with the memory
allocator? How would the performance compare with an AVL tree? Would shifting
elements in a slab on every addition or removal have negligible impact on
performance? Or would it make slab lists unusable? How long would it take to
implement?

The most obvious data structure to select from my mental catalog would have
been the B-Tree. It has been in use for decades, and has multiple open
source implementations scattered about the net. The B-Tree was the proven
solution.

But the novelty of a non-tree data structure that was memory efficient, had
logarithmic time complexity, and could potentially (and easily) be parallelized
was far too exciting and compelling. Once I started thinking about it and
implementing it I couldn’t stop — didn’t want to stop. I put the initial project
on the back-burner, as I began to focus on a slab list implementation.

It took two and a half months to complete.

Basic Constraints

The basic constraints for slab lists were:

  • Had to run on Illumos (at least).

  • Potentially take advantage of multicore processors.

  • Store data much more efficiently than any linked list.

  • Insert and remove data quickly.

  • Keep data sorted (if necessary).

  • Preserve insertion order (if necessary).

  • Run the test suite in situ.

  • Potentially map functions to a list in parallel.

  • Potentially fold functions over a list in parallel.

None of the “potential” constraints above had to be implemented in the
first release, or ever. They just had to be possible, should the need for
those features arise.

Overview and Comparison

The slab list is designed for configurability and maximum memory efficiency,
while maintaining the semantics of inserting, removing, and retrieving
elements to/from a list or collection. We aim to leverage the advantages of
both doubly linked lists and arrays.

The main structure of the slab list is the slab. It is a structure that can
hold 952 bytes of data plus some metadata. The metadata is 66 bytes in size
(72 bytes with compiler-inserted padding). This makes the slab exactly 1K in
size, meaning that we can fit 4 slabs on one page. The 952 bytes of data are
carved into 8-byte elements, meaning that each slab can hold 119 elements.

Elements can be any 8-byte sequence (64-bit integers, 64-bit double
precision floating point numbers, 64-bit pointers, and so on).

The data to total-data ratio per slab is 952/1024 which is 93% efficiency.
Increasing the number of elements per-slab increases the efficiency, but
decreases performance. Empirically, the 1K slab offers the best performance
improvement per percentage of memory efficiency sacrificed.

To get a more visual perspective on the ratio of data to metadata in a
slab, look at the diagrams below.

In the slab list ASCII diagram below, the stars and arrows are pointers.
“MD” is the metadata other than the next and prev pointers.

[*|MD|////952b////|*] -><- [*|MD|////952b////|*]-><-[*|MD|////952b////|*]

If we zoom into one of the slabs:

 *              Legend:
 *                      nextslab: ptr to the next slab (8 bytes)
 *                      prevslab: ptr to the prev slab (8 bytes)
 *                      slistptr: ptr to slab list slab belongs to (8 bytes)
 *                      NE: number of elements in slab (2 bytes)
 *                              ^ the compiler may pad this to be 8 bytes
 *                      maxvalue: largest element in the slab (8 bytes)
 *                      minvalue: smallest element in the slab (8 bytes)
 *                      rest: user-given data (952 bytes)
 *
 *   base of slab structure --->  +--------------------------+----------+
 *                                | mutex structure 24 bytes | nextslab |
 *                                +----------+----------+----+----------+
 *                                | prevslab | slistptr | NE | maxvalue |
 *                                +----------+----------+----+----------+
 *                                | minvalue | 952 bytes of user data   |
 *                                +----------+--------------------------+
 *                                |                ||                   |
 *                                |                ||                   |
 *                                |                \/                   |
 *                                :                                     :
 *                                .                                     .
 *                                .                                     .
 *                                .                                     .
 *                                :                                     :
 *                                |                                     |
 *     end of slab structure--->  +-------------------------------------+

Every slab has a mutex structure, which is needed to support locking for
concurrent access. Currently, slab lists are single-threaded, but support
for multiple threads is in the pipeline.

As can be seen in the diagram above, user data makes up an overwhelming
portion of the entire slab.
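
If you’d rather read code than ASCII art, here is a minimal sketch of the
slab layout in C. The field names and the mutex type are illustrative
assumptions, not the definitions libslablist actually uses:

#include <pthread.h>
#include <stdint.h>

#define SLAB_DATA_BYTES 952
#define SLAB_ELEMS      (SLAB_DATA_BYTES / 8)   /* 119 elements */

typedef uint64_t slab_elem_t;   /* any 8-byte value: int, double, ptr */

typedef struct slab {
        pthread_mutex_t s_mutex;        /* 24 bytes on Illumos libc */
        struct slab     *s_next;        /* nextslab (8 bytes) */
        struct slab     *s_prev;        /* prevslab (8 bytes) */
        struct slablist *s_list;        /* slistptr (8 bytes) */
        uint16_t        s_elems;        /* NE; padded out to 8 bytes */
        slab_elem_t     s_max;          /* maxvalue (8 bytes) */
        slab_elem_t     s_min;          /* minvalue (8 bytes) */
        slab_elem_t     s_arr[SLAB_ELEMS]; /* 952 bytes of user data */
} slab_t;       /* 1024 bytes in total, given the sizes above */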

Compare slab lists to a doubly linked list. Each node has 2 pointers and an
8-byte element, which is 24 bytes per node.

The data to total-data ratio per node is 8/24, which is 33% efficiency.

The following two diagrams illustrate why this is so:

[*|8b|*] -><- [*|8b|*]-><-[*|8b|*]

Zooming in on one of the doubly linked list nodes:

 *      base of node ------> +----------+
 *                           | prevnode | (8 bytes)
 *                           +----------+
 *                           | nextnode | (8 bytes)
 *                           +----------+
 *                           | userdata | (8 bytes)
 *      end of node ------>  +----------+

As you can see userdata is 1/3 of the whole node.

Compare slab lists to the Illumos kernel’s AVL Tree implementation, where
the metadata per node is 24 bytes in size: we have 8/32, which is 25%
efficiency.

The following diagrams illustrate why this is so.

In the AVL tree diagram below, the stars are the two 8-byte child pointers,
and MD is the remaining 8 bytes of metadata (the avl_pcb).

 *                        [*|MD..|8b|*]
 *                        /           \
 *                       /             \
 *           [*|MD..|8b|*]             [*|MD..|8b|*]

Zooming in on one of the AVL nodes:

 *                               +-----------+
 *                               | lf child  | (8 bytes)
 *                               +-----------+
 *                               | rt child  | (8 bytes)
 *                               +-----------+
 *                               |  avl_pcb  | (8 bytes)
 *                               +-----------+
 *                               |  userdata | (8 bytes)
 *                               +-----------+

The avl_pcb is a 64-bit word containing:

  • a pointer to the parent (the 61 high-order bits; the low 3 bits of an
    8-byte-aligned pointer are always zero),

  • a bit indicating if the node is to the left or right of its parent,

  • and a balance value (2 bits).

Zooming in on avl_pcb:

 *     |63                               3|        2        |1          0 |
 *     |----------------------------------|-----------------|-------------|
 *     |   avl_parent hi order bits       | avl_child_index | avl_balance |
 *     |----------------------------------|-----------------|-------------|
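
As a sketch, unpacking such a packed word could look like the following.
These macro names are illustrative (the real Illumos macros live in
avl_impl.h), and the biased balance encoding is an assumption:

#include <stdint.h>

struct avl_node;                /* defined by the AVL implementation */

typedef uintptr_t avl_pcb_t;

/* Nodes are 8-byte aligned, so the low 3 bits of the pointer are free. */
#define PCB_PARENT(pcb)  ((struct avl_node *)((pcb) & ~(avl_pcb_t)7))
#define PCB_CHILD(pcb)   (((pcb) >> 2) & 1)      /* 0 = left, 1 = right */
#define PCB_BALANCE(pcb) ((int)((pcb) & 3) - 1)  /* -1, 0, or +1 */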

As can be seen in the second AVL diagram, userdata is 1/4 of the node, which
is indeed 25%.

To summarize the memory-efficiency of each data structure’s node, from best
to worst:

 *              Slab list:              93%
 *              Doubly linked list:     33%
 *              AVL Tree:               25%

This is all well and good when a slab is full. However, partially full
slabs could end up having worse memory efficiency per slab than both doubly
linked lists and AVL trees.

In order to guarantee a minimum memory efficiency of 50%, we transparently
use a singly linked list until the number of elements in a slab list reaches
60 (1/2 of a slab). After we reach 1/2 of a slab, we copy all of the
elements into a new slab, and continue using slabs thereafter.
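
Here is a minimal sketch of that promotion step, assuming a trivial singly
linked node type and the illustrative slab_t from earlier; all of the names
are hypothetical:

#include <stdlib.h>

typedef struct small_node {
        struct small_node *sn_next;
        slab_elem_t       sn_data;
} small_node_t;

/*
 * Copy the elements of a (sorted, non-empty) small list into a freshly
 * allocated slab, freeing the small nodes as we go.
 */
static slab_t *
promote_to_slab(small_node_t *head, uint16_t nelems)
{
        slab_t *s = calloc(1, sizeof (slab_t));
        small_node_t *n = head;
        uint16_t i;

        if (s == NULL)
                return (NULL);
        for (i = 0; i < nelems && n != NULL; i++) {
                small_node_t *next = n->sn_next;

                s->s_arr[i] = n->sn_data;
                free(n);
                n = next;
        }
        s->s_elems = i;
        s->s_min = s->s_arr[0];         /* ranges fall out of sortedness */
        s->s_max = s->s_arr[i - 1];
        return (s);
}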

Ranges

Each slab is ranged. That is, it knows what its maximum and minimum elements
are.

The usefulness of the range depends on the type of the slab list. A slab
list can be either sorted or ordered. A sorted slab list keeps all of the
elements sorted (according to a user-given comparison function), whereas an
ordered slab list merely inserts each new element at the end of the list.

Ranges on slabs are useful for sorted slab lists, as we can use them to
traverse the list 119 times more quickly than a linked list when searching
for an element, since we compare against whole slabs rather than individual
nodes. When we get to a slab whose range may contain the element, we try
to find its index within the slab via binary search.
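
To make that concrete, here is a hedged sketch using the illustrative
slab_t from earlier. It assumes a sorted slab list of raw 64-bit integers;
the real library compares elements through the user-given comparison
function:

/* Skip whole slabs using their ranges; NULL means e is not in the list. */
static slab_t *
find_candidate_slab(slab_t *head, slab_elem_t e)
{
        slab_t *s;

        for (s = head; s != NULL; s = s->s_next) {
                if (e < s->s_min)
                        return (NULL);  /* sorted: e can't be further on */
                if (e <= s->s_max)
                        return (s);     /* e falls in this slab's range */
        }
        return (NULL);
}

/* Binary search within a slab; returns the index of e, or -1. */
static int
slab_bin_srch(slab_t *s, slab_elem_t e)
{
        int lo = 0;
        int hi = s->s_elems - 1;

        while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;

                if (s->s_arr[mid] == e)
                        return (mid);
                if (s->s_arr[mid] < e)
                        lo = mid + 1;
                else
                        hi = mid - 1;
        }
        return (-1);
}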

Sublayers

If the slab list exceeds a certain number of slabs (user-defined), we create
a sublayer for that slab list. A sublayer is an internal slab list that
contains pointers to slabs as its elements. All of the slab-pointers are
sorted, and the sublayer’s slabs are also ranged. A sublayer is only created
for sorted slab lists. Slabs in a sublayer are called subslabs. Superslabs
are slabs that are referred to from a subslab in the layer beneath them.
Slabs at the highest superlayer (the slabs that contain the user’s data) are
simply known as slabs, or as superslabs from the perspective of the highest
sublayer. The lowest sublayer is called the baselayer.

 *    slabs from superlayer -------->  [ ][ ][ ][ ][ ][ ][ ][ ] .....
 *                                      ^  ^  ^  ^  ^  ^  ^  ^
 *                                      |  |  |  |  |  |  |  |
 *                                      |  |  |  |  |  |  |  |
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *    a slab from sublayer ----> |.....|* |* |* |* |* |* |* |* |......|
 *                               +.....+--+--+--+--+--+--+--+--+......+
 *                           metadata   slab elems (slab ptrs)

This sublayering of slabs gives us a structure that looks like an inverted
pyramid.

 *        highest superlayer --->   _________
 *                                  \       /
 *                                   \     /
 *                                    \   /
 *        baselayer -------------->    \_/

If the maximum number of slabs the baselayer can have is N, then it
can hold at most N*119 elems. If we have only 1 sublayer attached to the
slab list, then we have N*119 elems. If we have two, we have N*(119^2)
user-inserted elems. If we have 3, N*(119^3) elems. And so on: each
additional sublayer multiplies the capacity by 119.

Slab lists are limited to a maximum of 8 sublayers. With a completely full
slab list of 8-byte integers, we would consume 0.365 zettabytes of RAM.
That exceeds the largest hard drives on the market today (2012), let alone
any amount of RAM one could cram into a computer.

If we want to find the slab that could contain an element E, and our slab
list has sublayers, we 1) go to the baselayer, and 2) do a linear search
along that layer until we find a slab whose range could contain E.

The slab found at the baselayer contains pointers to slabs in the
superlayer. We use binary search to find the slab in the superlayer whose
range could contain E. We repeat this until we get to the highest
superlayer, which is the layer that contains the data the user gave it.

The use of sublayers of slab lists, and binary search gives a search time
comparable to balanced binary trees (that is, logarithmic time).
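
Here is a hedged sketch of that whole descent and ascent, again reusing the
illustrative slab_t; treating a subslab’s element array as an array of slab
pointers is an assumption about the internal layout:

/* Binary-search a subslab for the superlayer slab whose range holds e. */
static slab_t *
subslab_bin_srch(slab_t *subslab, slab_elem_t e)
{
        slab_t **ptrs = (slab_t **)subslab->s_arr;
        int lo = 0;
        int hi = subslab->s_elems - 1;

        while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;

                if (e < ptrs[mid]->s_min)
                        hi = mid - 1;
                else if (e > ptrs[mid]->s_max)
                        lo = mid + 1;
                else
                        return (ptrs[mid]);
        }
        return (NULL);
}

/* Find the user-data slab whose range could contain e. */
static slab_t *
slablist_find_slab(slab_t *baselayer, int nsublayers, slab_elem_t e)
{
        slab_t *s;
        int i;

        /* 1) linear search along the baselayer, using slab ranges */
        for (s = baselayer; s != NULL; s = s->s_next)
                if (e >= s->s_min && e <= s->s_max)
                        break;

        /* 2) binary-search upward through each superlayer */
        for (i = 0; s != NULL && i < nsublayers; i++)
                s = subslab_bin_srch(s, e);

        return (s);
}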

Information specific to an individual slab list is encoded in the
slablist_t struct.

The user can specify the minimum capacity ratio (the number of elems in the
entire slab list, to the number of possible elems [119*nslabs]), the list’s
name, the number of slabs that triggers sublayer creation, the object size,
and the comparison function.

If a slab is added to or removed from the superlayer, we have to ripple the
change down to the sublayer by adding or removing a reference to that slab.
If the sublayer gains or loses a slab as a result of the previous ripple,
we have to ripple that change down to the next sublayer, and so on.

DTrace Probes / Test Suite

This slab list implementation has a profusion of DTrace probes. There are
two kinds of probes: tracing probes and testing probes.

Tracing probes are ordinary DTrace probes that notify the user of events
within libslablist.

Testing probes are DTrace probes that, when enabled, trigger tests that
verify the correctness of the slab list data structure after each
modification, and report success or failure via DTrace. All testing probes
have a test_* prefix.

Tests are driven by driver programs that barrage slab lists with random data
(integers and variable-length strings).

The probes and tests will be expanded upon in future posts.

Memory Allocation

Slab lists use the libumem slab allocator available on Illumos, but can be
compiled against any libc malloc implementation. This is achieved by
abstracting away the allocation code.

Instead of doing:

slab_t *s = malloc(sizeof (slab_t));

or

slab_t *s = umem_cache_alloc(cache_slab, UMEM_NOFAIL);

we do:

slab_t *s = mk_slab();

Where mk_slab calls the appropriate allocation function, depending on the
compile-time options.
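
For illustration, the dispatch could look like this; the UMEM guard macro
and the cache variable are assumptions, not libslablist’s actual build
flags:

#ifdef UMEM
#include <umem.h>

static umem_cache_t *cache_slab;        /* created at library init */

slab_t *
mk_slab(void)
{
        /* UMEM_NOFAIL: block until the allocation can succeed */
        return (umem_cache_alloc(cache_slab, UMEM_NOFAIL));
}
#else
#include <stdlib.h>

slab_t *
mk_slab(void)
{
        return (calloc(1, sizeof (slab_t)));
}
#endif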

Code Organization

See src/slablist_provider.d for all of the slablist probes (testing and
tracing).

See src/slablist_add.c and src/slablist_rem.c for the addition and removal
algorithms.

See src/slablist_find.c for search algorithms used.

See src/slablist_umem.c and src/slablist_malloc.c for the specifics behind
memory allocation.

See src/slablist_test.c for the test functions. They are only called when a
corresponding DTrace test probe is enabled. All telemetry is consumed via
DTrace probes.

See src/slablist_provider.d for comments regarding individual probes. See
slablist_test.c for test function implementation.

See src/slablist.d for translators used by the probes.

See build/build_lib for build script implementation. If you have primary
admin access on Illumos you can run ./build_install_all. If not, you’ll
figure something out.

See drv/ for various driver programs. drv/drv_rand_ops executes an arbitrary
number of random insertions and removals, on 4 kinds of slab lists (sorted
int and str list, and ordered int and str list).

See tests/ for various test D scripts.

See tools/ for various tool D scripts.
