Illuminating Lucene with Illumos

Intuition is the clear conception of the whole at once.
– Johann Kaspar Lavater

Every once in a while one comes across a program of such monstrous complexity,
that understanding by mere code-reading is a privilege reserved only for the
most persistent and patient of people.

Lucene is one such program, that I have to use at school for various projects.
The sheer size of the code-base is a massive barrier to understanding, which is
only exponentiated by the fact that I program mostly in C, and am merely
competent (if that) in Java.

Lucene is designed in typical OOP fashion, where the codebase is split among a
multitude of files, each inheriting from several others, in a very deep object
hierarchy.

For most programmers that are familiar with Java, the task of understanding and
modifying Lucene is far less of a struggle, as they at least understand what
certain objects are (like Collection, Throwable, StringReader, Reference,
WeakReference, etc), and are familiar with java idioms.

An additional help, is Lucene’s extensive online documentation. Unfortunately,
the documentation is as sprawling as the code, and while it will tell you
what a class or method does, it is up to you to visualize and understand
where that class or method fits in the overall system.

For grizzled systems guys, who are used to functions that can only have 1
definition, and languages with very simple rules, poking at the innards of
lucene is like tickling a dragon’s tail. A single misstep, and you get burned
by Java or Lucene arcana.

My goal was to understand what code-path Lucene took to assign a score to
various documents, given a query.

Instead of reading the code, and simulating what it does in our minds (an
endeavor that would take weeks or months), we use DTrace (an implementation of
Dynamic Tracing), to create a code-flow diagram of java method calls that are
made by Lucene.

To accomplish this, I used a (slightly modified) D script from Brendan Gregg’s
DTrace book (java_flow.d):

#pragma D option bufsize=16m
#pragma D option quiet
#pragma D option switchrate=10

self int depth[int];

dtrace:::BEGIN
{
    printf("%3s %6s %-16s -- %s\n", "C", "PID", "TIME(us)", "CLASS.METHOD");
}

hotspot$target:::method-entry
{
    this->class = (char *)copyin(arg1, arg2 + 1);
    this->class[arg2] = '';
    this->method = (char *)copyin(arg3, arg4 + 1);
    this->method[arg4] = '';

    printf("%3d %6d %-6d %*s-> %s.%s\n", cpu, pid, timestamp / 1000,
        self->depth[arg0] * 2, "", stringof(this->class),
        stringof(this->method));
    self->depth[arg0]++;
}

hotspot$target:::method-return
{
    this->class = (char *)copyin(arg1, arg2 + 1);
    this->class[arg2] = '';
    this->method = (char *)copyin(arg3, arg4 + 1);
    this->method[arg4] = '';

    self->depth[arg0] -= self-> depth[arg0] > 0 ? 1 : 0;
    printf("%3d %6d %-6d %*s<- %s.%s\n", cpu, pid, timestamp / 1000,
        self->depth[arg0] * 2, "", stringof(this->class),
        stringof(this->method));
}

I ran it on the search demo code that came with a prior version of Lucene. The
code is available here.

Here is a link to the generated code-flow diagram (it is 16M).

And here is a snippet (that should look familiar to regular DTrace users):

...SNIP...
0   1408 27469043126               -> org/apache/lucene/search/Similarity.idfExplain
0   1408 27469043136                 -> org/apache/lucene/search/IndexSearcher.maxDoc
0   1408 27469043143                 <- org/apache/lucene/search/IndexSearcher.maxDoc
0   1408 27469043151                 -> org/apache/lucene/search/DefaultSimilarity.idf
0   1408 27469043162                 <- org/apache/lucene/search/DefaultSimilarity.idf
0   1408 27469043338                 -> org/apache/lucene/search/Similarity$1.<init>
0   1408 27469043352                   -> org/apache/lucene/search/Explanation$IDFExplanation.<init>
0   1408 27469043360                     -> java/lang/Object.<init>
0   1408 27469043367                     <- java/lang/Object.<init>
0   1408 27469043372                   <- org/apache/lucene/search/Explanation$IDFExplanation.<init>
0   1408 27469043378                 <- org/apache/lucene/search/Similarity$1.<init>
0   1408 27469043383               <- org/apache/lucene/search/Similarity.idfExplain
0   1408 27469043395               -> org/apache/lucene/search/Similarity$1.getIdf
0   1408 27469043402               <- org/apache/lucene/search/Similarity$1.getIdf
0   1408 27469043408             <- org/apache/lucene/search/TermQuery$TermWeight.<init>
0   1408 27469043414           <- org/apache/lucene/search/TermQuery.createWeight
...SNIP...

The above snippet shows the first call made to functions in the Similarity
class from the TermWeight subclass of TermQuery.

This particular invocation was used, as SearchFiles ran in a different window:

pfexec dtrace -p' `pgrep java` -s java_flow.d > jflow

The classes of interest were Similarity and DefaultSimilarity.

So one can eyeball the flow diagram, and get an idea of what is going on.

If one just wanted to see functions that are defined in lucene but not in,
say java/io, one could just do:

cat jflow | grep lucene | less

If one wanted to see the classes that are related to the Query class (or
querying in general), one would do:

cat jflow | grep query | less

Cool stuff.

This code-flow graph alone, which took less than a minute to produce, revealed
with incredible concision where objects fit in lucene’s hierarchy. In minutes,
it became perfectly clear which classes needed to be modified and created, to
achieve what I wanted. I was blind but now I see.

If I had to plow through Lucene’s documentation and code, it would have likely
been a week before I figured out what classes needed to be modified. And even
then, I’d need some time to actually write (and rewrite) the code.

I am convinced that DTrace is not merely a useful debugger, but rather a force
multiplier for programmers. It saves obscene amounts of time.

It took me a total of 3 hours to write all of the code that I needed and to
understand what needed to written. My earlier estimate for a sans-DTrace
execution of this project was 1 week (I’m accounting for the fact that I have
other classes to deal with). According to Poul-Henning Kemp’s rule for project
estimation, the most realistic estimate is: an initial estimate, multiplied by
two, and raised to the next-highest unit of time.

raise(1 week * 2) => raise(2 weeks) => 2 months

And I believe it. I (unexpectedly) had to implement my own QueryParser class
(because its use of TermQuery and TermScorer is hard-coded, and one can’t
simply swap them out). One can only swap out Similarity which, doesn’t have
access to some of the data that’s needed from the TermScorer class.

Indeed, forking and modularizing Lucene’s QueryParser took a while, but
DTrace helped me cut through the other trivial code, by giving me an intuition
of where these modules fit.

And this is what DTrace does. It gives programmers intuition about the code
that is running on Illumos. Without DTrace, that intuition is very difficult to
get (if not impossible), and the cost of getting it is very high. This is just
the tip of the ice-berg, as far as Lucene is concerned. DTrace can be used to
see which queries or code-paths generate the most IO, or cause the most
swapping, or which files are being accessed the most. The possibilities are
limitless, and the Lucene developers need all of the data they can get if they
want to shove Lucene forward, to faster speeds and higher scales.

Not having DTrace available on my machine, would have been the kiss of death
for this assignment. What a pity that this technology isn’t available in
every programmer’s toolbox.

On a tangential note, Illumos is a very apt name for this OS, as it
illuminates the darkest corners of every program that can run on it.

About these ads

2 thoughts on “Illuminating Lucene with Illumos

  1. Hi Vladimir. I’m glad you liked the post, and am sorry about the state of affairs in Russia.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s