Thursday, April 6, 2017

Unbuffered I/O of arbitrary record sizes.

A few years ago I wrote a plain text indexer.

It was a bit of an experiment, so I had fun with it.

The files were all around 10MB each.

There were around 40GB of logs per day.

So, the overall data did not fit in memory (this was years ago), but individual files did.

Something I had always wanted to use successfully was unbuffered I/O.

Unbuffered I/O was appropriate here because an indexer tends to read things only once. Unbuffered is also uncached -- the data goes directly into user-supplied buffers, skipping the file system cache.

On Windows, at least, unbuffered I/O requires aligned file offsets, aligned read/write lengths, and aligned buffer addresses. This make sense -- the alignment allows the block-level driver to work directly -- no buffering. The alignment requirement is not constant, it is up to some underlying driver and can be queried. But I just use 64K.

To align the buffers I just use VirtualAlloc. This aligns at 64K boundaries and is not interesting or challenging.

The indexer was line oriented (spec suggested by someone else and I went with it). The lines were of arbitrary length. Not fixed length and not short.

And word oriented -- it would find words, and record the offset of the start of the line they occured on.

Initially I said, hey, this is easy, whenever a line crosses the end of the buffer, double the buffer size.

This does work a little teeny tiny bit, but it is actually pretty dumb and bad.

What happens is that you will repeatedly hit a line that crosses the end of the buffer. It doesn't require a long line to do this.

The buffer doubles and doubles and doubles. The only thing stopping it is when the entire file fits in the buffer. Which, for 10MB isn't terrible, but defeats the point and doesn't scale well (line orientation or word orientation should lead to an ability to index documents that do not fit in memory, as well the indexer was partly multi-threaded, so 10MB buffer per thread is perhaps large).

So, the solution is to read into the middle of the buffer (double the initial buffer size perhaps).

If you then cross the end of the buffer, you copy the buffer contents to before the middle. Not the entire half buffer, just from the start of the line that cross the end of the buffer.

This will happen repeatedly.

The last line in the buffer will repeatedly cross the end of the buffer, and every time you will copy it back so that the data you have ends at the middle of the buffer.

And then eventually you might hit a very long line that does not fit in the buffer at all. Only then do you double the buffer, and copy the entire thing, to end at the middle of the newly doubled buffer.

Some pictures:

 

              64K          128K
 |------------|------------|
               the\n quick brown fox\njumped really really high\n

 
First line fits no problem. Second line straddles. Copy back and read more.

              64K          128K
 |------------|------------|
        quick brown fox\njumped really really high\n

 
Second line fits now. Third line straddles. Copy back and read more.

              64K          128K
 |------------|------------|
            jumped really really high\n            

 
Line still straddles and first half of buffer already in use. Double buffer and read more. When you double, you also copy, and you copy to before the middle. It is possible maybe to copy to a different place but this maximizes buffer use and ensures you have an aligned location to read to.

                           128K                      256K
 |-------------------------|-------------------------|
            jumped really really high\n


 

Note that this example is contrived. Typically many lines will fit in the buffer, and relatively little copying occurs. And doubling will be rare. But note that some copying will occur at almost every buffer end. - Jay

No comments:

Post a Comment