panthema / 2010 / stx-cbtreedb
Thumbnail of structure of the packed, sequential B-tree

STX Constant B-Tree Database Template Classes

TalkBox
Gravatar Timo:

Alexey: the library works just like you expected it to: constant keys are mapped to variable length data blobs.

Gravatar Alexey:

Thanks, Timo! :-) This thing is ugly fast on reading! But will this thing break if I will pump it with variable-sized blobs ( having constant-size keys, however )? Thanks in advance!

Gravatar Timo:

Yes, the node sizes are configureable from the template parameter PageSize and all nodes are concatenated on disk.

Gravatar Artem:

Are the nodes page-aligned?

Gravatar Timo:

The two b-tree libraries are only related in ideas and underlying data structure. They do not share source code and are independent from each other.

Gravatar thanhnt:

is CBTreeDB related to stx::btree?




Posted on 2010-04-14 by Timo Bingmann at Permlink.

Summary

The stx::CBTreeDB is a collection of C++ classes with which read-only key-value database files can be created and read. A database efficiently maps a large number of integral fixed-length keys to opaque binary value blobs. Variable-length or duplicate keys are currently not supported. Keys are organized into a highly compact index structure, which is very similar to a B-tree and allows very fast key lookups. Both keys and values are stored in order and thus queries in a local proximity can benefit from caching effects. All applications mapping a large number of constant, integral keys to string or data blobs can benefit from this library.

Key features of the database classes are:

The complete B-tree source code is contained in the header file in doxygen stx-cbtreedb.h or with plain text comments stx-cbtreedb.h.

The main header code is covered to 91.3% by test cases. A graphical display of the test suite's coverage can be viewed online.

See the README file below for a more detailed overview.

Downloads

STX Constant B-Tree Database Template Classes Version 0.7.0 (current) released 2010-04-14
Source code archive:
(includes Doxygen HTML)
Download stx-cbtreedb-0.7.0.tar.bz2 (480kb)
MD5: 0832044ee1abcfa001e0ff2d47d068c0
Browse online
 
Extensive Documentation: Browse documentation online

License

The B-tree source code is released under the GNU Lesser General Public License v2.1 (LGPL) which can be found in the file COPYING. Other parts of the source code were copied from the Botan library and are under a BSD-license.

Git Repository

The git repository containing all sources and packages is fetchable by running git clone https://github.com/bingmann/stx-cbtreedb.git
Some further drawings and history may also available there.

README

Original Application

The original purpose of this library is to organize lookups of 32-bit integer identifiers to constant data blobs. In my application a few million non-sequential identifier keys are mapping to data blobs, which are unalterable short strings or small (1-10 kb) data files. Most key lookups occur in close proximity, usually in ascending order.

The resulting data volume is around 20 GB in size and was previously stored using the BerkeleyDB library. However, since the key and values in my application are read-only, the overhead both in file size and processing time introduced by BerkeleyDB became unacceptable. Using stx::CBTreeDB access to the value records is faster than before, and file sizes were reduced to a minimum due to the compact sequential storage in the read-only databases.

An alternative to the B-tree database classes are the well-known cdb or tinycdb libraries. However, these are basically hash tables and thus do not preserve key locality. Thus retrieval of 10 keys in ascending order requires 10 disk accesses at pseudo-random places in the database. With the B-tree library the disk areas read are stored in ascending order, just like the keys.

Design Principles and Database Architecture

Most design principles follow directly from the intended application.

Maybe the most important design aspect was to store the data blobs without separation in the most compact way possible. Thus the read-only database file simply stores all data blobs in sequence, without putting them onto data pages or similar overhead typical of a read-write database.

To accelerate key lookups an index structure very similar to a B-tree is prepended to the data area. This "packed, sequential" B-tree is constructed by the Writer classes and only differs from the original B-tree in one aspect. The nodes of each level containing the highest keys may be less than half full. The basic idea of the "packed, sequential" B-tree is to use only full nodes and creating inner nodes and levels as needed. All nodes except the one with the highest key are full! Thus the number of nodes used by the search tree is minimal and all lookup properties of the B-tree are retained. A drawing of the B-tree structure can be seen below:

Structure of the packed, sequential B-tree

SVG viewer required

All B-tree nodes are stored "in order" after the database file's header. The root node is always first, followed by all nodes of the next level, which in turn are followed by the next level until the leaf level is reached.

See the corresponding structures for the data fields in these nodes: stx::CBTreeDB::InnerNode and stx::CBTreeDB::LeafNode. These structures use some less obvious optimizations to further reduce overhead:

The in order storage of all B-tree nodes makes saving of offsets for each child node of an inner node unnecessary: the n+1 children nodes are stored consecutively starting after the first node. Thus the page offset of a child node can be calculated from only the first node's offset and the child number. This optimization removes about half of ordinarily needed fields in an inner node (stx::CBTreeDB::InnerNode) and allows a higher fan-out.

Each leaf node (stx::CBTreeDB::LeafNode) contains both an array of keys and corresponding data offsets. For size optimization all data offsets are 32-bit values relative to a base 64-bit starting offset. This reduces the size of the offset array and allows more keys to fit into a leaf node. This also imposes a restriction on data size: the sum of all data handled by one leaf must not exceed 2^32. This constraint is currently unchecked by the library, but should not occur in practice. Data sizes are calculated from subsequent data offsets. Furthermore, each leaf node contains one more offset number than strictly necessary: the offset of the data item following the highest key in the leaf. This offset is used to calculate the data size for the highest key without requiring retrieval of the following leaf.

Implementation Overview

All classes of the cbtreedb library are enclosed in the top-level template class stx::CBTreeDB. This class is parametrized by two types and two integer values. See the class documentation for more information on the template parameters.

The library contains two writer classes, which are used to create read-only databases from a set of key-value pairs. The difference between the two classes is the order in which key-value pairs are delivered to the class. The stx::CBTreeDB::Writer allows keys to be added in random order to the internal std::map. When finished the B-tree is constructed and the whole set is written to the database in one function call.

The obvious problem with Writer is that with large databases all key-value data must be stored in memory until written. This prohibits use of this writer class for very large databases, which are a main goal of the library. For purpose of writing large databases the sequential writer class stx::CBTreeDB::WriterSequential can be used. With this writer the key sequence must be delivered in ascending order and the database is constructed in two phases: in the first phase key and value-lengths are declared and the B-tree is constructed, and in the second phase the key-value pairs are delivered and written directly to the file with no extra buffering.

Note that both writer classes create _identical_ databases for equal input sets.

There is only one stx::CBTreeDB::Reader class. The Reader class itself is a pointer implementation to a referenced counted implementation object (stx::CBTreeDB::ReaderImpl), so you can easily copy Reader objects. Note however, that non of the classes are reentrent or thread-safe. For multi-threaded applications access must be guarded by mutexes, patches welcome.

Each Reader object may optionally have an associated stx::CBTreeDB::PageCache object, which is then used to cache hot B-tree pages like the root. The PageCache contains the most recently used pages (LRU replacement strategy) up to the maximum number specified. It features a structure allowing O(1) Store() and Retrieve() functions. A PageCache object can be shared between different database Readers. For more information see the corresponding documentation page.

A Reader object can load exactly one database using the Open() function. Note that the class template parameters must match those used to write the database. Some parameters are checked by the Open() function and appropriate error messages are returned.

Once opened the database can be queried using Exists(), Lookup(), GetIndex() and the operator[] functions as if it were a map.

Example Usage

Please see the doxygen HTML documentation for example usage of the Reader and Writer classes.