This is a chance to see how these two data structures perform in real-world applications and find out how to tune RocksDB for your specific application needs.
Fundamentals of B-Tree and LSM Tree Data Structures
There are two data structures used in key-value stores: B-Tree and LSM tree. If you’re using an RDBMS then you’ve most likely experienced B-tree in action. B-Tree can also be found in some key-value databases. LSM can be found in more modern relational and key-value databases. LSM is especially popular for scalable, write-intensive applications because of its speed and efficiency for writes and updates.
Each has its own advantages, disadvantages, and tuning options. Let’s start with the core of how each works.
B-tree data structure is so called because data is self-organized in the form of a tree with data sorted for efficient sequential access. B-Tree structure uses an organized multi-level tree structure to minimize the number of disk reads which lends itself to being a strong performer for read-intensive operations.
Nodes have a set capacity and as inserts occur data splits and is written to child nodes. B-tree handles inserts well but is not as well-suited to high-speed deletions and updates. The ideal use-case for B-tree has been where high performance is needed for mostly read operations.
An advantage of B-tree is that it is efficient with both random and sequential reads because of the shallow depth of the file structure. The height of a B-tree is kept low by the way that leaf nodes are created as keys are inserted. The organization of the keys is designed for quick retrieval by not having to traverse many nodes as key searches occur.
B-tree can experience write bottlenecks when data is inserted into the tree sequentially, plus B-tree also struggles with frequent updates or deletions which causes fragmentation in the data.
LSM tree is a data structure that is at the core of many more modern data and storage platforms that need write-intensive throughput. With LSM, writes are done very quickly in-memory and a transaction log (write-ahead log - WAL) is written to protect the data as it waits to be flushed from memory to persistent storage.
Speed and efficiency are increased because LSM uses an append-only write process that allows rapid sequential writes with no fragmentation challenges like B-trees are subject to. Inserts and updates can be made much faster and then the leveled tree structure is organized (and re-organized continuously) with a background compaction process to optimize key structure in the persistent Static Sort Table (SST) files.
The main disadvantage you will hear about LSM is that read performance can be poor if data is accessed in small, random chunks. Queries/reads are not done in the recursive way as a B-tree query which lends itself to good read performance. There are ways to mitigate this with the use of indexes, bloom filters, and other tuning for file sizes, block sizes, memory usage, and other tunable options.
The Pros and Cons of B-Tree
B-trees are well suited for workloads where reads and writes are evenly balanced, and data is accessed in small, random chunks. This makes them a good choice for general purpose databases.
Pros of B-Tree:
- Excellent performance with read-intensive workloads
- Mature structure (developed in the 1970s) used by many older storage engines
Cons of B-Tree:
- Increased space overhead to deal with fragmentation
- Uses random writes which causes slower create/insert behavior
- Concurrent writes may require locks which slows write performance
- Scaling challenges, especially with >50% write transactions
Pros and Cons of LSM Tree
LSM trees are especially well-suited for workloads where writes are more common than reads, or when data is accessed in large sequential chunks. LSM uses an append-only write structure which makes it super-efficient for writes. Compression and compaction help to keep data organized and reduce overall storage utilization.
Pros of LSM:
- More space efficient by using append-only writes and periodic compaction
- Better performance with fast-growing data and sequential writes
- No fragmentation penalty because of how SSTable files are written and updated
Cons of LSM:
- CPU overhead for compaction can meaningfully affect performance and efficiency if not tuned appropriately
- More tuning options increase flexibility but can seem complex to developers and operators
- Read/search performance can be optimized with the use of bloom filters
Workload patterns should define which data structure you should use, right? If the general capabilities of key-value stores using either data structure are similar it is a matter of choosing the right tradeoffs. What if we could optimize the more flexible data structure and focus on the greater of the features?
What Makes LSM a Better Choice?
B-Tree seems like a great fit for read-intensive workloads while LSM is an obvious choice for write-intensive applications. For overall flexibility, LSM has the most capabilities and tunable options that make it a great base to innovate on. There are still many tweaks needed as things scale but the compaction features of LSM make it a highly versatile data structure option.
Let’s look at a couple of use-case examples:.
LSM for Write-Intensive Workloads
LSM data structure is designed by default to support more write-intensive workloads. Running a default or incorrectly tuned RocksDB configuration will leave you thinking that LSM is not all it’s cracked up to be.
There are hundreds of combinations for tuning options to target performance and resiliency. These all come with potential tradeoffs. Many of the common configuration options focus on one workload profile which likely sacrifices specific optimization for a generally optimized deployment.
The default compaction configuration functions reasonably but requires tuning to optimize for different workloads and resource usage patterns. RocksDB can support different compaction algorithms which can help tailor more to your application needs but also requires that you deeply understand the performance and usage patterns to be able to tune for specific workloads.
LSM is designed for better write performance which can greatly improve application usability. There are also enough options to help tune RocksDB for both performance and hardware durability.
Lots of dials and knobs help adapt to the diversity of workloads but require a deep understanding of both the application and the many options to optimize.
LSM for Read-Intensive Workloads
LSM may be geared towards write performance but indexes and bloom filters help to manage read/query performance. There is also the option to shard the database which can help optimize as the data size increases.
The challenge that we usually see in RocksDB, or any key-value store implementation, is that one bottleneck is released which then moves the issue to the next bottleneck downstream. This requires careful design and planning for storage architecture, file and memory configuration, and data layout for optimal read performance.
EXAMPLE: Inefficient reads are happening, so you increase the amount of memory for indexes and bloom filters. You are decreasing the amount of false positives during key searches but the memory usage is rising above where you’ve designed the system for.
Read performance will be affected by write operations. Write operations will be affected by compaction. Read query performance will be affected by both raw memory and flash performance. All of these factors are also application-specific and will change as usage patterns and data patterns adjust over time.
This shows the use case for each application is unique and requires both upfront measurement of what needs optimization and revisiting performance tuning and resource utilization regularly.
You have many considerations when choosing your key-value store database platform beginning with the platform itself and then the data structure which matches your application needs.
Making the choice between B-trees and LSM can be a challenge that many developers and operators face at both the design and implementation phases. Choosing your data structure and data engine will include knowing your requirements (e.g. data size, update/delete frequency, read/write ratios) and often change from workload to workload and also depend on infrastructure choices (e.g. cloud, container, hypervisor).
For the Speedb data engine, we chose to customize the RocksDB LSM tree structure in a way that significantly reduces write amplification as much as from 30X down 5X. We’ve also enhanced compaction and tuned the data engine to minimize space usage and lower CPU utilization. LSM had the built-in advantage for write performance and we continue to work on getting the most out of this very versatile data structure and the highly tunable RocksDB platform.