Here at Speedb’s performance lab, we’ve written about the three key amplification factors that affect performance; each has both an efficiency and a performance component. RocksDB is known for excellent space efficiency, thanks to its LSM-tree data structure and its flexible, efficient compaction of data written to the filesystem.
Let’s start with a view of how leveled compaction works in RocksDB to see the advantages and potential challenges you can run into.
A Primer on RocksDB Leveled Compaction
RocksDB’s data structure and compaction capabilities are part of the secret sauce that makes it a high-performance and versatile platform. Data is first written to an active in-memory buffer (the memtable) until it fills, at which point it becomes immutable and is eventually flushed to the filesystem. Throughout, data integrity is maintained by writing to a journal (the write-ahead log), so data can be recovered in the event of a disruption.
The first level is L0; the levels below it are L1, L2, L3, and so on down to Ln, where n is the upper bound, also referred to as Lmax.
Each non-zero level has a configurable target size. As a level reaches capacity, a compaction process is triggered, which reorganizes the data and moves it down to the next level. The data is made up of multiple SST (Static Sorted Table) files, and the target size usually increases exponentially at each level, as you can see in our example here:
There can be multiple versions of a key at L0 because the SST files at L0 are stored in the order they are generated, with overlapping key ranges. This creates some interesting challenges: as the number of files in L0 increases, query performance degrades, because each L0 file may need to be checked, even though these recently written files are often still cached in memory.
This changes as data is compacted into L1: the key range of each file is reorganized so there is no overlap between files, and only one version of each key is written to L1. Keys are then updated and rewritten over time as they move through subsequent levels, down to Lmax.
The compaction background process is triggered when the number of files in L0 reaches the threshold set by level0_file_num_compaction_trigger. It moves one or more SST files out of L0 by compacting and writing their data into L1.
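In the C++ API this threshold lives on `Options`; a minimal sketch, using the default value of 4:

```cpp
#include <rocksdb/options.h>

// Sketch: configure the L0->L1 compaction trigger (4 is the RocksDB default).
rocksdb::Options MakeL0TriggerOptions() {
  rocksdb::Options options;
  options.level0_file_num_compaction_trigger = 4;  // compact once 4 L0 files exist
  return options;
}
```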
The goal of compaction is to keep the size of each level under its target. As data is flushed from L0 into L1 and L1 exceeds its target size, data is compacted from L1 into L2, and so on down to Lmax.
RocksDB also supports intra-L0 compaction. By compacting and reorganizing data into larger files within L0, you can reduce the number of L0 files without immediately paying for a full L0-to-L1 compaction.
This can increase read/query performance, though it comes with its own set of tradeoffs, depending on write frequency and the type and size of your data.
Periodic Compaction in RocksDB
You can also trigger compaction periodically, by elapsed time rather than only when L0 fills, using the options.periodic_compaction_seconds setting. The default is the sentinel value UINT64_MAX – 1, which lets RocksDB control the period; when enabled, RocksDB defaults to 30 days. Setting it to 0 disables periodic compaction entirely.
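A sketch of setting it via the C++ API (the weekly value is purely illustrative, not a recommendation):

```cpp
#include <rocksdb/options.h>

// Sketch: run periodic compaction weekly instead of relying on the default.
// 0 disables it; the default sentinel (UINT64_MAX - 1) lets RocksDB decide.
rocksdb::Options MakePeriodicCompactionOptions() {
  rocksdb::Options options;
  options.periodic_compaction_seconds = 7 * 24 * 60 * 60;  // illustrative
  return options;
}
```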
It’s unlikely that you would want to halt periodic compaction entirely, but finding the right period requires investigation of your application’s usage pattern and many hardware factors.
Goal of 90% of Data in Lmax
The default RocksDB compaction and level structure aims to keep roughly 90% of total data in the Lmax level, which greatly benefits space efficiency. The cascading compaction and SST file reorganization has obvious benefits, but there are resource costs and tradeoffs depending on how you configure compaction.
There can be surprising results during actual operation, with unexpected impacts on the application. Efficiency also depends on the data itself, such as how you handle deletes and long-lived keys, which affects how they are compacted.
Multiple Compaction Algorithms?
RocksDB was built to take the best of LevelDB and optimize it for more diverse workloads. That meant extending it to support multiple compaction algorithms that more closely match the unique requirements of each application.
We’ve now taken a look at leveled compaction, but there are multiple compaction algorithms to choose from, including classic leveled, leveled-N, tiered (universal), tiered + leveled, and FIFO. Within each, there are many configurable options that affect compaction behavior, performance, and efficiency.
Beyond the built-in algorithms, you can also use the Speedb storage engine, other third-party implementations, or custom compaction options.
What Can Go Wrong with Compaction?
A few common issues with the compaction process need to be watched for. They can be caused both by how you tune compaction and by general system and application behavior.
RocksDB may encounter WriteStall issues for a number of reasons including:
- Too many large memtables – can trigger OOM errors and cause WriteStalls while the memtables are being flushed
- Too many pending compaction bytes – many levels require compaction that can’t keep up, exceeding the soft_pending_compaction_bytes_limit or hard_pending_compaction_bytes_limit
- Too many L0 SST files – as mentioned above, each SST file in L0 may need to be checked during a read because of the potential for multiple versions of the same key. When there are too many files in L0, a WriteStall blocks writes until compaction catches up.
These are just three specific reasons that can trigger a WriteStall condition and more exist that can come from a wide variety of configuration setting choices that impact local resources.
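As a sketch, these are the `Options` knobs most directly tied to the three causes above; the values shown are RocksDB defaults, not recommendations:

```cpp
#include <rocksdb/options.h>

// Sketch: options related to the three WriteStall causes discussed above.
rocksdb::Options MakeStallTuningOptions() {
  rocksdb::Options options;
  // Memtables: memory use is roughly write_buffer_size * max_write_buffer_number.
  options.write_buffer_size = 64 << 20;  // 64 MiB per memtable
  options.max_write_buffer_number = 2;
  // Pending compaction bytes: writes slow at the soft limit, stop at the hard limit.
  options.soft_pending_compaction_bytes_limit = 64ull << 30;   // 64 GiB
  options.hard_pending_compaction_bytes_limit = 256ull << 30;  // 256 GiB
  // L0 file count: slowdown and full-stop thresholds.
  options.level0_slowdown_writes_trigger = 20;
  options.level0_stop_writes_trigger = 36;
  return options;
}
```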
Running the compaction threads has a direct impact on CPU queuing and CPU utilization. As compaction is triggered, automatically or manually, CPU usage rises and can spike while compaction runs.
The increase is generally nominal, but it can spike as updates occur, and we have found that performance and efficiency change along with it.
Your Application Workload Has Changed
There are many cases where application usage patterns change, such as an increase in writes. The optimal compaction configuration depends on the choices you’ve made, and each parameter has direct and indirect performance and efficiency impacts on the application and the overall system.
Incorrect Choice of Compaction Priority
RocksDB gives you four options for choosing which files will be compacted in each compaction run.
- kByCompensatedSize – prioritizes files whose size is most inflated by tombstones (deletes)
- kOldestLargestSeqFirst – suited to workloads that update a few hot keys in small ranges
- kOldestSmallestSeqFirst – suited to uniform updates across the entire key space
- kMinOverlappingRatio – picks files with the smallest ratio of overlapping size in the next level to their own size; this is the current default
These choices also have deeper effects as general system conditions change (e.g. storage upgrades, memory increases) and as your application’s consumption evolves.
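Selecting a priority is a one-line change on `Options`; a minimal sketch:

```cpp
#include <rocksdb/options.h>

// Sketch: choose the compaction priority via the C++ API.
rocksdb::Options MakeCompactionPriOptions() {
  rocksdb::Options options;
  options.compaction_pri = rocksdb::kMinOverlappingRatio;  // the default
  return options;
}
```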
Results (sometimes unexpected) May Vary
You don’t have to look far to see how challenging it is to set the right configuration, or how unexpected the impacts can be. This is one of the many challenges posed by the sheer number of configuration options. Optimizing is incredibly complex: the individuality of each application, plus many application-specific and environment-specific factors, causes problems that vary as the application and data evolve over time.
There is a lot of work being done in the community to benchmark and validate configuration changes, but the end results can vary greatly by application. Mark Callaghan shared a deep dive into benchmarking RocksDB while trying to improve MyRocks versus InnoDB.
Making changes to the compaction configuration can profoundly affect your application and overall system performance and efficiency. RocksDB is versatile and has been used at scale in many environments, but it requires careful, uniquely application-dependent tuning.
Solve RocksDB Compaction Woes with Speedb
Speedb Enterprise offers a 100% compatible drop-in replacement library for RocksDB that introduces ‘multi-dimensional compaction’. This updated compaction data structure can shift dynamically between universal and leveled compaction, and can compact small parts of a level to avoid forced full compaction runs. It adjusts in real time to the particular workload and comes completely pre-tuned, avoiding the hassle of complex configuration.
Speedb’s updated compaction engine allows RocksDB to hit maximum performance without IO hangs and stalls related to traditional compaction options.
In addition, Speedb Enterprise allows scaling RocksDB capacity 20x, to 1TB per node, before performance is affected. This enables scale-up per node, reducing servers and computing resources and avoiding the need for sharding to maintain performance.
Check out Speedb OSS at our GitHub repo, or the Enterprise version on our website to learn more.
Or contact us directly to discuss your RocksDB challenges – we can help.