Understanding Leveled Compaction

24.10.2022

RocksDB is highly versatile for applications that need low-latency databases. Part of what we need to understand when tuning for efficiency and performance is the way RocksDB stores and organizes data.


We’ve written on the three key amplification factors that affect performance, each of which has both an efficiency and a performance component. RocksDB is known for excellent space efficiency thanks to its LSM data structure plus flexible, efficient compaction of data written to the filesystem.

Let’s start with a view of how leveled compaction works in RocksDB to see the advantages and potential challenges you can run into.

A Primer on RocksDB Leveled Compaction

The way the RocksDB data structure and compaction capabilities work is part of the secret sauce that makes it a high-performance and versatile platform. Data is first written to an active in-memory table (the memtable) until it fills, at which point the memtable becomes immutable and is eventually flushed to the filesystem. Data integrity is maintained throughout by writing to a journal so data can be recovered in the case of disruption.

The first level is L0, which holds SST files flushed directly from immutable memtables. The Ln notation is used for the rest of the structure: L1, L2, L3, up to Ln, where n is the upper bound, also referred to as Lmax.

Each non-0 level has a configurable target size. Compaction is triggered as capacity is reached, reorganizing the data and moving it down to lower levels. The data is made up of multiple SST (Static Sorted Table) files, and the target size typically increases exponentially at each level.
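As a concrete sketch of that exponential growth, assuming the common RocksDB defaults of a 256 MB level-1 base size (max_bytes_for_level_base) and a 10x multiplier (max_bytes_for_level_multiplier):

```python
def level_target_sizes(base_bytes, multiplier, num_levels):
    """Target size for L1..Ln grows exponentially: L(k) = base * multiplier**(k-1)."""
    return [base_bytes * multiplier ** k for k in range(num_levels)]

# With a 256 MiB L1 target and a 10x multiplier:
targets = level_target_sizes(256 * 2**20, 10, 4)
# L1 = 256 MiB, L2 = 2.5 GiB, L3 = 25 GiB, L4 = 250 GiB
```

Each level's target is an order of magnitude larger than the one above it, which is what lets the bottom level hold the vast majority of the data.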

There can be multiple versions of a key at L0 because the SST files at L0 are stored in the order they are generated, with overlapping key ranges. This creates an interesting challenge: as the number of files in L0 increases, query performance degrades, because each L0 file may have to be checked for any given key.

This changes as data gets compacted into L1, where the key range of each file is reorganized so there is no overlap between files. Only one version of each key is written down to L1, and keys are updated and rewritten over time through subsequent levels down to Lmax.
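A minimal sketch of why overlapping L0 files hurt reads: every L0 file must be probed newest-first, while a non-overlapping level like L1 has at most one file that can contain the key. (Illustrative only, not the actual RocksDB read path.)

```python
def get(key, l0_files, l1_files):
    """l0_files: newest-first list of dicts (key ranges may overlap).
    l1_files: list of (min_key, max_key, dict) with disjoint, sorted ranges."""
    # L0: any file may contain the key, so probe each one, newest first.
    for f in l0_files:
        if key in f:
            return f[key]
    # L1: ranges are disjoint, so at most one file needs checking.
    for lo, hi, f in l1_files:
        if lo <= key <= hi:
            return f.get(key)
    return None

l0 = [{"a": 2}, {"a": 1, "b": 1}]                    # "a" appears twice; newest wins
l1 = [("a", "m", {"c": 0}), ("n", "z", {"x": 9})]
get("a", l0, l1)  # → 2 (newest L0 version shadows the older one)
```

The cost of the L0 scan grows linearly with the file count, which is exactly why L0 file count is a tuning and stall trigger.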

Level 0

The compaction background process will be triggered when the number of files in L0 reaches the threshold in the level0_file_num_compaction_trigger setting. This moves one or more SST files from L0 by compacting and writing the data to L1. 
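In pseudocode, the trigger amounts to a simple file-count check (4 is the RocksDB default for level0_file_num_compaction_trigger; real scheduling is more involved):

```python
L0_FILE_NUM_COMPACTION_TRIGGER = 4  # RocksDB default

def should_compact_l0(num_l0_files):
    """L0 -> L1 compaction is scheduled once the file count hits the trigger."""
    return num_l0_files >= L0_FILE_NUM_COMPACTION_TRIGGER

should_compact_l0(3)  # → False: below the trigger
should_compact_l0(4)  # → True: compaction of L0 into L1 is scheduled
```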

L1's target size breached

The goal of compaction is to always keep the size of each level under its target. As data flushed from L0 causes L1 to exceed its target size, data is compacted and moved from L1 into L2, and so on down to Lmax.
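The cascade can be sketched as: whenever a level exceeds its target, move the overflow down one level, repeating until only Lmax can be over target. (A simplification that ignores file boundaries and real scheduling.)

```python
def cascade(level_sizes, targets):
    """level_sizes[i] / targets[i] are bytes in L1..Lmax; overflow moves down."""
    sizes = list(level_sizes)
    for i in range(len(sizes) - 1):       # Lmax has no target to enforce
        overflow = sizes[i] - targets[i]
        if overflow > 0:
            sizes[i] -= overflow
            sizes[i + 1] += overflow      # compact the overflow into the next level
    return sizes

cascade([300, 1000, 0], [100, 1000, None])  # → [100, 1000, 200]
```

In the example, L1's 200-byte overflow lands in L2, pushing L2 over its own target, so 200 bytes cascade on into L3.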

Level compaction

Intra-L0 Compaction

There is a capability within RocksDB for intra-L0 compaction. By compacting and reorganizing small files into larger files within L0, you reduce the number of files a read must check and can defer compaction into L1.

This can increase read/query performance, though it comes with its own tradeoffs that depend on the frequency of writes and the type and size of your data.
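As an illustration, intra-L0 compaction merges several small L0 files into one larger L0 file, cutting the number of files a read must probe. (Hypothetical helper; RocksDB's actual heuristics consider file sizes and compaction scores.)

```python
def intra_l0_compact(l0_files, max_merge=4):
    """Merge up to max_merge newest-first L0 files into one; newer keys win."""
    merged = {}
    for f in reversed(l0_files[:max_merge]):  # apply oldest first so newest overwrites
        merged.update(f)
    return [merged] + l0_files[max_merge:]

files = [{"a": 3}, {"a": 2, "b": 1}, {"c": 5}]
intra_l0_compact(files)  # → [{"c": 5, "a": 3, "b": 1}] — three probes become one
```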

Periodic Compaction in RocksDB

You can also trigger compaction on a time interval, rather than only when L0 is flushed down to L1, using the options.periodic_compaction_seconds setting. The default value is UINT64_MAX – 1, which lets RocksDB control the period; RocksDB defaults to 30 days. Periodic compaction can be disabled entirely by setting the value to 0.
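In options form this might look like the following sketch (the 7-day value is purely illustrative):

```python
SECONDS_PER_DAY = 86_400

options = {
    # Default is UINT64_MAX - 1, meaning "let RocksDB decide" (30 days);
    # 0 disables periodic compaction entirely.
    "periodic_compaction_seconds": 7 * SECONDS_PER_DAY,  # force compaction weekly
}

options["periodic_compaction_seconds"]  # → 604800
```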

It’s unlikely that you would want to halt periodic compaction but knowing the precise setting requires investigation based on your application usage pattern and many hardware factors. 


Goal of 90% of Data in Lmax

The default RocksDB compaction and level structure ensures that roughly 90% of total data is held in the Lmax level, which greatly benefits space efficiency. The cascading compaction and SST file reorganization have obvious benefits, but there are resource costs and tradeoffs depending on how you configure compaction.
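The 90% figure follows directly from the exponential target sizes: with a 10x multiplier, each level is ten times the one above it, so Lmax holds 10/11 ≈ 90% of the data in L1..Lmax. A quick check:

```python
def lmax_fraction(multiplier, num_levels):
    """Fraction of total data in the last level when each level is multiplier x larger."""
    sizes = [multiplier ** k for k in range(num_levels)]
    return sizes[-1] / sum(sizes)

round(lmax_fraction(10, 4), 3)  # → 0.9 (1000 out of 1 + 10 + 100 + 1000)
```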

There can be surprising results during actual operations, with unexpected impacts on the application. Efficiency may also depend on the data itself, such as how you handle deletes and the aging of long-lived keys, both of which affect how data is compacted.

Multiple Compaction Algorithms?

RocksDB was built to take the best of LevelDB and optimize it for more diverse workloads. This meant extending it to support multiple compaction algorithms that more closely match the unique requirements of each application.

We’ve now taken a look at leveled compaction, but there are multiple compaction algorithms that can be used, including classic leveled, leveled-N, tiered, tiered + leveled, and FIFO. Within each algorithm there are many configurable options that affect compaction behavior, performance, and efficiency.

Beyond the built-in algorithms, there is also the option of using the Speedb engine, other third-party engines, or custom compaction implementations.

What Can Go Wrong with Compaction?

A few common issues with the compaction process need to be watched for. They can arise despite (and sometimes because of) your compaction tuning choices, as well as general system and application behavior.

WriteStall Issues

RocksDB may encounter WriteStall issues for a number of reasons including:

  • Too many large memtables – can trigger OOM errors and cause WriteStalls while the memtables are being flushed
  • Too many pending compaction bytes – many levels require compaction that can’t complete, exceeding the soft_pending_compaction_bytes_limit or hard_pending_compaction_bytes_limit setting
  • Too many L0 SST files – as mentioned above, each SST in L0 must be queried in turn because of the potential for multiple instances of the same key. When there are too many files in L0, a WriteStall can occur to block writes while they are compacted.

These are just three specific reasons that can trigger a WriteStall condition and more exist that can come from a wide variety of configuration setting choices that impact local resources.
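The three conditions above can be sketched as threshold checks. The option names and default values below match RocksDB's documented defaults as far as we know; the real logic also distinguishes write slowdowns from full stops.

```python
def write_stall_reasons(num_immutable_memtables, pending_compaction_bytes,
                        num_l0_files, opts):
    """Return the list of conditions that would stall writes."""
    reasons = []
    if num_immutable_memtables >= opts["max_write_buffer_number"]:
        reasons.append("too many memtables")
    if pending_compaction_bytes >= opts["hard_pending_compaction_bytes_limit"]:
        reasons.append("too many pending compaction bytes")
    if num_l0_files >= opts["level0_stop_writes_trigger"]:
        reasons.append("too many L0 files")
    return reasons

opts = {
    "max_write_buffer_number": 2,                        # RocksDB default
    "hard_pending_compaction_bytes_limit": 256 * 2**30,  # default 256 GB
    "level0_stop_writes_trigger": 36,                    # RocksDB default
}
write_stall_reasons(1, 0, 40, opts)  # → ["too many L0 files"]
```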

CPU Impact

Running the process threads for compaction has a direct impact on CPU queuing and percentage CPU utilization. As compaction is triggered automatically or manually the CPU will spike and increase while compaction runs. 

The increase is generally nominal but can spike as updates occur, and both performance and efficiency shift while compaction runs.

Your Application Workload Has Changed

There are many cases where application usage patterns change such as an increase in writes. The optimal compaction configuration depends on the choices you’ve made and each parameter has direct and indirect performance and efficiency impact on the application and overall system. 

Incorrect Choice of Compaction Priority

You have four options in RocksDB to choose which files will be compacted in each compaction process.

  • kByCompensatedSize – prioritize the files with the most tombstones first
  • kOldestLargestSeqFirst – for workloads that update a few hot keys in small ranges
  • kOldestSmallestSeqFirst – for uniform updates across the entire key space
  • kMinOverlappingRatio – prioritize files with the lowest ratio between their overlapping size in the next level and their own size

These also have deeper effects as general system performance changes (e.g. storage, memory increase, etc.) and your application consumption as well.
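For instance, kMinOverlappingRatio can be sketched as picking the file whose data overlaps the next level the least relative to its own size, which minimizes the bytes rewritten per byte compacted. (Illustrative code, not RocksDB's implementation.)

```python
def pick_by_min_overlapping_ratio(files):
    """files: list of (name, file_size, overlapping_bytes_in_next_level).
    Lower overlap per byte compacted -> less write amplification."""
    return min(files, key=lambda f: f[2] / f[1])[0]

files = [("a.sst", 100, 500),   # ratio 5.0
         ("b.sst", 100, 50),    # ratio 0.5  <- cheapest to compact
         ("c.sst", 200, 400)]   # ratio 2.0
pick_by_min_overlapping_ratio(files)  # → "b.sst"
```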

Results (sometimes unexpected) May Vary

You don’t have to look far to see how challenging it is to set the right configuration, or how unexpected the impacts can sometimes be. This is one of the many challenges that come with the sheer number of configuration options. Optimizing is incredibly complex: the individuality of each application, plus many other factors, causes application-specific and environment-specific problems that vary as the application and data evolve over time.

There is a lot of work being done in the community to benchmark and validate configuration changes, but the end results can vary greatly by application. Mark Callaghan shared a deep dive into benchmarking RocksDB in an effort to improve MyRocks versus InnoDB.

Conclusion

Making changes to compaction configuration can have a profound effect on your application and the system performance and efficiency. RocksDB is versatile and has been used at scale in many environments but requires careful consideration when tuning which is uniquely application-dependent. 

Have a question? Chat with one of our engineers.

Solve the Multi-Dimensional Performance and Efficiency Challenge with Speedb

Speedb offers a drop-in embedded replacement for RocksDB, tailored to your hyperscale data processing needs. Click here to learn how Speedb can give you bespoke customization services to address use-case-specific requirements, including adaptive auto-tuning of system parameters to ensure high performance for any workload.
