Understanding Leveled Compaction

24.10.2022

RocksDB is highly versatile for applications that need low-latency databases. Part of what we need to understand when tuning for efficiency and performance is the way RocksDB stores and organizes data.


We’ve written on the three key amplification factors that affect performance, each of which has both an efficiency and a performance component. RocksDB is known for excellent space efficiency thanks to its LSM data structure plus flexible, efficient compaction of data written to the filesystem.

Let’s start with a view of how leveled compaction works in RocksDB to see the advantages and potential challenges you can run into.

A Primer on RocksDB Leveled Compaction

The way the RocksDB data structure and compaction capabilities work is part of the secret sauce that makes it a high-performance and versatile platform. Data is first written to an active in-memory table (the memtable) until it fills, at which point the memtable becomes immutable and is eventually flushed to the filesystem. Data integrity is maintained throughout by writing to a journal so data can be recovered in the case of disruption.

The first level is L0, which holds SST files flushed directly from immutable memtables. The Ln notation is used for the rest of the structure: L1, L2, L3, up to Ln, where n is the upper bound, also referred to as Lmax.

Each non-0 level has a configurable target size. Compaction is triggered as capacity is reached, reorganizing the data and moving it down to lower levels. The data is made up of multiple SST (Static Sorted Table) files, and the target size typically increases exponentially at each level.
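As a concrete sketch of that exponential growth, assuming the common RocksDB defaults of a 256 MB level-1 base size (max_bytes_for_level_base) and a 10x multiplier (max_bytes_for_level_multiplier):

```python
def level_target_sizes(base_bytes, multiplier, num_levels):
    """Target size for L1..Ln grows exponentially: L(k) = base * multiplier**(k-1)."""
    return [base_bytes * multiplier ** k for k in range(num_levels)]

# With a 256 MiB L1 target and a 10x multiplier:
targets = level_target_sizes(256 * 2**20, 10, 4)
# L1 = 256 MiB, L2 = 2.5 GiB, L3 = 25 GiB, L4 = 250 GiB
```

Each level's target is an order of magnitude larger than the one above it, which is what lets the bottom level hold the vast majority of the data.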

There can be multiple versions of a key at L0 because the SST files at L0 are stored in the order they are generated, with overlapping key ranges. This creates an interesting challenge: as the number of files in L0 increases, query performance degrades, because each L0 file may have to be checked for any given key.

This changes as data gets compacted into L1, where the key range of each file is reorganized so there is no overlap between files. Only one version of each key is written down to L1, and keys are updated and rewritten over time through subsequent levels down to Lmax.
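A minimal sketch of why overlapping L0 files hurt reads: every L0 file must be probed newest-first, while a non-overlapping level like L1 has at most one file that can contain the key. (Illustrative only, not the actual RocksDB read path.)

```python
def get(key, l0_files, l1_files):
    """l0_files: newest-first list of dicts (key ranges may overlap).
    l1_files: list of (min_key, max_key, dict) with disjoint, sorted ranges."""
    # L0: any file may contain the key, so probe each one, newest first.
    for f in l0_files:
        if key in f:
            return f[key]
    # L1: ranges are disjoint, so at most one file needs checking.
    for lo, hi, f in l1_files:
        if lo <= key <= hi:
            return f.get(key)
    return None

l0 = [{"a": 2}, {"a": 1, "b": 1}]                    # "a" appears twice; newest wins
l1 = [("a", "m", {"c": 0}), ("n", "z", {"x": 9})]
get("a", l0, l1)  # → 2 (newest L0 version shadows the older one)
```

The cost of the L0 scan grows linearly with the file count, which is exactly why L0 file count is a tuning and stall trigger.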

Level 0

The compaction background process will be triggered when the number of files in L0 reaches the threshold in the level0_file_num_compaction_trigger setting. This moves one or more SST files from L0 by compacting and writing the data to L1. 
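In pseudocode, the trigger amounts to a simple file-count check (4 is the RocksDB default for level0_file_num_compaction_trigger; real scheduling is more involved):

```python
L0_FILE_NUM_COMPACTION_TRIGGER = 4  # RocksDB default

def should_compact_l0(num_l0_files):
    """L0 -> L1 compaction is scheduled once the file count hits the trigger."""
    return num_l0_files >= L0_FILE_NUM_COMPACTION_TRIGGER

should_compact_l0(3)  # → False: below the trigger
should_compact_l0(4)  # → True: compaction of L0 into L1 is scheduled
```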

L1's target size breached

The goal of compaction is to always keep the size of each level under its target. As data flushed from L0 causes L1 to exceed its target size, data is compacted and moved from L1 into L2, and so on down to Lmax.
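The cascade can be sketched as: whenever a level exceeds its target, move the overflow down one level, repeating until only Lmax can be over target. (A simplification that ignores file boundaries and real scheduling.)

```python
def cascade(level_sizes, targets):
    """level_sizes[i] / targets[i] are bytes in L1..Lmax; overflow moves down."""
    sizes = list(level_sizes)
    for i in range(len(sizes) - 1):       # Lmax has no target to enforce
        overflow = sizes[i] - targets[i]
        if overflow > 0:
            sizes[i] -= overflow
            sizes[i + 1] += overflow      # compact the overflow into the next level
    return sizes

cascade([300, 1000, 0], [100, 1000, None])  # → [100, 1000, 200]
```

In the example, L1's 200-byte overflow lands in L2, pushing L2 over its own target, so 200 bytes cascade on into L3.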

Level compaction

Intra-L0 Compaction

There is a capability within RocksDB for intra-L0 compaction. By compacting and reorganizing small files into larger files within L0, you reduce the number of files a read must check and can defer compaction into L1.

This can increase read/query performance, though it comes with its own tradeoffs that depend on the frequency of writes and the type and size of your data.
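As an illustration, intra-L0 compaction merges several small L0 files into one larger L0 file, cutting the number of files a read must probe. (Hypothetical helper; RocksDB's actual heuristics consider file sizes and compaction scores.)

```python
def intra_l0_compact(l0_files, max_merge=4):
    """Merge up to max_merge newest-first L0 files into one; newer keys win."""
    merged = {}
    for f in reversed(l0_files[:max_merge]):  # apply oldest first so newest overwrites
        merged.update(f)
    return [merged] + l0_files[max_merge:]

files = [{"a": 3}, {"a": 2, "b": 1}, {"c": 5}]
intra_l0_compact(files)  # → [{"c": 5, "a": 3, "b": 1}] — three probes become one
```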

Periodic Compaction in RocksDB

You can also trigger compaction on a time interval, rather than only when L0 is flushed down to L1, using the options.periodic_compaction_seconds setting. The default value is UINT64_MAX – 1, which lets RocksDB control the period; RocksDB defaults to 30 days. Periodic compaction can be disabled entirely by setting the value to 0.
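In options form this might look like the following sketch (the 7-day value is purely illustrative):

```python
SECONDS_PER_DAY = 86_400

options = {
    # Default is UINT64_MAX - 1, meaning "let RocksDB decide" (30 days);
    # 0 disables periodic compaction entirely.
    "periodic_compaction_seconds": 7 * SECONDS_PER_DAY,  # force compaction weekly
}

options["periodic_compaction_seconds"]  # → 604800
```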

It’s unlikely that you would want to halt periodic compaction but knowing the precise setting requires investigation based on your application usage pattern and many hardware factors. 


Goal of 90% of Data in Lmax

The default RocksDB compaction and level structure ensures that roughly 90% of total data is held in the Lmax level, which greatly benefits space efficiency. The cascading compaction and SST file reorganization have obvious benefits, but there are resource costs and tradeoffs depending on how you configure compaction.
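The 90% figure follows directly from the exponential target sizes: with a 10x multiplier, each level is ten times the one above it, so Lmax holds 10/11 ≈ 90% of the data in L1..Lmax. A quick check:

```python
def lmax_fraction(multiplier, num_levels):
    """Fraction of total data in the last level when each level is multiplier x larger."""
    sizes = [multiplier ** k for k in range(num_levels)]
    return sizes[-1] / sum(sizes)

round(lmax_fraction(10, 4), 3)  # → 0.9 (1000 out of 1 + 10 + 100 + 1000)
```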

There can be surprising results during actual operations, with unexpected impacts on the application. Efficiency may also depend on the data itself, such as how you handle deletes and the aging of long-lived keys, both of which affect how data is compacted.

Multiple Compaction Algorithms?

RocksDB was built to take the best of LevelDB and optimize it for more diverse workloads. This meant extending it to support multiple compaction algorithms that more closely match the unique requirements of each application.

We’ve now taken a look at leveled compaction, but there are multiple compaction algorithms that can be used, including classic leveled, leveled-N, tiered, tiered + leveled, and FIFO. Within each algorithm there are many configurable options that affect compaction behavior, performance, and efficiency.

Beyond the built-in algorithms, there is also the option of using the Speedb engine, other third-party engines, or custom compaction implementations.

What Can Go Wrong with Compaction?

A few common issues with the compaction process need to be watched for. They can arise despite (and sometimes because of) your compaction tuning choices, as well as general system and application behavior.

WriteStall Issues

RocksDB may encounter WriteStall issues for a number of reasons including:

  • Too many large memtables – can trigger OOM errors and cause WriteStalls while the memtables are being flushed
  • Too many pending compaction bytes – many levels require compaction that can’t complete, exceeding the soft_pending_compaction_bytes_limit or hard_pending_compaction_bytes_limit setting
  • Too many L0 SST files – as mentioned above, each SST in L0 must be queried in turn because of the potential for multiple instances of the same key. When there are too many files in L0, a WriteStall can occur to block writes while they are compacted.

These are just three specific reasons that can trigger a WriteStall condition and more exist that can come from a wide variety of configuration setting choices that impact local resources.
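The three conditions above can be sketched as threshold checks. The option names and default values below match RocksDB's documented defaults as far as we know; the real logic also distinguishes write slowdowns from full stops.

```python
def write_stall_reasons(num_immutable_memtables, pending_compaction_bytes,
                        num_l0_files, opts):
    """Return the list of conditions that would stall writes."""
    reasons = []
    if num_immutable_memtables >= opts["max_write_buffer_number"]:
        reasons.append("too many memtables")
    if pending_compaction_bytes >= opts["hard_pending_compaction_bytes_limit"]:
        reasons.append("too many pending compaction bytes")
    if num_l0_files >= opts["level0_stop_writes_trigger"]:
        reasons.append("too many L0 files")
    return reasons

opts = {
    "max_write_buffer_number": 2,                        # RocksDB default
    "hard_pending_compaction_bytes_limit": 256 * 2**30,  # default 256 GB
    "level0_stop_writes_trigger": 36,                    # RocksDB default
}
write_stall_reasons(1, 0, 40, opts)  # → ["too many L0 files"]
```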

CPU Impact

Running the process threads for compaction has a direct impact on CPU queuing and percentage CPU utilization. As compaction is triggered automatically or manually the CPU will spike and increase while compaction runs. 

The increase is generally nominal but can spike as updates occur, and both performance and efficiency shift while compaction runs.

Your Application Workload Has Changed

There are many cases where application usage patterns change such as an increase in writes. The optimal compaction configuration depends on the choices you’ve made and each parameter has direct and indirect performance and efficiency impact on the application and overall system. 

Incorrect Choice of Compaction Priority

You have four options in RocksDB to choose which files will be compacted in each compaction process.

  • kByCompensatedSize – prioritize the files with the most tombstones first
  • kOldestLargestSeqFirst – for workloads that update a few hot keys in small ranges
  • kOldestSmallestSeqFirst – for uniform updates across the entire key space
  • kMinOverlappingRatio – prioritize files with the lowest ratio between their overlapping size in the next level and their own size

These also have deeper effects as general system performance changes (e.g. storage, memory increase, etc.) and your application consumption as well.
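For instance, kMinOverlappingRatio can be sketched as picking the file whose data overlaps the next level the least relative to its own size, which minimizes the bytes rewritten per byte compacted. (Illustrative code, not RocksDB's implementation.)

```python
def pick_by_min_overlapping_ratio(files):
    """files: list of (name, file_size, overlapping_bytes_in_next_level).
    Lower overlap per byte compacted -> less write amplification."""
    return min(files, key=lambda f: f[2] / f[1])[0]

files = [("a.sst", 100, 500),   # ratio 5.0
         ("b.sst", 100, 50),    # ratio 0.5  <- cheapest to compact
         ("c.sst", 200, 400)]   # ratio 2.0
pick_by_min_overlapping_ratio(files)  # → "b.sst"
```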

Results (sometimes unexpected) May Vary

You don’t have to look far to see how challenging it is to set the right configuration, or how unexpected the impacts can sometimes be. This is one of the many challenges that come with the sheer number of configuration options. Optimizing is incredibly complex: the individuality of each application, plus many other factors, causes application-specific and environment-specific problems that vary as the application and data evolve over time.

There is a lot of work being done in the community to benchmark and validate configuration changes, but the end results can vary greatly by application. Mark Callaghan shared a deep dive into benchmarking RocksDB in an effort to improve MyRocks versus InnoDB.

Conclusion

Making changes to compaction configuration can have a profound effect on your application and the system performance and efficiency. RocksDB is versatile and has been used at scale in many environments but requires careful consideration when tuning which is uniquely application-dependent. 

Have a question? Chat with one of our engineers.

Solve the Multi-Dimensional Performance and Efficiency Challenge with Speedb

Speedb offers a drop-in embedded replacement for RocksDB, tailored to your hyperscale data processing needs. Click here to learn how Speedb can give you bespoke customization services to address use-case-specific requirements, including adaptive auto-tuning of system parameters to ensure high performance for any workload.
