In the world of data engines, RocksDB has become widely adopted and very popular, in large part due to its well-known use inside Facebook and at many other companies. There are many possible use cases for such an embedded library, each implying different application patterns, and each requiring a distinctly different RocksDB configuration to handle its performance needs.
Let’s unpack why tuning RocksDB is such a challenge, and how it led to the work we have done on RocksDB performance optimization at Speedb.
Tuning RocksDB per Application Use-Case
Since RocksDB is designed to make the best use of the latest high-speed memory and flash technologies, you can achieve extremely high performance when it is optimally tuned. Tuning is individualized, based on the infrastructure and application architecture as well as the expected utilization pattern. Performance comes at a cost, both figuratively (configuration choices) and literally (memory and flash storage are expensive).
The first determining factor is the application usage pattern, which suggests which parameters to prioritize when optimizing each RocksDB instance. So the first question is: which type of performance profile do you seek?
You have 3 choices, choose one, or if you must, two…
- Read-intensive - a high ratio of reads to writes - this could be an application like video streaming, social media, or retail, which will have high rates of search and read activity.
- Write-intensive - a high ratio of writes to reads - applications with many new inserts, key updates, or log writes, which put pressure on write transactions both in memory and when memory/cache is flushed to disk.
- Space-intensive - data growth is the critical scaling factor - this could be an application like Twitter or Uber, which will have high rates of change and new key-value inserts that expand the database files and increase space usage on disk.
The reason we say one OR two is that you can have an application that is read-intensive (lots of reads into memory and reads from the file system into memory/cache) but also holds a large amount of data, making it both read- and space-intensive.
RocksDB has tremendous flexibility for storing simple key-value pairs or complex, compound objects, as well as supporting both small and large data objects. Each application may have many services, each with their own RocksDB instances or sharing a single instance.
Flexibility also makes it more challenging to understand utilization patterns. Fortunately, RocksDB has many tunable parameters to support all these use-cases. While it’s great to have lots of knobs and dials, it’s also one of the dark sides of RocksDB when trying to find the optimal configuration.
3 Amplification Factors in a Key-Value Store
The three core performance factors in a key-value store are write amplification, read amplification, and space amplification. Each has significant implications for the application’s eventual performance, stability, and efficiency. It’s also a living challenge that constantly morphs and evolves as the application’s utilization, infrastructure, and requirements change over time.
It’s a constant set of trade-offs, because your key-value store is likely tuned for only one or two of the three amplification factors. Read, write, and space amplification each impact your application’s performance individually, and understanding these relationships collectively is critical.
Write Amplification
Write amplification is the ratio of the total bytes physically written to the bytes in the logical write operation. As the data is moved, copied, and sorted within the internal levels, it is re-written again and again - or amplified. Write amplification varies based on source data size, number of levels, size of the memtable, amount of overwrites, and other factors, which we will discuss more extensively in a different post.
A simplified example of write amplification in LSM using RocksDB:
- Data is written to the memtable, and an entry is written to the write-ahead log
- The memtable fills up and becomes an immutable/read-only memtable
- When the memtable size exceeds its target, a flush to persistent files at Level 0 (L0) is triggered
- Compaction merges, sorts, and removes stale key entries in L0 and writes the data to L1
- L1 exceeds its target size, triggering compaction, and SST files are moved to L2…and the process can continue to the max level (Lmax)
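The cascade of steps above can be sketched as a toy model. This is illustrative only: the level count and the assumption that each byte is rewritten once per level are simplifications, not RocksDB’s actual behavior.

```python
# Toy model of LSM write amplification: every logical byte is written to the
# WAL, flushed to L0, and then rewritten once per deeper level it is
# compacted into. The rewrite factor per level is a simplifying assumption.

def lsm_write_amplification(levels: int, rewrites_per_level: float = 1.0) -> float:
    """WAF = WAL write + L0 flush + one rewrite per deeper level (L1..Lmax)."""
    wal = 1.0                                   # write-ahead log entry
    flush = 1.0                                 # memtable flush to L0
    compactions = levels * rewrites_per_level   # L0->L1, L1->L2, ...
    return wal + flush + compactions

# A 6-level tree where each key is rewritten once per level:
print(lsm_write_amplification(levels=6))  # -> 8.0
```

Even this deliberately optimistic model never gets below a WAF of 2, and real compaction rewrites each level’s data more than once, which is how the much larger real-world figures arise.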
Even this simple example shows how many writes a single row update can cause. From our team’s experience across many environments, we have seen storage engine WAF up to 30X (3000%) higher than the theoretical lowest calculated cost.
The theoretical WAF is further multiplied below the LSM, at the flash tier. For example, flash storage must be erased before being rewritten, which introduces another performance penalty. This adds to the amplification and increases the risk of I/O hangs and degraded application performance. The actual end-to-end latency is often not apparent until production, with live workloads.
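Because the flash translation layer adds its own device-level write amplification on top of the storage engine’s, the two factors multiply. The numbers below are illustrative assumptions, not measurements:

```python
# End-to-end write amplification compounds: the flash device's internal
# garbage collection and erase-before-write behavior multiply the storage
# engine's WAF. Both figures below are made-up illustrations.

engine_waf = 30.0   # bytes the engine writes per logical byte (worst case)
device_waf = 2.5    # extra rewrites inside the SSD from erase/GC cycles

end_to_end_waf = engine_waf * device_waf
print(end_to_end_waf)  # -> 75.0
```

This multiplicative effect is why a high engine-level WAF hurts far more in practice than the engine-level number alone suggests.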
Adding more memory to the system may seem like the ideal performance boost, but it requires tuning to make the best use of it, and the benefit is entirely data-dependent, based on how your application writes to the data engine. It also introduces a trade-off between performance and the hardware cost of high-performance, expensive memory and storage tiers.
Read Amplification
Read amplification is the number of disk reads that an application read request causes. If a 1K data query is not found in rows stored in the memtable, the read request goes to the files in persistent storage.
RocksDB uses bloom filters to reduce the amount of unnecessary disk reads which helps reduce read amplification. The type of query (e.g. range query versus point query) and size of the data request will also impact the read amplification and overall read performance.
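The idea behind a bloom filter can be shown in a minimal sketch: a small per-file filter lets a point lookup skip SST files that definitely do not contain the key, so those disk reads never happen. The sizes and hash scheme below are illustrative assumptions, not RocksDB’s actual filter implementation.

```python
import hashlib

# Minimal Bloom filter sketch. A "no" answer is definitive (the file can be
# skipped); a "maybe" answer means the file must actually be read. Bit count
# and hash count here are arbitrary illustrations.

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive several bit positions from independent salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = True

    def may_contain(self, key: str) -> bool:
        # False means "definitely absent": skip the disk read entirely.
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per SST file: only files whose filter says "maybe" are read.
sst_filter = BloomFilter()
sst_filter.add("user:42")
print(sst_filter.may_contain("user:42"))  # -> True
```

The trade-off is a small chance of false positives (a wasted read) in exchange for never missing a key, which is why filters reduce but cannot eliminate read amplification.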
The performance of reads will also vary over time as application usage patterns change.
Space Amplification
Space amplification is the ratio of the amount of storage/memory space consumed by the data to the actual size of the data. It is affected by the type and size of the data written and updated by the application, whether compression is used, the compaction method, and the frequency of compaction.
A number of factors affect space amplification, including:
- Having a large amount of stale data that has not been garbage collected yet
- Level target size and compaction
- Number of inserts and updates
- Compaction algorithm and frequency
- Compression algorithm, and how it is applied
All of these affect space amplification, and tuning each has its own implications and challenges. Space amplification also matters for the utilization and saturation of storage, especially in higher-cost storage tiers like flash and NVRAM.
There are lots of additional tuning options that affect space amplification. You can customize the way compression and compaction behave, set the level depth and target size of each level, and tune when compaction occurs to help optimize data placement.
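The ratio itself is simple to state: bytes on disk divided by the logical (live) size of the data. The figures below are made-up illustrations of how stale, not-yet-compacted versions inflate the numerator:

```python
# Space amplification = bytes consumed on disk / logical size of live data.
# Stale versions awaiting compaction inflate the numerator; compression
# shrinks it. All numbers here are illustrative, not measurements.

def space_amplification(disk_bytes: float, logical_bytes: float) -> float:
    return disk_bytes / logical_bytes

live_data_gb = 100.0
on_disk_gb = 100.0 + 40.0  # live data plus stale, not-yet-compacted versions

print(space_amplification(on_disk_gb, live_data_gb))  # -> 1.4
```

More frequent compaction would shrink the stale portion (lowering space amplification) at the cost of more rewrites (raising write amplification), which is exactly the trade-off discussed above.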
All 3 of these amplification factors are also affected by the workload/data type, the memory/storage infrastructure, and the pattern of utilization by the application itself.
The Multi-Dimensional Performance Challenge
RocksDB has hundreds of tunable parameters. These include the write buffer size, active memory usage, when to flush to disk, and when and how to compact files, just to name a few. We would love a write amplification factor of n where n is as low as possible. A commonly found WAF of 30 will drastically impact application performance compared to a more ideal WAF closer to 5.
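A few of the most commonly tuned knobs are sketched below. The option names mirror RocksDB’s Options struct, but the values are arbitrary starting points for illustration, not recommendations; in real RocksDB the bloom filter is configured via the table factory’s filter policy rather than a top-level option.

```python
# Illustrative sketch of commonly tuned RocksDB options, grouped by the
# amplification factor each one chiefly trades against. Values are arbitrary
# starting points, not recommendations.

tuning_sketch = {
    # Write path: bigger buffers batch more writes before a flush (write amp)
    "write_buffer_size": 64 * 1024 * 1024,    # memtable size before flush
    "max_write_buffer_number": 3,             # memtables kept in memory
    # Compaction: when and how aggressively levels merge (write/space amp)
    "level0_file_num_compaction_trigger": 4,  # L0 files before compaction
    "max_bytes_for_level_base": 256 * 1024 * 1024,  # target size of L1
    # Read path: filters cut unnecessary disk reads (read amp); in real
    # RocksDB this is set via NewBloomFilterPolicy on the table options
    "bloom_filter_bits_per_key": 10,
}

for option, value in tuning_sketch.items():
    print(f"{option} = {value}")
```

Each knob pulls on at least one of the three amplification factors, which is why changing one setting in isolation rarely produces the improvement you expect.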
We assume that a write to memory is reasonably low in performance cost and reads are also very low cost. But as soon as data needs to be moved from memory to storage, we have a direct increase to the write cost that’s affected by the hardware, OS caching, memory usage, and disk performance.
As your application is writing data with inserts into active memory, RocksDB is also flushing from active memory to disk, and a background task may be running to reorganize the file structure to free up space (compaction). Even though SSD and flash are low latency they require blocks to be erased and rewritten which incurs another performance hit and affects the lifetime durability of the flash hardware. All of this is affecting the end-to-end performance of your application.
Reads also face a penalty that depends on system resource contention, and it varies greatly. A read of a key still in active memory is obviously very low in performance cost. In reality, though, your query may need to go to a file (or multiple files), bring the data up into the cache, and be impacted by other I/O happening at the same time.
The only way to really understand the impact of RocksDB on your application’s performance is to rigorously measure actual utilization in varied settings and under differing loads. Many factors affect read performance, write performance, and space utilization. The complexities and interdependencies present a myriad of choices and compromises to be navigated.
Tuning your RocksDB instance is a constantly shifting combination of hardware performance, OS performance, data read/write patterns, and other active processes. It’s no simple task, unless perhaps you’re Facebook. There are also ways to reduce WAF and improve overall performance through technology that goes beyond what tuning alone can do.
The Speedb platform offers a simple drop-in replacement for RocksDB, specifically tuned and tailored to your application workload. Customers using Speedb have seen improvements such as a reduction of write amplification from 30X to 5X. Speedb also makes many other significant improvements, including our compaction algorithm to reduce space amplification and workload-specific tuning to optimize your RocksDB read performance.
Stay tuned as we continue further with upcoming blogs about the finer intricacies of RocksDB, how it works, and what parameters affect performance and cost for real-world applications.