To a large extent, customer experience is no longer a differentiator for online businesses but rather a basic expectation. Any company that does not provide a nearly flawless experience will feel the impact of customer dissatisfaction on the bottom line. Customers today are intolerant of issues, especially application stalls and long response times. With so many alternatives, users simply switch to someone else’s app and never look back.

Customer experience management is a frustrating task: unexpected problems arise constantly, causing stalls, performance degradation, broken applications, and more. Companies must make huge investments in improving the customer experience across the stack to remain competitive. However, as computing environments become increasingly complex and interwoven, it is ever more likely that an overlooked component will break down and eventually impact the performance of the entire system.

Practically any component in a system can become a bottleneck, from the storage and network layers through the CPU to the application GUI. In most cases, when the root cause is at the upper levels of the stack, the entire system might not be affected dramatically, and the problem can be fixed relatively easily. But when the root cause is buried deep in the system, finding it might not be that simple. At the same time, the deeper the root cause, the greater the impact on the system.

For example, performance hits are commonly related to I/O bottlenecks in the storage engine, also known as the data engine: the deepest and "lowest" part of the software stack, responsible for sorting and indexing data. As such, it can be seen as the weakest link in the system, since any I/O hang originating in this layer may trickle up the stack and cause significant delays. This is where errors start to surface and users begin to abandon the application.

The reason data engine I/O bottlenecks are becoming increasingly common is the ever-growing volume of data handled by modern systems. One of the main drivers of this continuous data explosion is the growth of unstructured data in the form of objects arriving from an increasing number and variety of sources: documents, audio and video files, IoT and sensor data, and more. In particular, the metadata associated with these objects is becoming a major issue, as a rapidly growing number of objects that may be only a few bytes in size can now carry metadata of about the same size, and sometimes even more.

As the volume of metadata continues to grow, the shortcomings of existing data engines become apparent. Currently available data engines are based on architectures that were not designed to support the scale of modern datasets. Most of them use Log-Structured Merge (LSM) tree-based key-value stores (KVS) to manage metadata. In an LSM tree-based KVS, key-value pairs are arranged in Sorted String Tables (SSTables). SSTables are immutable files: they are never updated in place. Instead, when new or updated data arrives, additional SSTables are created. Periodically, multiple SSTables are merged and sorted into a single SSTable in a process called compaction, which allows for faster access and retrieval of data.
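
To make these mechanics concrete, here is a minimal, illustrative sketch of the LSM write path in Python: writes land in an in-memory memtable, full memtables are flushed to immutable sorted SSTables, and compaction merges several SSTables into one. All names and sizes here are hypothetical; real engines add write-ahead logs, bloom filters, multiple levels, and far more.

```python
# Toy LSM tree: memtable -> immutable SSTables -> compaction.
# Illustrative sketch only; not how any production engine is implemented.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}               # mutable in-memory buffer
        self.sstables = []               # immutable sorted tables, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable first, then SSTables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all SSTables into one, keeping the newest value per key.
        merged = {}
        for table in self.sstables:      # oldest first, so newer tables overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())]

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", f"value{i}")
db.compact()
print(db.get("key3"))  # value3
```

Note how a lookup may have to scan several SSTables until compaction folds them together; that read and space cost is exactly what compaction trades against extra write I/O.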

The problem with this method is that compaction involves significant I/O overhead: the same data is read and rewritten again and again as SSTables are merged, so each logical write may hit the disk many times, a phenomenon known as write amplification. As a result, an LSM tree-based KVS may experience periodic I/O surges when moving large datasets, resulting in performance bottlenecks.
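
A rough back-of-the-envelope calculation shows how quickly this adds up. In a leveled LSM tree, each key-value pair is typically rewritten about once per level it passes through, and each of those rewrites costs roughly fanout-many bytes of I/O per byte pushed down. The numbers below (a fanout of 10 and 5 levels) are illustrative assumptions, not measurements of any particular engine:

```python
# Back-of-the-envelope write amplification for a leveled LSM tree.
# Assumptions (illustrative): fanout of 10 between levels, 5 levels.

fanout = 10
levels = 5
user_gb_written = 1  # 1 GB of logical user writes

# Classic rough estimate for leveled compaction: WA ~= fanout * levels.
write_amplification = fanout * levels
disk_gb_written = user_gb_written * write_amplification

print(f"Write amplification: ~{write_amplification}x")
print(f"{user_gb_written} GB of user writes -> ~{disk_gb_written} GB of disk I/O")
```

In other words, a single gigabyte of logical writes can translate into tens of gigabytes of physical I/O, and that background traffic competes with foreground reads and writes for the same devices.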

Accordingly, one of the main design objectives of the Speedb data engine was to eliminate I/O hangs. To accomplish that, we revamped the basic components of the KVS. For example, we developed a new compaction method that dramatically reduces write amplification for large-scale LSM trees, and a new flow control mechanism that eliminates spikes in user latency. The result is a data engine that natively supports write-intensive workloads, enabling our customers to achieve new levels of performance and consistency without compromising storage capacity or agility. By addressing this critical component, businesses can ensure that the entire system doesn't grind to a halt every other week, and protect their customer experience.
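
This post does not detail Speedb's flow control algorithm, but the general idea behind write flow control in LSM engines can be sketched generically: instead of letting writes run at full speed and then stalling completely when compaction falls behind, the engine throttles incoming writes gradually in proportion to the backlog. The thresholds and function below are hypothetical, purely for illustration; they do not describe Speedb's actual mechanism.

```python
# Generic write-throttling sketch for an LSM engine (not Speedb's algorithm).
# Idea: scale the allowed write rate down smoothly as compaction backlog grows,
# instead of running at full speed and then stalling hard.

def allowed_write_rate(max_rate_mb_s, backlog_bytes,
                       soft_limit=2 * 2**30, hard_limit=8 * 2**30):
    """Return the permitted write rate given the pending-compaction backlog.

    Below soft_limit: full speed. Between the limits: linear slowdown.
    At or above hard_limit: a trickle instead of a dead stop.
    All limits are illustrative assumptions.
    """
    if backlog_bytes <= soft_limit:
        return max_rate_mb_s
    if backlog_bytes >= hard_limit:
        return max_rate_mb_s * 0.01
    # Linear interpolation between full speed and the trickle rate.
    frac = (backlog_bytes - soft_limit) / (hard_limit - soft_limit)
    return max_rate_mb_s * (1.0 - 0.99 * frac)

for backlog_gb in (1, 3, 5, 7, 9):
    rate = allowed_write_rate(500, backlog_gb * 2**30)
    print(f"backlog {backlog_gb} GB -> allow ~{rate:.0f} MB/s")
```

Smoothing the throttle this way turns a cliff (full speed followed by a dead stop) into a gentle slope, which is what keeps user-visible latency from spiking.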

