Now that we’re officially an open source project, I’d like to make a confession in the spirit of the most important value of open source companies - transparency and openness:
Until about 2 years ago, I didn't even know what a storage engine was.
I remember the first time that Hilik, Speedb’s Chief Scientist and Co-Founder, came to me all excited, telling me that he had built what he thought could be the future of storage engines.
I was pretty embarrassed to tell him: “Cool, that’s really great, but what the hell is a storage engine?”
“Ah,” Hilik replied, “you surely know about LSM trees, right?”
I looked a bit dazed and answered, “Yeah, I mean, I think so. It's that thing that’s like a B-tree but is actually not…” (Spoiler alert: I didn’t have a clue). Most embarrassing is the fact that, up to that point, I’d enjoyed a rather long career in the storage industry, and I thought I knew quite a lot :)
So my first motivation to learn about the storage engine world was to avoid feeling like an idiot in my next talk with Hilik. It’s a well-known fact that many applications essentially act as an interface that sits on top of a database of some kind, and that database is connected to a storage device.
If this is the case, I had one question: Where was the f***ing storage engine hiding? And how is it possible that almost every data application (including a database for that matter) uses a storage engine and I didn't even know about it?
Let's start at the beginning...
Unless you live in a world with no internet, our world today revolves around data. That data is stored on various storage devices: memory, disks, flash drives, or any other kind of device that can store information.
The applications that we use in our day-to-day life seem to know more and more about us. This is because they’re accessing a lot of information that comes from data that is stored somewhere.
This data can be organized and saved in many different ways, such as databases, file systems, and object stores, all of which eventually store the data on the kinds of media mentioned above.
Regardless of where it’s stored, ensuring quick and easy access to data is one of the most important considerations when designing an application. This is not a trivial task at all. We can get data from endless sources at speeds that were never possible before (we’ll cover this in a different post).
One of the most popular ways to store data is to use a database of some kind. The database world has come a long way since the days when you had a central monolithic database that held all the data in a structured way. Today we have distributed and non-distributed databases, structured and unstructured databases, and additional ways of storing data.
When we store data in a database, it’s organized in a logical way that allows us to easily access it later on. But eventually, the data needs to reside on a physical storage system where it can remain, safely and securely, while the software layer above it performs its magic.
Right there, between the database and the storage, hides the “storage engine”. So, why is this hidden (usually quiet) creature suddenly becoming interesting, and why have we decided to build a company that will create the world's next storage engine?
As the old saying goes, a chain is only as strong as its weakest link. And we at Speedb think that in the data chain, from the application down to the storage, every layer has been nicely developed and evolved throughout the years. However, the storage engine has been largely ignored, or more accurately, has only recently begun to receive the attention it deserves from giants like Google and Facebook, who actually needed to design their own data stack.
The story about storage engines is that they were originally created to store metadata - the critical “data about the data” that companies utilize for recommending movies to watch, products to buy, etc. This metadata also tells us when the data was created, where exactly it’s stored, and much more.
As always, giants like Google and Facebook are the first to encounter technical problems, simply because they work at a far greater scale than other companies. As a result, these giants are more quickly exposed to the weaker links in their stack than other smaller companies, which encounter those same weaknesses more gradually over time.
You see where I'm going with this: The use of storage engines started to grow in parallel to the growth of metadata, and as long as the metadata challenge was “owned” by Facebook and Google, so were the initial solutions to address it.
So why are storage engines now being used by almost every application that manages data? More importantly, why are the storage engines that Google and Facebook created not suitable for the entire market, and why is the problem just getting worse?
Let's start with the fact that the way we consume data has changed. Today, we consume information from multiple small devices. Applications that used to reside on large servers in huge companies are now accessible to every one of us via our smartphones and other digital devices.
This plethora of information sources actually changes the way basic data is being captured, saved, processed and retrieved. And the most important part of this information is metadata. Without metadata, we would simply have tons of information without any way to access it effectively. That’s why it is paramount to have a data infrastructure that can manage metadata properly.
Consequently, what used to be a problem only for tech giants like Facebook or Google, which needed to get data from millions of users at the same time and offer value to their customers, is now everyone’s problem. Whether you’re a bank, an eCommerce website, a cybersecurity vendor or basically any other company, you want to understand your own data.
However, what works for tech giants doesn’t necessarily work for everyone else. As the use of storage engines continues to expand, so does the demand for innovation that could take storage engines to the next level, and address a broader variety of use cases based on newly developed capabilities.
While the storage engines developed by the tech giants were designed for their particular needs, they may not be suitable for everyone else. Per Maslow’s famous quote, “if the only tool you have is a hammer, you tend to see every problem as a nail.”
Following this analogy, for the hyperscalers, every big data problem looks like a problem that can be solved by sharding.
Sharding is a valid workaround when it comes to handling large amounts of data, and sometimes, it’s even a good solution, especially if you have unlimited resources. But, and there is a big “but” here, sharding is extremely complex, since developers need to architect the solution for sharding from the get-go.
Even when well-planned, sharding is a tedious task that requires developers to spend more and more precious time on maintenance instead of coding. On top of that, sharding is a costly process that requires increasing investment in resources to handle the growing number of datasets. With a cloud infrastructure, scaling becomes easier, but it comes with a high price, and sharding is a prime example of that.
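To make the resharding pain a bit more concrete, here is a minimal sketch of the simplest sharding scheme I can think of: hash the key and take the remainder by the number of shards. This is an illustration only, not how any particular database or Speedb works; the function names `shard_for` and `moved_fraction` and the `user:N` keys are made up for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard index using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Made-up keys, purely for illustration.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys change shards when going from 4 to 5 shards")
```

With this naive modulo scheme, adding a single shard forces most keys to move (around four out of five in expectation for the 4-to-5 case), which is one reason resharding needs to be planned for up front. Schemes such as consistent hashing reduce the movement, but they add their own complexity.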
Basically, clients simply want to use an embedded storage engine in their application without the need to shard or reshard the system depending on the data size and scale. And that’s exactly what we are aiming to solve here at Speedb.
Our next post will discuss why Speedb is a superior alternative to sharding, and why we moved from closed source to open source.