I love talking about databases. I guess it’s because they are such a foundational piece of a backend system. Your choice of database will have a huge impact on the rest of the design, often for years to come. This is why I think having some first-hand experience with a few different databases is important. There are many trade-offs, so finding the right match for what you want to prioritize makes the process much easier. This is especially important if you’re looking to replace the data store for an existing system.
AWS’s Keyspaces for Apache Cassandra should definitely be on your radar.
The traditional choice is a SQL database like MySQL or Postgres running on a server you control. And traditional definitely doesn’t mean bad! This is a great option for many applications. There is a huge amount of resources online for working with and optimizing SQL databases. But, there are two things that might push you in a different direction: fault tolerance and scalability.
The rise of NoSQL databases was driven primarily by the need to address these two problems. The idea is straightforward: you get fault tolerance and scalability by giving up some of the features of SQL. SQL is very powerful, so giving those features up can actually add a lot of challenge to development. But, if you need the scale, it could be a necessity. And for me, the main appeal of NoSQL is a system that can deal with a server falling over at 2am without waking me up.
If fault tolerance is what you are after, a fully managed service might be the most tempting option. In this arrangement, a cloud provider actually manages the database for you entirely. There are no servers for you to maintain and operate at all. You just get a database you can connect to without any of the operational requirements. In exchange for all this, you do pay a premium. However, because most managed databases can be pay-on-demand, it is possible that you actually come out ahead for lower-volume applications.
A great example of a managed database service is AWS’s DynamoDB. Dynamo is a widely-used NoSQL datastore that offers extreme scalability and availability. It even has a free tier, so for very low usage, Dynamo can also be one of the cheapest options available.
I haven’t used Dynamo for a large-scale system, yet. It does seem like Dynamo can present some challenges around controlling costs as traffic goes up. But for me, the biggest issue I’ve run into is the developer experience. Dynamo can be a real challenge to use.
Like some other NoSQL databases, the success of a Dynamo design depends entirely on data modelling. Dynamo is basically a key-value store with an additional component for sorting. This might sound very constrained, but in practice you can use these simple tools for most data storage needs. The big issue is that incorrect data modelling cannot be easily fixed after the fact, so it is crucial to get it right up front. Because Dynamo offers very little structure, most of this modelling ends up feeling extremely ad-hoc.
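To make the "key-value store plus sorting" idea concrete, here is a rough in-memory sketch in Python. It is not DynamoDB's API. The class `ToyKeySortStore` and its methods are names I made up for illustration, and the sketch only assumes that rows within a partition are kept ordered by a string sort key and can be fetched by prefix.

```python
from bisect import bisect_left
from collections import defaultdict


class ToyKeySortStore:
    """Hypothetical stand-in for a partition-key + sort-key table."""

    def __init__(self):
        # partition key -> list of (sort_key, item), kept ordered by sort_key
        self._partitions = defaultdict(list)

    def put(self, partition_key, sort_key, item):
        rows = self._partitions[partition_key]
        # (sort_key,) compares lower than any (sort_key, item) pair,
        # so this finds the first row whose sort key is >= sort_key.
        index = bisect_left(rows, (sort_key,))
        if index < len(rows) and rows[index][0] == sort_key:
            rows[index] = (sort_key, item)  # same key: overwrite
        else:
            rows.insert(index, (sort_key, item))

    def query_prefix(self, partition_key, prefix):
        """Items in one partition whose sort key starts with prefix, in key order."""
        rows = self._partitions.get(partition_key, [])
        start = bisect_left(rows, (prefix,))
        result = []
        for sort_key, item in rows[start:]:
            if not sort_key.startswith(prefix):
                break  # rows are ordered, so nothing later can match
            result.append(item)
        return result


store = ToyKeySortStore()
store.put("user-1", "2020-05-02#logout", {"event": "logout"})
store.put("user-1", "2020-05-01#login", {"event": "login"})
store.put("user-1", "2020-06-01#login", {"event": "login again"})
may_events = store.query_prefix("user-1", "2020-05")
```

Everything you can ask of the store is shaped by how the sort key is laid out, which is why getting the layout wrong is so expensive to fix later.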
Here’s an actual example from AWS’s documentation for the structure of a critical element for Dynamo, a sort key:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
To get correct behavior, both the ordering and separation of fields are required. Those “#” symbols are essential. You’ll have to correctly construct and parse this string in your application code. It is absolutely doable, and some data access code can hide these details from most of your system. But, it’s cumbersome and offers no type-checking at the database-level. I find this annoying when iterating on a schema design. And it is especially problematic for manual interaction, which ends up being a real pain.
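As a sketch of what that application code might look like, here is one way to build and parse a "#"-separated sort key in Python. The helper names `build_sort_key` and `parse_sort_key` are hypothetical, and the field list is an assumption borrowed from the location columns in the CQL table later in this post. Note that nothing at the database level would catch a mistake here; all of the checking lives in your own code.

```python
SEPARATOR = "#"
# Assumed field order; it decides how keys sort and which prefixes you can query.
FIELDS = ("country", "region", "state", "county", "city", "neighborhood")


def build_sort_key(location):
    """Join the location fields, in FIELDS order, into one '#'-separated string."""
    parts = []
    for name in FIELDS:
        value = location[name]
        if SEPARATOR in value:
            # A stray separator inside a value would corrupt parsing later.
            raise ValueError(f"{name} must not contain {SEPARATOR!r}: {value!r}")
        parts.append(value)
    return SEPARATOR.join(parts)


def parse_sort_key(sort_key):
    """Split a sort key back into a dict keyed by field name."""
    parts = sort_key.split(SEPARATOR)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Illustrative values only.
location = {
    "country": "US",
    "region": "West",
    "state": "CA",
    "county": "Alameda",
    "city": "Oakland",
    "neighborhood": "Temescal",
}
key = build_sort_key(location)
round_tripped = parse_sort_key(key)
```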
Before Dynamo, I had quite a bit of experience using Cassandra. The two databases are actually quite similar. Both use the key-plus-sorting model. However, one major difference is that Cassandra uses a query language, called CQL. CQL borrows heavily from SQL, and will look familiar to SQL users. This can make for an easier transition into the NoSQL world.
CQL is also well-suited to help you manage the data modelling complexities that Cassandra imposes. Consider a similar CQL definition for the Dynamo example above:
CREATE TABLE events (
    id UUID,
    country TEXT,
    region TEXT,
    state TEXT,
    county TEXT,
    city TEXT,
    neighborhood TEXT,
    PRIMARY KEY (id, country, region, state, county, city, neighborhood)
);
Both the types and ordering of the critical elements are expressible in CQL. This makes it easier to define and also query these values. Cassandra also uses discrete tables, instead of Dynamo’s rather confusing single table paradigm.
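For contrast with a hand-built string key, here is a small Python sketch of what separate, ordered columns buy you. It is only a model of the idea, not the Cassandra driver: `EventRow`, `clustering_key` and `rows_matching` are hypothetical names, and I am assuming `id` acts as the partition key with the remaining columns ordering rows inside it, as in the table above.

```python
from typing import NamedTuple
from uuid import UUID


class EventRow(NamedTuple):
    # Mirrors the columns of the events table; id plays the partition key role.
    id: UUID
    country: str
    region: str
    state: str
    county: str
    city: str
    neighborhood: str


def clustering_key(row):
    """Everything after the partition key, in declared order."""
    return tuple(row[1:])


def rows_matching(rows, partition_id, *leading_values):
    """Rows in one partition whose leading columns equal leading_values, in key order."""
    selected = [
        row for row in rows
        if row.id == partition_id
        and clustering_key(row)[:len(leading_values)] == leading_values
    ]
    return sorted(selected, key=clustering_key)


# Illustrative data only.
pid = UUID(int=1)
rows = [
    EventRow(pid, "US", "West", "CA", "Alameda", "Oakland", "Temescal"),
    EventRow(pid, "US", "West", "CA", "Alameda", "Berkeley", "Elmwood"),
    EventRow(pid, "CA", "West", "BC", "Metro Vancouver", "Vancouver", "Kitsilano"),
]
us_west = rows_matching(rows, pid, "US", "West")
```

There is no separator to escape and no string to parse; the ordering falls out of the column order itself.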
It is definitely true that modelling data with Cassandra can still be tricky. But, I find it way easier to work on schemas in CQL. It definitely feels like a top-level design consideration for Cassandra. Once you get the basics down, it can actually be fun.
At a high level, Cassandra and Dynamo’s data modelling challenges are equivalent. While Cassandra does give you many more tools, at the end of the day, the designs end up being similar. My biggest problems with Cassandra have all been operational. It can be a tough database to maintain.
I’ve had Cassandra nodes die in the middle of the night without causing any issues at all. But, I’ve also had Cassandra clusters that required days of around-the-clock supervision when under stress. Monitoring and maintaining Cassandra is non-trivial. Keeping a cluster happy is a big job, particularly as the node count starts increasing.
This is why I got really excited when I found out about Keyspaces. Cassandra without the operations sounded like a dream to me. And with very few exceptions, it has lived up to my expectations.
AWS Keyspaces solves two major pain points with Cassandra. The first is that it completely eliminates the need to manage nodes and cluster capacity. This is absolutely huge, since in my experience this is the most difficult aspect of using Cassandra.
Keyspaces also reduces the cost of running Cassandra for small-scale applications dramatically. With a normal Cassandra cluster, you’d probably need to run three servers just to get started. Keyspaces can be billed on-demand, so low usage means low costs. Keyspaces makes Cassandra a great choice for smaller, cost-sensitive projects. It’s also excellent for experimentation, where you can get something started in minutes.
One important thing to keep in mind is that Keyspaces is not (yet?) 100% Cassandra-compatible. There are some types and consistency levels that are not supported. For the most part, I think this will be more of an issue for those trying to migrate from an existing Cassandra cluster into Keyspaces. If you are starting out new, or thinking of going from Dynamo to Keyspaces, it should be much less of an issue.
I started experimenting with Keyspaces as soon as a preview was announced. It’s been great. It is true that Dynamo is more mature. But, I think Keyspaces’ developer ergonomics are vastly superior and make it worth consideration. If you have been using Dynamo in the past, you’ll find Cassandra really nice. CQL is a breath of fresh air compared to working with Dynamo queries. And, if you are a Cassandra user, I think you’ll absolutely love having someone else manage your cluster. Database choice is tough because there are just so many trade-offs. I love having options, but right now, I’m not sure I need any.
Keyspaces is my new default choice for a datastore in AWS.
Mon, May 25, 2020 - Matt Massicotte