As part of the task to automate our production modeling pipeline, I was exploring options for storing offline training metrics on AWS. The goal is two-fold: 1) to keep track of training evaluation metrics and 2) to let our internal web app consume and display this data.
AWS provides various storage options, such as the file storage system S3 and the NoSQL database DynamoDB. This post summarizes the differences between these two services and how and why we chose one over the other. Bear in mind that the comparison here is based on our specific use case.
Fundamentally, S3 and DynamoDB are different storage systems: one is a file (object) store and the other is a database.
S3 is basically a file storage system that treats everything as an object. But, before that, there is the concept of a bucket. This is essentially a top-level folder that is used to group the data from your various applications. For example, you could use one bucket to store your log files and another bucket to back up your database. You can configure different permissions and other settings for each bucket. S3 actually doesn’t have a folder structure per se. That means S3 treats everything inside a bucket as a flat group of objects. The listKey API returns the keys (or file names, in layman’s terms) of all objects within a bucket. However, you can approximate a folder structure by naming an object with a prefix such as example/, so that the object is stored as example/foo in the bucket. To list all keys within that “folder”, you can pass the prefix example/ to the API.
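To make the prefix idea concrete, here is a minimal Python sketch that models a bucket as a flat list of keys and filters it by prefix. It does not talk to AWS; the `list_keys` helper and the sample keys are made up for illustration. With boto3, the corresponding request would be `list_objects_v2` with a `Prefix` argument.

```python
def list_keys(keys, prefix=""):
    """Return the keys that start with `prefix`, sorted.

    `keys` stands in for the flat set of object keys in one bucket.
    """
    return sorted(k for k in keys if k.startswith(prefix))


# Hypothetical bucket contents: there are no real folders, only key names.
bucket_keys = [
    "example/foo",
    "example/bar",
    "logs/2016-01-01.txt",
    "readme.txt",
]

print(list_keys(bucket_keys))              # every key in the bucket
print(list_keys(bucket_keys, "example/"))  # only the keys in the "folder"
```

The "folder" here is nothing more than a shared string prefix, which is why listing it is a filter over key names rather than a directory lookup.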
An object in S3 can be as large as 5TB, so it’s well suited to storing large objects. The latency is higher than DynamoDB’s, but S3 supports concurrent access out of the box, which means many clients can read and write at the same time without you worrying too much about it.
S3 has basic HTTP compatibility, so any application can be pointed straight at a bucket. It also supports versioning, so you can keep multiple variants of an object in the same bucket, although you need to turn it on deliberately as it’s off by default.
DynamoDB is a NoSQL database that can be used as a key-value store. Its selling points are low latency, high availability and scalability. It is really good for storing a lot of small items, since one of its limitations is the 400 KB item size limit (counting the binary length of both attribute names and attribute values). That limit was something of a deal-breaker for our use case, since some of the model metrics can be pretty large given the size of the dataset.
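As a rough illustration of that limit, the sketch below estimates the size of an item whose attributes are all strings or bytes and checks it against 400 KB. The helper names are hypothetical, the limit is assumed to mean 400 * 1024 bytes, and the estimate deliberately ignores numbers, lists and maps, which DynamoDB sizes differently.

```python
# Assumption: the 400 KB limit is interpreted as 400 * 1024 bytes.
MAX_ITEM_BYTES = 400 * 1024


def estimate_item_size(item):
    """Approximate item size: bytes of attribute names plus bytes of values.

    Only str and bytes values are handled; other types raise TypeError
    because their encoded size is not modeled in this sketch.
    """
    total = 0
    for name, value in item.items():
        total += len(name.encode("utf-8"))
        if isinstance(value, bytes):
            total += len(value)
        elif isinstance(value, str):
            total += len(value.encode("utf-8"))
        else:
            raise TypeError(f"unsupported attribute type for {name!r}")
    return total


def fits_in_item_limit(item):
    """Return True if the estimated size is within the assumed limit."""
    return estimate_item_size(item) <= MAX_ITEM_BYTES


# Hypothetical metrics records: a small summary and a large serialized curve.
small = {"model_id": "m1", "auc": "0.93"}
large = {"model_id": "m1", "roc_curve": "x" * 500_000}

print(estimate_item_size(small), fits_in_item_limit(small))
print(estimate_item_size(large), fits_in_item_limit(large))
```

A summary row fits easily, while a single serialized curve or figure can exceed the limit on its own, which is the situation we ran into.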
Every item stored in DynamoDB is identified by a unique primary key that consists of a hash key and, optionally, a range key. AWS suggests keeping hash key values as distinct as possible to ensure uniform load distribution, but it’s not required. Because of its database nature, DynamoDB offers good query performance given a reasonable index structure, and scans are therefore generally discouraged.
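The difference between a query and a scan can be sketched with a toy in-memory table. This is only a mental model, not how DynamoDB is implemented; `TinyTable` and the sample records are invented for the example.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table keyed by (hash key, range key)."""

    def __init__(self):
        # hash key -> {range key -> item}
        self._partitions = defaultdict(dict)

    def put(self, hash_key, range_key, item):
        self._partitions[hash_key][range_key] = item

    def query(self, hash_key, range_from=None):
        """Look at one partition only, returning items in range-key order."""
        partition = self._partitions.get(hash_key, {})
        return [
            partition[rk]
            for rk in sorted(partition)
            if range_from is None or rk >= range_from
        ]

    def scan(self, predicate):
        """Visit every item in every partition and filter afterwards."""
        return [
            item
            for partition in self._partitions.values()
            for item in partition.values()
            if predicate(item)
        ]


# Hypothetical data: hash key = model name, range key = training date.
table = TinyTable()
table.put("model-a", "2016-01-01", {"auc": 0.91})
table.put("model-a", "2016-02-01", {"auc": 0.93})
table.put("model-b", "2016-01-15", {"auc": 0.88})

print(table.query("model-a", range_from="2016-01-15"))
print(table.scan(lambda item: item["auc"] > 0.9))
```

The query touches a single partition, while the scan has to read the whole table before filtering, which is why scans get expensive as a table grows.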
Each DynamoDB table is stored on SSDs and kept in three geographically distributed replicas to ensure high availability, low latency and data durability.
The pricing models are also drastically different for the two services, and the cost can be huge if one doesn’t make a thoughtful choice based on usage.
S3’s pricing model is straightforward: it essentially charges a unit price per GB used. Specifically, AWS charges a storage price, a request price and a data transfer price. It is pretty cheap overall, since the primary use case of S3 is storing huge amounts of data. I found this simple price calculator to be very helpful.
DynamoDB has a slightly more complicated pricing model. It depends on a pre-specified throughput capacity (in units), a storage price, an optional service named DynamoDB Streams and a data transfer fee.
The throughput capacity is used to provision the table. It consists of a read capacity and a write capacity, which hides a lot of complexity from developers. Once the throughput capacity is specified, a flat hourly rate is charged. Storage follows an S3-like model, where a unit price per GB is charged on the uploaded data size plus a fixed indexing overhead. Again, a price calculator is available to estimate your monthly charge.
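The two pricing structures can be written down as simple formulas. The sketch below is only that: the function names are hypothetical, every rate is a caller-supplied placeholder rather than a real AWS price, and Streams and DynamoDB data transfer are left out.

```python
def s3_monthly_cost(storage_gb, requests, transfer_gb, *,
                    price_per_gb, price_per_1k_requests,
                    price_per_gb_transfer):
    """Storage + request + data transfer charges for one month."""
    return (storage_gb * price_per_gb
            + requests / 1000 * price_per_1k_requests
            + transfer_gb * price_per_gb_transfer)


def dynamodb_monthly_cost(read_units, write_units, storage_gb, *,
                          hourly_price_per_read_unit,
                          hourly_price_per_write_unit,
                          price_per_gb, hours=730):
    """Flat hourly charge for provisioned throughput, plus storage."""
    throughput = (read_units * hourly_price_per_read_unit
                  + write_units * hourly_price_per_write_unit) * hours
    return throughput + storage_gb * price_per_gb


# Placeholder rates for illustration only; look up current prices before use.
s3_cost = s3_monthly_cost(
    100, 10_000, 5,
    price_per_gb=0.02, price_per_1k_requests=0.01,
    price_per_gb_transfer=0.10,
)
ddb_cost = dynamodb_monthly_cost(
    10, 5, 100,
    hourly_price_per_read_unit=0.001, hourly_price_per_write_unit=0.002,
    price_per_gb=0.25,
)
print(round(s3_cost, 2), round(ddb_cost, 2))
```

The structural point is that the DynamoDB throughput term is paid for every hour the table is provisioned, whether or not anything reads from it, while the S3 request term only grows with actual usage.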
In summary, S3 and DynamoDB are both great AWS services with different use cases in mind. S3 is a general storage solution aimed at users who need to store huge amounts of unstructured data. DynamoDB is a NoSQL database that sets out to solve the scalability challenge for many web applications.
For us, since the offline metrics can be pretty large and may include unstructured data such as figures, S3 seems to be the more viable choice of the two. Plus, the summarized offline metrics are consumed only infrequently by an internal web app, so S3 could help us save a few bucks as well.
As always, I would really appreciate your thoughts/comments. Feel free to leave them following this post or tweet me @_LeiG.