
Brief reference on cloud storage

This is a very brief and shallow comparison of the data model and partitioning principles in Amazon S3 and Azure Storage. Please also see my feature comparison post covering various storage platforms: https://timanovsky.wordpress.com/2012/10/26/comparison-of-cloud-storage-services/
Amazon S3
Getting the most out of Amazon S3: http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Their storage directory is lexicographically sorted, and the leftmost characters are used as the partition key. It is not stated explicitly, but it looks like your prefix tree needs to be balanced for partition balancing to work optimally. That is, if you prefix keys with 0-9A-F as suggested in the article, the number of requests going to each of the 16 prefixes must be roughly the same. Underneath, this might mean that the key space is always partitioned evenly, i.e. split into a fixed number of equal key ranges. That is purely my speculation, but otherwise I cannot explain why such prefixes would matter.
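
To make the prefix idea concrete, here is a minimal sketch of putting one hex character in front of each key and checking how evenly the keys fall across the 16 prefixes. The helper name `prefixed_key`, the `/` delimiter, the sample key names, and the choice of SHA-256 are my assumptions for illustration, not something the AWS article prescribes.

```python
import hashlib
from collections import Counter


def prefixed_key(key: str) -> str:
    """Prepend one hex character (0-9A-F) derived from a hash of the key.

    Hypothetical helper: the delimiter and hash function are illustrative.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[0].upper()}/{key}"


# Sequential names that would otherwise all share the same leftmost characters.
keys = [f"2012-03-21/log-{i:05d}.txt" for i in range(16000)]

# Count how many keys land under each one-character prefix.
per_prefix = Counter(prefixed_key(k)[0] for k in keys)

for prefix in sorted(per_prefix):
    print(prefix, per_prefix[prefix])
```

Because the prefix comes from a hash, the counts per prefix come out roughly equal, which is the balance the paragraph above speculates is needed. The cost is that listing keys in their natural order now means listing under every prefix.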

Microsoft Azure Storage
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
Having glanced over the Microsoft docs, I was under the impression that Azure Storage can split key ranges independently, based on load and size.
Update: the following quote shows that Azure is similar to S3, and that I was wrong:

A downside of range partitioning is scaling out access to
sequential access patterns. For example, if a customer is writing
all of their data to the very end of a table’s key range (e.g., insert
key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then
key 2011-06:30-12:00:10), all of the writes go to the very last
RangePartition in the customer’s table. This pattern does not take
advantage of the partitioning and load balancing our system
provides. In contrast, if the customer distributes their writes
across a large number of PartitionNames, the system can quickly
split the table into multiple RangePartitions and spread them
across different servers to allow performance to scale linearly
with load (as shown in Figure 6). To address this sequential
access pattern for RangePartitions, a customer can always use
hashing or bucketing for the PartitionName, which avoids the
above sequential access pattern issue.
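
The bucketing mentioned at the end of the quote can be sketched as follows. This is only an illustration under my own assumptions: the function names, the bucket count of 8, the `NN_` prefix format, and the use of SHA-256 are hypothetical, and nothing here calls the Azure Storage API.

```python
import hashlib
from datetime import datetime, timedelta

BUCKETS = 8  # hypothetical bucket count, chosen for illustration


def bucketed_partition_name(partition_name: str, buckets: int = BUCKETS) -> str:
    """Prefix a PartitionName with a stable bucket number derived from its hash."""
    digest = hashlib.sha256(partition_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{bucket:02d}_{partition_name}"


def scan_prefixes(buckets: int = BUCKETS) -> list[str]:
    """Prefixes a reader has to query to cover every bucket."""
    return [f"{b:02d}_" for b in range(buckets)]


# Sequential timestamp keys in the style of the quote: one every two seconds.
start = datetime(2011, 6, 30, 12, 0, 0)
names = [
    (start + timedelta(seconds=2 * i)).strftime("%Y-%m-%d:%H:%M:%S")
    for i in range(1000)
]

bucketed = [bucketed_partition_name(n) for n in names]

# Without bucketing every new key sorts after the previous one, so all writes
# hit the end of the key range. With bucketing they spread over several ranges.
used_buckets = sorted({name[:2] for name in bucketed})
print(len(used_buckets), "buckets in use")
print(bucketed[:3])
```

The trade-off is on the read side: keys stay sorted by time within each bucket, but reading a time range means issuing one query per prefix returned by `scan_prefixes` and merging the results.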