Hiccups with Java double brace initialization of anonymous objects

Double brace initialization is an often-used technique to overcome Java's limitations on working with anonymous objects. In my case I used a HashMap to pass tracking event parameters to the Flurry Analytics class. The code looked like this:

final HashMap<String, String> eventParms = new HashMap<String, String>() {{
    put("fileSize", Long.toString(new File(item.path).length()));
    put("mimeType", mimeType);
}};
FlurryAgent.logEvent(event, eventParms);

Looks pretty nice, and avoids creating a local variable and then putting parameter pairs into it. However, I started seeing OutOfMemoryExceptions in my Android application. I took an HPROF memory dump and pretty quickly realized that the problem was with the anonymous classes: they were somehow of a huge size. I googled around, found this article: http://vanillajava.blogspot.ru/2011/06/java-secret-double-brace-initialization.html, and it became obvious that this was exactly my problem: the anonymous object retained a reference to the outer class, which in my case had members of a big size – Bitmaps etc. So while Flurry events were kept in the queue, those huge bitmaps were also occupying heap memory.
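The fix is boring but safe: go back to a plain local variable, so no anonymous HashMap subclass (and no implicit reference to the enclosing object) is created. A minimal sketch, reusing the same item, mimeType and event variables as above:

final HashMap<String, String> eventParms = new HashMap<String, String>();
// A plain HashMap instance: nothing here captures the outer class,
// so queued Flurry events no longer pin Bitmaps and friends in memory.
eventParms.put("fileSize", Long.toString(new File(item.path).length()));
eventParms.put("mimeType", mimeType);
FlurryAgent.logEvent(event, eventParms);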

An additional note on HPROF: in order to use the standard Memory Analyzer tool you need to convert the HPROF heap dump from the Android format to the regular Java format. Use android_sdk/tools/hprof-conv.
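The converter simply takes the input and output file names (the names below are placeholders):

hprof-conv dump-android.hprof dump-java.hprof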

HTTP benchmark suites

A brief reference on the HTTP benchmark tools I know:

  • ab – Apache Benchmark, fairly simple
  • tsung – written in Erlang, supports multiple worker nodes
  • JMeter – written in Java, supports multiple worker nodes
  • Yandex-Tank – written in Python. Good documentation on how to configure the Linux kernel: https://github.com/yandex-load/yandex-tank
  • Loader.io – cloud-based service

Note: I have yet to find my ideal tool. Creating a calibrated high load is not a trivial task. A simple hand-written load generator on Node.js gave 10%-20% better performance than ab. The OS also plays a role here: I found it very hard to generate a stable outgoing load from a Linux machine. Occasionally an outgoing TCP connection handshake would freeze for a multiple of the 3 s interval (TCP retry), or the kernel would run out of ephemeral ports because thousands of sockets were sitting in the FIN_WAIT state. I'm speaking about big RPS here, like > 1000 rps per core.
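For illustration, here is a minimal sketch of such a hand-rolled generator. Mine was in Node.js; this Java version (URL, worker count and duration are arbitrary) just shows the idea – a fixed pool of threads hammering one URL for a fixed time and reporting requests per second:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

public class MiniBench {
    public static void main(String[] args) throws Exception {
        final URL url = new URL(args.length > 0 ? args[0] : "http://localhost:8080/");
        final int workers = 32;       // concurrency level
        final long runMillis = 10000; // 10 s measurement window
        final long deadline = System.currentTimeMillis() + runMillis;
        final AtomicLong completed = new AtomicLong();

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    while (System.currentTimeMillis() < deadline) {
                        try {
                            HttpURLConnection c = (HttpURLConnection) url.openConnection();
                            c.getResponseCode();        // block until the response arrives
                            c.getInputStream().close();
                            completed.incrementAndGet();
                        } catch (Exception e) {
                            // a real tool would also count errors and latencies
                        }
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("requests/sec: " + completed.get() * 1000 / runMillis);
    }
}

Even a toy like this makes the kernel-level issues above visible: run it at a few thousand rps and watch netstat fill up with sockets in waiting states.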


API versioning scheme for cloud REST services

Traditional approach

Traditionally it is considered good practice to version a REST API like “/api/v1/xxx”, and when you transition to the next version of your API you introduce “/api/v2/xxx”. This works well for traditional web application deployments where an HTTP reverse proxy / load balancer front end exists. In such a setup the application-level load balancer can also work as an HTTP router, and you can configure where requests go: for example, “v1” requests go to one rack, “v2” to another, and static files to a third.
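With nginx as the front end, for instance, such routing could look roughly like this (the upstream addresses are made up):

# hypothetical nginx front end routing by URL path
upstream v1_backend { server 10.0.1.10:8080; }  # the "v1" rack
upstream v2_backend { server 10.0.2.10:8080; }  # the "v2" rack

server {
    listen 80;
    location /api/v1/ { proxy_pass http://v1_backend; }
    location /api/v2/ { proxy_pass http://v2_backend; }
    location /static/ { root /var/www; }  # static files
}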

Cloud services tend to simplify things, and they often don't have path-based HTTP routing capabilities. For example, Microsoft Azure does load balancing at the TCP level, so any request can hit any VM, and your application needs to be prepared to handle all types of requests on each VM instance. This introduces a lot of issues:

  • No isolation. If you have an exploitable bug in the v1 code, all your instances are vulnerable;
  • Development limitations. You need to support the whole code base, making sure ‘v1’ and ‘v2’ live nicely together. Consider the situation when some dependency is used in both ‘v1’ and ‘v2’ but each requires a specific, different version;
  • Technology lock-in. It is either impossible or very hard to have ‘v1’ in C# and ‘v2’ in Python;
  • Deployment hurdle. You need to update all VMs in your cluster each time;
  • Capacity planning and monitoring. It is hard to understand how many resources are consumed by ‘v1’ calls vs. ‘v2’ calls.

Overall it is very risky and can eventually become absolutely unmanageable.

Cloud-aware approach

To overcome these and probably other difficulties, I suggest separating APIs at the domain name level: e.g. “v1.api.mysite.com”, “v2.api.mysite.com”. It is then fairly easy to set up DNS mappings from the names to the particular cloud clusters handling the requests. This way your clusters for “v1” and “v2” are completely independent in deployment and monitoring; they can even run on different platforms – JS, Python, .NET – or even in different clouds – Amazon and Azure, for example. You can scale them independently: scale the v1 cluster down as its load decreases, and scale the v2 cluster up as its load increases. You can deploy “v2” into multiple locations and enable geo load balancing while still keeping the legacy “v1” setup compact and cheap. You can have ‘v1’ in house and ‘v2’ hosted in the cloud. Overall such a setup is much more flexible and elastic, and easier to manage and maintain.
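In BIND zone-file terms the mapping is just a couple of records (the target host names are purely illustrative):

; illustrative DNS records for per-version API clusters
v1.api.mysite.com.  IN  CNAME  legacy-cluster.somecloud.example.
v2.api.mysite.com.  IN  CNAME  new-cluster.othercloud.example.

Moving ‘v1’ traffic later (e.g. back in house, as mentioned above) is then a DNS change, not a redeployment.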

Of course this technique applies not only to HTTP REST API services but to other kinds of services as well – plain HTTP content, RTMP, etc.

Comparison of BLOB cloud storage services

In the table below I'm trying to compare Amazon AWS S3, Microsoft Azure Storage, Google Cloud Storage and Rackspace Cloud Files.

Hope this is useful.

| Feature | Amazon AWS S3 | MS Azure Storage | Google Cloud Storage | Rackspace Cloud Files |
| --- | --- | --- | --- | --- |
| Max account size | Unlimited? | 200 TB per storage account (by default you have 20 storage accounts per subscription) = 4 PB | Unlimited? | ?? |
| Geo redundancy | + (extra price) | + (extra price) | ? | - |
| SLA | Availability SLA: 99.9%. Designed to provide 99.999999999% durability and 99.99% availability; Reduced Redundancy: designed to provide 99.99% durability and 99.99% availability | Availability SLA: 99.9% | Availability SLA: 99.9% | Availability SLA: 99.9% |
| HTTP | + | + | + | - |
| HTTPS | + | + | + | + |
| Container namespace | Global (name collision is possible) | Storage account | Global | Account |
| Container (bucket) levels | single | single | single | single |
| Root container | - (but with HTTP each bucket can have a DNS name; not possible with HTTPS) | + | - (but each bucket can have a DNS name) | - |
| Cross-origin (CORS) support | + | - | + (experimental) | + |
| Max number of containers (buckets) | 100 | unlimited | Unlimited? | 500,000 |
| Container name length | 3-63; a few limitations, basic form is “my.aws.bucket”; in US region: 255 bytes, [A-Za-z0-9.-_], but DNS-compliant is recommended | 3-63 characters, valid DNS name | 3-63 characters, valid DNS name; if the name is dot-separated, each component <63 characters and 222 total | 256 after URL encoding, no ‘/’ character |
| Max files per container | Unlimited, no performance penalty | Unlimited | Unlimited | Unlimited, but API throttling begins at 50,000 files |
| List files, max results | 1,000 | 5,000 | ? | 10,000 |
| List files returns metadata | - | + (configurable) | - | - |
| File (BLOB/object) name length | 1024 | 1024 | 1024 | 1024 bytes after URL encoding |
| Max file size | 5 TB | 200 GB (block blob), 1 TB (page blob) | 5 TB | Unlimited; multiple files in the same container with the same name prefix can be joined into a single virtual file |
| Max upload size | 5 GB | 64 MB | 5 MB? | 5 GB |
| HTTP chunked upload | - | - | + | + |
| Parallel upload | + | + | ? (resumable uploads of unknown size) | + |
| Upload data checksumming | + | + | + | ? |
| Server-side copy | + (up to 5 GB?) | + (supports external sources) | + | + |
| File metadata | 2 KB, ASCII, no update | 8 KB | + (size limit not specified), no update | 90 pairs or 4096 bytes total |
| # data centers | 8 | 8 | ? | 2 (ORD, DFW) |
| CDN-enabled | + | + | ? | + |
| Authentication | HMAC-SHA1 signing | HMAC-SHA256 signing | OAuth 2.0, Google cookie, 1-hr JWT token | 24-hr token obtained from the Cloud Authentication Service |
| Time-limited shared access | + | + | + | + |
| File auto expiration | + | - | - | + |
| Server-side encryption | + (AES-256) | - | - | - |
| PRICING | | | | |
| Storage per GB a month | $0.055 – $0.095 fully redundant; $0.037 – $0.076 reduced redundancy | $0.055 – $0.095 geo redundant; $0.037 – $0.07 locally redundant | $0.054 – $0.085 full redundancy; $0.042 – $0.063 reduced redundancy | $0.10 |
| Outgoing bandwidth per GB | $0.05 – $0.12 (US); 1 GB/mo is free | $0.05 – $0.12 (US/EU); $0.12 – $0.19 (Asia) | $0.08 – $0.12 (US); $0.15 – $0.21 (Asia-Pacific) | $0.18 |
| Incoming bandwidth | $0 | $0 | $0 | $0 |
| PUT, POST, LIST | $0.005 per 1,000 | $0.01 per 100,000 storage transactions | $0.01 per 1,000 | $0 |
| HEAD, GET, DELETE | $0.004 per 10,000 (no charge for DELETE) | $0.01 per 100,000 storage transactions | $0.01 per 10,000 (no charge for DELETE) | $0 |

Update (Nov 23, 2013): About 30% cost reduction by Google; up to 24% price reduction by Amazon; the same 24% reduction by Microsoft.

UPDATE (Nov 5, 2012): Azure Storage now has higher limits, a way better internal connectivity network, and higher performance targets, see the announcement:

2,000 IOPS per partition, 20,000 IOPS per storage account.

Brief reference on cloud storage

This is a very brief and shallow comparison of the data model and partitioning principles in Amazon S3 and Azure Storage. Please also see my feature comparison post on various storage platforms: http://timanovsky.wordpress.com/2012/10/26/comparison-of-cloud-storage-services/
Amazon S3
Getting the most out of Amazon S3: http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Their storage directory is lexicographically sorted, and the leftmost characters are used as the partition key. It is not stated explicitly, but it looks like you need to have your prefix tree balanced for partition balancing to work optimally. I.e. if you prefix with 0-9A-F as suggested in the article, the number of requests going to all 16 prefixes must be roughly the same (see the sketch at the end of this post). Underneath, this might mean that the key space is always partitioned evenly – split into a fixed number of equal key ranges. That is totally my speculation, but otherwise I cannot explain why such prefixes would matter.

Microsoft Azure Storage
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
Having glanced over the MS docs, I'm under the impression that Azure storage can split key ranges independently, based on load and size.
Update: The following quote shows that Azure is similar to S3, and I was wrong:

A downside of range partitioning is scaling out access to sequential access patterns. For example, if a customer is writing all of their data to the very end of a table's key range (e.g., insert key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then key 2011-06-30:12:00:10), all of the writes go to the very last RangePartition in the customer's table. This pattern does not take advantage of the partitioning and load balancing our system provides. In contrast, if the customer distributes their writes across a large number of PartitionNames, the system can quickly split the table into multiple RangePartitions and spread them across different servers to allow performance to scale linearly with load (as shown in Figure 6). To address this sequential access pattern for RangePartitions, a customer can always use hashing or bucketing for the PartitionName, which avoids the above sequential access pattern issue.
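In both systems the remedy is the same trick: derive a short hash-based prefix (an S3 key prefix, or a bucketed Azure PartitionName) from the natural key, so that sequential keys spread across the key space. A minimal sketch (the helper and key format are made up):

import java.security.MessageDigest;

public class KeyPrefixer {
    // Prepend one hex character (16 buckets) derived from a hash of the
    // natural key, so lexicographically adjacent keys land in different
    // partitions / key ranges.
    static String prefixed(String naturalKey) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(naturalKey.getBytes("UTF-8"));
        return "0123456789abcdef".charAt(d[0] & 0x0f) + "/" + naturalKey;
    }

    public static void main(String[] args) throws Exception {
        // Sequential timestamp keys now spread over 16 prefixes
        // instead of all hitting the last key range.
        System.out.println(prefixed("2011-06-30:12:00:00"));
        System.out.println(prefixed("2011-06-30:12:00:02"));
    }
}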

Enabling ARCHIVE storage engine in IUS MySQL 5.1

IUS is a great repo which allows seamless integration of MySQL 5.1 and Python 2.6 into CentOS systems (which ship with versions 5.0 and 2.4). The only issue is that if you run ‘SHOW ENGINES’ it will only show you the MRG_MYISAM, CSV, FEDERATED, InnoDB, MEMORY and MyISAM engines. I wanted to experiment with the ARCHIVE storage engine for storing raw input BI events, which are basically JSON. The ARCHIVE engine seems to be a good fit for this – it supports compression (of our highly redundant data) and auto increment, which is necessary to implement queue-like processing; how that goes should be a topic for a separate post.

So I was puzzled when I didn't see the archive storage engine in MySQL by IUS. Initial googling only suggested that the ARCHIVE storage engine is enabled at compile time, which was pretty sad, and I couldn't understand why on Earth they had omitted it. Later I found this post suggesting that the ARCHIVE storage engine can be installed as a plugin, and that I needed to install separate YUM packages. Searching for those packages in the current repo gave no results. Finally I found this bug report revealing that a few plugins are actually installed as a part of the mysql51-server package and you only need to enable them! So I went ahead and ran:

mysql> INSTALL PLUGIN archive SONAME 'ha_archive.so';
Query OK, 0 rows affected (0.02 sec)

And then

mysql> show engines \G

*************************** 7. row ***************************
Engine: ARCHIVE
Support: YES
Comment: Archive storage engine
Transactions: NO
XA: NO
Savepoints: NO
7 rows in set (0.00 sec)

Voila!

The following plugins are installed into the OS but not into MySQL:


ha_archive.so - ARCHIVE
ha_blackhole.so - BLACKHOLE
ha_example.so - EXAMPLE
ha_innodb_plugin.so - InnoDB Plugin

Enjoy!

Update: An experiment on 24M rows shows an 11X compression ratio! From 437 bytes/row in InnoDB (no indexes) down to 38 bytes per row.
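For reference, the kind of table I'm experimenting with looks roughly like this (names are illustrative; remember that ARCHIVE only supports INSERT and SELECT, and only the AUTO_INCREMENT column can be indexed):

mysql> CREATE TABLE bi_events (
    ->   id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ->   body BLOB NOT NULL,  -- raw JSON event
    ->   KEY (id)             -- enables queue-style consumption: WHERE id > ?
    -> ) ENGINE=ARCHIVE;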