Category Archives: Programming

Cloud storage features wish list

This is a follow-up to my previous article about DynamoDB shortcomings. Here I’m talking about the solutions I’m familiar with: AWS DynamoDB, MS Azure Storage Tables, and Google AppEngine Datastore. As far as I know there are no other solutions of comparable scale / maturity out there.

Transparent support for data compression

All these storages impose some limit on item size or attribute size. So when storing large objects like text or JSON it makes sense to compress them as one way to mitigate the limit. This looks like a totally user-land problem; however, life would be much easier if one could specify a content-encoding for each attribute. This way one could have, say, a data attribute and serialize it with or without compression based on specific conditions or software version. Then one can dream of having this natively supported by the official client libraries: your cloud is only as good as your client library. Just let the user set an optional is_compressible flag for each attribute, let the library handle the rest, and do transparent decompression on read based on the content-encoding. And as a third level of support, the web management console should understand and support decompression too.
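Today this lives entirely in application code. A minimal sketch of the kind of helper we end up writing ourselves (the class and the "gzip" marker are hypothetical names; the point is that the client library could do this for us):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical user-land helper: compress an attribute value and remember
// its "content-encoding" so the reader knows whether to decompress.
public final class AttributeCodec {

    /** Returns gzipped bytes of the value; the caller stores "gzip" as the encoding marker. */
    public static byte[] compress(String value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(value.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Decompresses only when the stored encoding marker says "gzip". */
    public static String decompress(byte[] stored, String contentEncoding) throws IOException {
        if (!"gzip".equals(contentEncoding)) {
            return new String(stored, StandardCharsets.UTF_8); // stored as plain text
        }
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}

With native support, the is_compressible flag on the attribute would replace all of this, and the console would show the decoded value.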

Transparent support for “overflow” storage

This may be familiar to those who know how Postgres et al. store data. While a value is short it is stored inline in the row; once it becomes too large it “overflows” into a separate table. I understand the technical reasons for the 400KB / 1MB limits and the reasoning that “nobody will ever need more” ;)) However, occasionally people do need more. In our use case 99.9% or 99.99% of data fits into the limit (with good headroom), but then there is that 0.01–0.1% which does not, and it is a headache. Please, do add transparent support for large objects. Make it slow, make it expensive, but do support large objects.
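Until then, the workaround looks roughly like the sketch below: split anything over the limit into numbered chunk records and keep a head record that points at them. All names and the exact headroom are hypothetical, and a real implementation also has to handle deletes, partial failures, and index updates, which is exactly the headache I’m complaining about:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical chunking workaround for a ~400 KB item limit: values that do not
// fit are split into numbered "overflow" records stored under derived keys.
public final class OverflowWriter {

    private static final int LIMIT_BYTES = 350 * 1024; // keep headroom below the 400 KB cap

    /** Returns key -> bytes records to store; a single entry when the value fits the limit. */
    public static Map<String, byte[]> plan(String key, String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> records = new LinkedHashMap<>();
        if (data.length <= LIMIT_BYTES) {
            records.put(key, data);
            return records;
        }
        List<String> chunkKeys = new ArrayList<>();
        for (int offset = 0, part = 0; offset < data.length; offset += LIMIT_BYTES, part++) {
            int end = Math.min(offset + LIMIT_BYTES, data.length);
            String chunkKey = key + "#chunk-" + part;
            chunkKeys.add(chunkKey);
            records.put(chunkKey, Arrays.copyOfRange(data, offset, end));
        }
        // The head record only lists the chunk keys, so readers know to reassemble.
        records.put(key, String.join(",", chunkKeys).getBytes(StandardCharsets.UTF_8));
        return records;
    }
}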

Transparent support for encryption

Some people/businesses are really nervous (or conservative, or cautious in a good sense) about storing sensitive data in the cloud. DynamoDB does not even provide an encryption-at-rest guarantee. But many would want not only at-rest but end-to-end encryption too. Similar to the compression support described above, it would be great to be able to tag sensitive attributes as encrypted and provide an AES key for the encode and decode operations. The storage may also support storing a key thumbprint together with the encrypted attribute to simplify decryption and key rotation.
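A rough sketch of what client-side attribute encryption with a key thumbprint could look like. This is not any provider’s actual API, just standard javax.crypto, and the thumbprint format is something I made up for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical client-side attribute encryption: AES-GCM ciphertext plus a key
// thumbprint, so the reader can pick the right key during rotation.
public final class AttributeEncryptor {

    public static final class Sealed {
        public final byte[] iv;            // random nonce, stored with the attribute
        public final byte[] ciphertext;    // encrypted attribute value
        public final String keyThumbprint; // identifies which key encrypted it
        Sealed(byte[] iv, byte[] ciphertext, String keyThumbprint) {
            this.iv = iv; this.ciphertext = ciphertext; this.keyThumbprint = keyThumbprint;
        }
    }

    public static Sealed seal(SecretKey key, String plaintext) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(key.getEncoded());
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i]));
        return new Sealed(iv, ct, hex.toString());
    }

    public static String open(SecretKey key, Sealed sealed) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, sealed.iv));
        return new String(cipher.doFinal(sealed.ciphertext), StandardCharsets.UTF_8);
    }
}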

Massive cloud object storage, what could go wrong?

We at workato.com have been using DynamoDB for two years; here are some lessons learned. At that time there was no other option: MS Azure Tables had fewer features, and Google Datastore did not have a REST API yet. Now we are migrating to Google Datastore, so expect a similar article from me in two years 🙂

So, without a long foreword, my complaints about DynamoDB:

  • Small item size limit. When we started it was a mere 64KB; later it was increased to 400KB, still too small for modern requirements. We had to implement our own chunking scheme, which makes it very hard to operate on the data and add new indexes.
  • Small number of secondary indexes you can have (5 local + 5 global).
  • No multi-column indexes, which means you need to design your schema in advance and create artificial columns merging several attributes to emulate such a multi-column index. This is just a no-go. Your requirements evolve, your data query needs evolve; if you cannot add an index without a massive (unfeasible) data migration, you are screwed.
  • No support for batch delete; you cannot delete by index. You first need to query, and then delete in batches. (This is true for most cloud storages though.)
  • Poor elasticity toward traffic spikes. DynamoDB does support temporary traffic bursts, but you still need to be below the provisioned throughput as a 5-minute window average. (And I believe this was silently changed last year, getting us into trouble. I cannot remember the exact details now, but it was either a 15-minute to 5-minute change, or even 5 minutes to 1 minute.) Roughly, you need to provision to the 99.5th percentile of your load.
  • As a developer you don’t see the costs unless you have access to the whole AWS account billing. All the numbers are there, but you need to go through all the tables and indexes, sum up provisioned throughput and consumed volume, go to the pricing page and calculate the number yourself. Why so much headache? The number does not need to be as precise as the billing report, but an approximate figure (split by table, plus a total) would be very useful.
  • No support for multiple namespaces or databases. If you need to run multiple deployments (production, staging, etc.), you either need to implement prefixing of table names plus access rules based on table names, or set up separate AWS accounts with separate billing, replicating admin access permissions, etc.

Hot partition issue:

  • Each partition gets only a fraction of the total provisioned throughput. If you have 10 partitions, your throughput per partition is 1/10th of the total.
  • And you don’t know how many partitions you have. I repeat, just internalize it: you don’t know how many partitions you currently have in your system. So you don’t know your per-partition limit. There is a formula in the documentation (a rough sketch of it follows this list), but it does not work in practice. When I was troubleshooting, that formula suggested I should have 2 or 3 partitions; talking to support revealed we had a dozen. And (see below) the number of partitions is defined not by current settings, but by historical ones.
  • No visibility into partition load. AWS support has internal tools which help visualize the partition heatmap, but you don’t have access to them, unless support sends you a screenshot.
  • Poor monitoring / manifestation of throughput errors. You see throttling errors telling you that you are over your provisioned throughput limits. You go to AWS monitoring and see that consumed throughput is 1/10th or 1/20th of provisioned. You are lost, you are puzzled, you panic, you cry. There is zero indication that it is a hot partition problem.
  • No scale-down. Once you have increased your throughput, say to accommodate an ETL job, the number of partitions increases accordingly. And it never shrinks. This is extremely counter-intuitive: you increase your throughput (per table), but it decreases (per partition). You pay more, but receive less.
  • As time goes on and your table grows, so does the number of partitions, and per-partition throughput shrinks. So yesterday’s provisioned throughput was enough; tomorrow it may not be.
  • Even if you shard well on your primary key, it does not prevent the hot partition issue completely. One write unit covers only 1KB, so write a 400KB item and you consume 400 write units instantly. Dynamo may or may not accommodate this, depending on the other traffic.
  • The hot partition problem is (almost) unavoidable. Even if you shard your data perfectly using a hash-like key (and how do you implement chunking then?) and into small (1KB) items, chances are you will need some secondary index to query data on, be it user_id or updated_at or something else. Global secondary indexes are eventually consistent and, as far as I understand, are updated from a queue of requests, so they are a bit more elastic. But only marginally. Eventually (sorry ;) you will run into the same hot partition problem, now with the index table, which is even harder to diagnose.
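For reference, here is a back-of-the-envelope version of that documented partition estimate as I recall it; treat the constants as approximate, and remember that, as noted above, reality did not match it for us:

// Rough partition estimate as I recall it from the docs of that era;
// the constants are approximate and the real count can be much higher.
public final class PartitionEstimate {

    public static int estimatePartitions(long readUnits, long writeUnits, double tableSizeGb) {
        int byThroughput = (int) Math.ceil(readUnits / 3000.0 + writeUnits / 1000.0);
        int bySize = (int) Math.ceil(tableSizeGb / 10.0);
        return Math.max(1, Math.max(byThroughput, bySize));
    }

    public static void main(String[] args) {
        long rcu = 1000, wcu = 500;
        int partitions = estimatePartitions(rcu, wcu, 20); // 20 GB table
        // Provisioned throughput is split across partitions, so each one gets only a slice.
        System.out.printf("~%d partitions, ~%d WCU per partition%n", partitions, wcu / partitions);
    }
}

The point stands regardless of the exact constants: whatever you provision per table gets divided by a partition count you cannot see.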

Sidenote: I think the MS Azure Tables approach is more straightforward: limited throughput per whole table, and limited throughput per partition. That’s it. As your table grows, your per-partition throughput remains constant.

As a summary: DynamoDB may be good as a low-level key-value store if you understand all the complications and can design your access patterns around them. It does not work as a generic object storage.

 

Cross-posted to medium.com/@timanovsky

All about API throttling. Indicating overload and quota excess.

Foreword

In this article I would like to describe how throttling should be done in HTTP APIs designed to be used by third parties, which may be different from the approach when the API is only used by internal clients (web or mobile). More specifically, the considerations presented here are particularly relevant for applications whose APIs are used by some sort of automation or integration app like workato.com. Again, if your API is designed to be used by interactive applications (your own or third-party), the tradeoffs and priorities are different.

TLDR

Return HTTP 500 on general server overload.

Return HTTP 429 with the Retry-After header set when API usage exceeds a predefined quota.

API throttling

API throttling is used to protect a system from overload by an excess of otherwise legitimate requests, or to impose a usage quota. When the server decides to reject an incoming request it responds with a special error code and expects the API client to stop sending requests and back off for some time. It does not protect you from harmful DoS / DDoS attackers, who don’t respect the server’s response.

Possible throttling policies

When you decide to throttle incoming requests you can do it at different scope granularities. You can have a single global limit for the whole API across all users. Or you can account usage per particular user account. You can set different limits for authenticated users and guests, for read and write operations, etc. In any case, do provide clear documentation on the policy you implement. Document the limits and how a user can change them, e.g. by using a higher-tier paid plan, or by reaching out to support and having the limits bumped up manually. Below are some popular policies. Here we consider all API operations to be equal; you may however want to count significantly different operations separately. Good examples here are the Google and Amazon AWS cloud APIs, which provide different quotas for read and write operations.

1. Application-wide limit

In this policy the application as a whole (the API provider) has a single limit on the requests per second it can handle. It can be hard, like 1000 req/sec, or soft (start bouncing requests off as soon as the server gets overloaded), or both. In general it is good practice to follow the “fail fast” pattern and have a hard limit on the API rate your servers are able to cope with, rather than trying to execute each request with growing latency and seeing degraded service performance and stability. This method is easy to implement; however, throttling affects all users equally.
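A minimal single-node sketch of such a hard cap; the class name and the fixed one-second window are mine, chosen for brevity:

// Minimal global hard limit: reject ("fail fast") once the whole API has served
// more than maxPerSecond requests in the current one-second window.
public final class GlobalLimiter {

    private final int maxPerSecond;
    private long windowStart = System.currentTimeMillis();
    private int served = 0;

    public GlobalLimiter(int maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    /** Returns false when the caller should be bounced with an error (e.g. HTTP 500). */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {
            windowStart = now;  // start a new one-second window
            served = 0;
        }
        return ++served <= maxPerSecond;
    }
}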

2. Limit per application (API key) or per traffic source (IP)

This imposes a limit on a particular application which uses your API. This sounds like a natural thing to do if you give access to your API. However, the problem with it is that end users can do nothing about it, and in most cases have no visibility into what is happening. A user has application A and application B which are linked but not working because of some limit the user has no idea about. If you impose this limit, set it high enough and provide a channel for application developers to request a quota increase. Formally we can say that the scope of the limit applied is application_id.

3. Limit per user on whose behalf API is being called

In this case the limit is per user account (the scope is user_id). Now many people can use the integration between A and B without interference. The drawback is that if a user has two integrations of application A with, say, B and C, then applications B and C will now interfere. And when throttling kicks in, B and C may not receive an equal share of the API “bandwidth”, as one application may start retrying more frequently than the other and thus preempt it.


3a. Limit per authentication token (session), optionally limiting the number of tokens per account

To resolve the previous problem you may want to set a limit not per user account, but per authentication token, which will presumably be different for apps B and C. You can achieve a similar result if you account per (user_id, api_key) tuple. This approach provides good isolation of a particular application pairing for a particular user. A careless approach would be to say it is the users’ responsibility to manage access among their integrated apps; in reality users can do nothing about it.

I would recommend a combination of all three quotas at different levels. Let’s look at an example of what I mean (a code sketch follows the list):

  • Per API key (per integrated application): 10,000 req/sec. We want to empower integrated apps and let them do real work. We may change this limit based on support requests, partner agreements, etc.
  • Per user account: 100 req/sec. All integrated applications belonging to the same user can do 100 requests per second in total.
  • Per session: 50 req/sec. Still, we want to prevent any single application from consuming all of a user’s quota.
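A sketch of how these layers could be combined, reusing the GlobalLimiter sketch from section 1; the scope keys and limits are just the example values above:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Layered quota check: a request must pass the per-API-key, per-user and
// per-session limits; the numbers are the example values from the list above.
public final class LayeredQuota {

    private final Map<String, GlobalLimiter> perApiKey = new ConcurrentHashMap<>();
    private final Map<String, GlobalLimiter> perUser = new ConcurrentHashMap<>();
    private final Map<String, GlobalLimiter> perSession = new ConcurrentHashMap<>();

    public boolean allow(String apiKey, String userId, String sessionId) {
        return limiter(perApiKey, apiKey, 10_000).tryAcquire()
            && limiter(perUser, userId, 100).tryAcquire()
            && limiter(perSession, sessionId, 50).tryAcquire();
    }

    private GlobalLimiter limiter(Map<String, GlobalLimiter> scope, String key, int limit) {
        return scope.computeIfAbsent(key, k -> new GlobalLimiter(limit));
    }
}

Note that with short-circuit checks an earlier scope gets charged even when a later one rejects the request; a production implementation would check all scopes first and charge afterwards, or refund on rejection.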

4. Subdomains or organizations (like workato.zendesk.com)

All the logic above holds true; you want to limit the number of operations in a particular scope. If you only support integrations at the organization level (e.g. individual org members cannot set up their own links), then you will probably have the following quota levels:

  • application_id (per API key)
  • organization_id (per organization as whole)
  • (application_id, organization_id) tuple

If, additionally, any org member has access to the API and can set up an integration, then you may also have:

  • (organization_id, user_id) (per user in organization)
  • (application_id, organization_id, user_id) (per user in organization using a specific integration)

Rolling limit vs static intervals

There are two approaches to counting API usage and detecting when to activate throttling. One is to have a rolling window, say a maximum of 1000 requests per hour: you count how many requests you have had in the last 60 minutes, and if it is more than 1000 you return the appropriate error code. This is a bit trickier to implement, but delivers better recovery time. The easier approach is to start counting at the beginning of the hour and reset the counter at the end. This is much simpler to implement, but then the user needs to wait until the end of the hour even if the overuse was negligible.
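A minimal in-memory, single-node sketch of both counting approaches; a real deployment would keep these counters in shared storage such as Redis:

import java.util.ArrayDeque;
import java.util.Deque;

// Two ways to count usage for "1000 requests per hour": a rolling 60-minute
// window (smoother recovery) versus a counter reset at the top of each clock hour.
public final class HourlyCounters {

    private static final long HOUR_MS = 60 * 60 * 1000L;

    // Rolling window state: timestamps of requests seen in the last hour.
    private final Deque<Long> recent = new ArrayDeque<>();

    // Static interval state: a counter that resets at the top of each clock hour.
    private long currentHour = -1;
    private int count = 0;

    public synchronized boolean allowRolling(long nowMillis, int limit) {
        while (!recent.isEmpty() && nowMillis - recent.peekFirst() >= HOUR_MS) {
            recent.pollFirst();                 // drop requests older than one hour
        }
        if (recent.size() >= limit) return false;
        recent.addLast(nowMillis);
        return true;
    }

    public synchronized boolean allowFixed(long nowMillis, int limit) {
        long hour = nowMillis / HOUR_MS;
        if (hour != currentHour) {
            currentHour = hour;                 // new hour, start counting from zero
            count = 0;
        }
        return ++count <= limit;
    }
}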

Return codes

Use 429 with Retry-After. 500 is OK if the API’s global threshold is reached, as a particular API user can do nothing about it other than retry with an increasing back-off. In fact, it may be even easier for API consumers to have different error codes for different expected behavior: 429 with Retry-After for managed back-off, and 500 for generic back-off on global errors. This is an example of how you should treat API limits in general: clearly indicate the scope of the problem, especially if the expected handling is different. In practice you are unlikely to impose a more complex combination than policy 1 plus any one of 2–4.
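To make the wire format concrete, here is a small sketch using the JDK’s built-in com.sun.net.httpserver; the helper names are mine, and any real framework will have its own response API:

import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

// How the recommended responses look on the wire: 429 + Retry-After when a
// per-scope quota is exceeded, 500 when the API as a whole is overloaded.
public final class ThrottleResponses {

    public static void rejectQuotaExceeded(HttpExchange exchange, long retryAfterSeconds) throws IOException {
        exchange.getResponseHeaders().set("Retry-After", Long.toString(retryAfterSeconds));
        exchange.sendResponseHeaders(429, -1); // -1 = no response body
        exchange.close();
    }

    public static void rejectGlobalOverload(HttpExchange exchange) throws IOException {
        exchange.sendResponseHeaders(500, -1); // client should back off exponentially
        exchange.close();
    }
}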

Client behavior

Let’s chat about clients a little bit. If the client does not obey the throttling, all of the above does not make much sense. Client behavior is relatively simple: back off for the time specified by the server if 429 with Retry-After is received (or a functionally equivalent response as documented by the API provider), or otherwise back off exponentially (or with another algorithm with an increasing retry interval), as it should for any retry-able error code (500, or no Retry-After header).
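A sketch of such a client using Java 11’s java.net.http; it assumes Retry-After is given in seconds (it can also be an HTTP date), and the attempt counts and delays are arbitrary:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Client-side retry sketch: honor Retry-After on 429, fall back to exponential
// back-off on 500 or when the header is missing. Purely illustrative.
public final class BackoffClient {

    private final HttpClient client = HttpClient.newHttpClient();

    public HttpResponse<String> getWithRetries(String url, int maxAttempts) throws Exception {
        long backoffMillis = 500;
        for (int attempt = 1; ; attempt++) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(30))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();
            if (status != 429 && status != 500) return response;   // success or non-retryable error
            if (attempt >= maxAttempts) return response;           // give up, surface the error
            long sleepMillis = response.headers()
                    .firstValue("Retry-After")                     // assumes delta-seconds form
                    .map(s -> Long.parseLong(s.trim()) * 1000)     // server-managed back-off
                    .orElse(backoffMillis);                        // generic exponential back-off
            Thread.sleep(sleepMillis);
            backoffMillis *= 2;
        }
    }
}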

API examples

Dropbox

Dropbox limits usage per unique (application_id, user_id) tuple. It returns 503 for OAuth 1.0 requests and 429 for OAuth 2.0. See https://www.dropbox.com/developers-v1/core/docs and https://www.dropbox.com/developers-v1/core/bestpractices

Zendesk

Zendesk accounts per organization_id and uses the 429 response code with Retry-After set. Refer to https://developer.zendesk.com/rest_api/docs/core/introduction#rate-limits

Google calendar

Google has multiple accounted usage scopes. It uses the 403 HTTP result code with more detailed information about the scope inside a JSON-encoded HTTP body. It gives no indication of when to retry and suggests using exponential backoff. See https://developers.google.com/google-apps/calendar/v3/errors#403_daily_limit_exceeded

Hiccups with Java double brace initialization of anonymous object

Double brace initialization is an often-used technique to overcome Java’s limitations on working with anonymous objects. In my case I used a HashMap to pass tracking event parameters to the Flurry Analytics class. The code was like this:

final HashMap<String, String> eventParms = new HashMap<String, String>() {{ // anonymous HashMap subclass
    put("fileSize", Long.toString(new File(item.path).length()));
    put("mimeType", mimeType);
}};
FlurryAgent.logEvent(event, eventParms);

Looks pretty nice, and avoids creating a local variable and then putting the parameter pairs into it. However, I started seeing OutOfMemoryErrors in my Android application. I took an HPROF memory dump and pretty quickly realized that the problem was with the anonymous classes: they somehow were of a huge size. I googled around and found this article: http://vanillajava.blogspot.ru/2011/06/java-secret-double-brace-initialization.html, and it became obvious that it was exactly my problem: the anonymous object retains a reference to the outer class, which in my case had members of a big size (Bitmaps etc.). So while the Flurry events were kept in the queue, those huge bitmaps were also occupying heap memory.
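The boring fix is to build a plain local map instead of an anonymous HashMap subclass, so nothing keeps a hidden reference to the enclosing object (and its Bitmaps) while the event sits in Flurry’s queue. A sketch along these lines, reusing the identifiers from the snippet above:

// Plain local map: no anonymous subclass, no captured outer instance.
final Map<String, String> eventParms = new HashMap<String, String>();
eventParms.put("fileSize", Long.toString(new File(item.path).length()));
eventParms.put("mimeType", mimeType);
FlurryAgent.logEvent(event, eventParms);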

Additional note on HPROF: in order to use the standard Memory Analyzer tool you need to convert the HPROF heap dump from the Android format to the regular Java format. Use android_sdk/tools/hprof-conv.

Brief reference on cloud storage

This is a very brief and shallow comparison of the data model and partitioning principles in Amazon S3 and Azure Storage. Please also see my feature comparison post of various storage platforms: https://timanovsky.wordpress.com/2012/10/26/comparison-of-cloud-storage-services/
Amazon S3
Getting most out of Amazon S3: http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Their storage directory is lexicographically sorted, and the leftmost characters are used as the partition key. It is not stated explicitly, but it looks like you need to keep your prefix tree balanced for partition balancing to work optimally, i.e. if you prefix with 0-9A-F as suggested in the article, the number of requests going to each of the 16 prefixes must be roughly the same. Underneath, this might mean that the key space is always partitioned evenly: split into a fixed number of equal key ranges. That is totally my speculation, but otherwise I cannot explain why such prefixes would matter.
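A sketch of the prefixing trick the linked article suggests: derive a short hex prefix from a hash of the natural key so writes spread evenly over the lexicographic key space. The key layout here is hypothetical:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hash-derived prefix so object keys spread evenly across the sorted key space.
public final class PrefixedKeys {

    public static String prefixed(String naturalKey) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
        String prefix = String.format("%x", digest[0] & 0x0f); // one hex char: 16 buckets
        return prefix + "/" + naturalKey;                      // e.g. "3/photos/2012/10/cat.jpg"
    }
}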

Microsoft Azure Storage
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
Having glanced over the MS docs I am under the impression that Azure storage can split key ranges independently, based on load and size.
Update: the following quote shows that Azure is similar to S3, and I was wrong:

A downside of range partitioning is scaling out access to sequential access patterns. For example, if a customer is writing all of their data to the very end of a table’s key range (e.g., insert key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then key 2011-06-30:12:00:10), all of the writes go to the very last RangePartition in the customer’s table. This pattern does not take advantage of the partitioning and load balancing our system provides. In contrast, if the customer distributes their writes across a large number of PartitionNames, the system can quickly split the table into multiple RangePartitions and spread them across different servers to allow performance to scale linearly with load (as shown in Figure 6). To address this sequential access pattern for RangePartitions, a customer can always use hashing or bucketing for the PartitionName, which avoids the above sequential access pattern issue.

Fast IP->location resolution in SQL

Abstract
Many web services nowadays rely on users’ geolocation information. It can be required purely for statistics purposes, or to add value to the provided service itself. In most cases the only possible way to detect the user’s location is to use the IP address of the originating request. IP addresses are not hierarchical in a geographical sense: a large block of IP addresses is reserved for an ISP, and then broken down for smaller companies or individuals. There is not much order in the system. So a fast lookup database is required in order to be able to make millions of lookups.

Geo location data
Sure enough, companies exist which provide such IP-based geolocation data. Maxmind is one of them. Fortunately, they provide a free version of their database under some legal constraints. There are other providers too; I recently came across IpInfoDB. So far so good; now we need a system to make use of the data.

Physical database design
The data consists of (startIp, endIp, country) tuples. If an IP address is in the range, it originated from that country. There are blind unassigned ranges, there are private ranges, and, as said above, multiple (very many) ranges map to one country. In my table there are >100K distinct IP ranges.

Physical design in MySQL:
CREATE TABLE LookupIpToCountry (
startIp int(10) unsigned NOT NULL,
endIp int(10) unsigned NOT NULL,
countryId tinyint(3) unsigned NOT NULL,
PRIMARY KEY (startIp),
KEY fk_countryId (countryId),
CONSTRAINT LookupIpToCountry_ibfk_1 FOREIGN KEY (countryId) REFERENCES Countries (id)
) ENGINE=InnoDB;

Another database or engine can be used, provided that startIp is the clustering index, i.e. the data in the table is physically sorted by startIp in ascending order.

Fast lookup query
mysql> select * from LookupIpToCountry where startIp<=INET_ATON('74.125.19.99') and endIp>=INET_ATON('74.125.19.99') order by startIp desc limit 1;
+------------+------------+-----------+
| startIp    | endIp      | countryId |
+------------+------------+-----------+
| 1249693272 | 1249773023 |       230 |
+------------+------------+-----------+
1 row in set (0.00 sec)

The idea here is to find the first range that the IP of interest could fall into. SELECT ... WHERE startIp <= X ... ORDER BY startIp DESC navigates the B-tree index to the last row satisfying the startIp criterion and sets up a scan iterator in descending order (because it can only validate the second condition, endIp >= X, by scanning). But the iteration ends immediately because of LIMIT 1 if the second condition is satisfied. So effectively the whole query is reduced to a B-tree lookup of one value. Of course, in case of a miss (the IP address does not fall into any of the known ranges), there is a performance hit:

mysql> select * from LookupIpToCountry where startIp<=INET_ATON('255.255.255.255') and endIp>=INET_ATON('255.255.255.255') order by startIp desc limit 1;
Empty set (0.39 sec)

In order to fix this you can simplify the query to select * from LookupIpToCountry where startIp<=INET_ATON('255.255.255.255') order by startIp desc limit 1; and check the second condition, endIp>=INET_ATON('255.255.255.255'), at the application level, if you think it is worth it.
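For illustration, this is roughly how the simplified query plus the application-level endIp check could look with plain JDBC; table and column names are from the schema above, and the connection is assumed to be open:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Simplified lookup: B-tree seek on startIp only, with the endIp check done in
// application code so a miss does not trigger a long backward scan.
public final class IpCountryLookup {

    /** Returns the countryId for the IP (as an unsigned 32-bit value), or -1 when no range matches. */
    public static int lookup(Connection conn, long ip) throws SQLException {
        String sql = "SELECT endIp, countryId FROM LookupIpToCountry "
                   + "WHERE startIp <= ? ORDER BY startIp DESC LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, ip);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && rs.getLong("endIp") >= ip) {
                    return rs.getInt("countryId");   // IP falls inside the range
                }
                return -1;                           // miss: gap between known ranges
            }
        }
    }
}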

Solving performance problems in a pub-sub Erlang server

Intro
So I wrote (see some prehistory here) a kind of notification server where clients can subscribe to events and be notified when an event of interest happens. Clients use the HTTP long-poll method to get notifications delivered, and one of the applications of the server is a chat room.

Symptoms
In my case there were several problems:

  1. general lack of performance (I started optimizing somewhere from 200-300 messages per second)
  2. under high load the server would lock up, sometimes for an extended period of time; it does not deliver any messages
  3. even under moderate load performance is not stable and would occasionally drop for some period of time (sometimes to a complete lock-up, sometimes not; sometimes for a couple of seconds, sometimes for longer).
  4. an in-depth investigation with tcpdump showed that the server cannot even accept connections.

Side note on server not accepting tcp connections
It is interesting how it happens though. I don’t know if it is specific Linux 2.6 behavior or the norm, but the sequence is:

  1. client sends SYN
  2. server responds with ACK, but with the acknowledgment number set to some arbitrary big number
  3. client drops connection by sending RST

So should you see similar behavior, note that this is just a symptom, not the disease. The problem is somewhere deeper. In my case increasing the TCP listen backlog from mochiweb’s default of 30 helped a bit, fewer timeouts were observed, but performance still sucked.

Fixing Root Cause
So how are pub-sub servers generally built in Erlang? You get a message being processed in some process; it might have been received from somewhere else or originate in this process. And then you have a bunch of waiting processes representing connected, subscribed clients waiting for a message to be delivered. Each of these processes normally represents a TCP connection, or in my case an HTTP long-polling connection. Delivering a message to such a process releases it from the wait state and allows the message to be delivered to the end user. There is of course some router module or process which determines the subset of processes (PIDs) to which the message should be delivered. How to do this efficiently is a very interesting topic, but not for this post. Then you do something like
lists:foreach(fun(Pid) -> send_message(Pid, Message) end, PidList)

The result of this is that each of the processes from the target group (selected by the router) becomes immediately available for execution. And if the group size is big, chances are that the current process broadcasting the notifications will be preempted. The thing is that this actually causes a context-switching storm. I’m not 100% sure how the Erlang runtime is implemented, but it seems that when a process receives a message it gets some kind of priority and is scheduled for execution, like it happens in some OSes. So the message-sending loop may take quite a while.

Now, if the message broadcast loop is more complex, say it consists of two nested loops, and inside it you do some non-trivial operations followed by sending the message to one particular PID, then things become very bad. The context-switching overhead, no matter how light it is in Erlang, kills your performance.

Recipe

  1. If you have complex loops to calculate which PIDs to send the message to, and you do the message sending inside those loops – rewrite the code. First prepare the list of PIDs (or {Pid, Message} pairs) using whatever complex procedures you have, then send the messages in one shot.
  2. When sending messages to a bunch of PIDs, temporarily boost the priority of your process so it won’t be preempted.

Example:

PidList = generate_pid_list(WhatEver),          % list of {Pid, Msg} pairs prepared up front
OldPri = process_flag(priority, high),          % raise priority, save the old one
lists:foreach(fun({Pid, Msg}) -> send_message(Pid, Msg) end, PidList),
process_flag(priority, OldPri)                  % restore the original priority


That’s basically it. The result is that now I’m reliably achieving about 1.5k messages/sec with all the stuff like HTTP/JSON parsing, ETS operations, logging, etc. I would like to pump this number a few times higher, but at the moment that’s what I can get. I will come back when I learn something new 🙂

PS. You may also find this discussion, which followed the $1000 code challenge, useful:
http://groups.google.com/group/erlang-programming/browse_thread/thread/1931368998000836/b325e869a3eea26a

Fix Eclipse Config to Start with Broken Perspective

I happened to switch to the Ruby Browsing Perspective in Eclipse, and for whatever reason it crashed Eclipse with an out-of-memory error. When I tried to restart, it wouldn’t start, because it tries to load the same perspective again. The solution is to edit one of the Eclipse config files. The file of interest is "workspace/.metadata/.plugins/org.eclipse.ui.workbench/workbench.xml" (in my case on Mac it is ~/Documents/workspace/.metadata/.plugins/org.eclipse.ui.workbench/workbench.xml)

The default perspective is configured by the following directive:
<perspectives activePart="org.rubypeople.rdt.ui.EditorRubyFile" activePerspective="org.rubypeople.rdt.ui.RubyBrowsingPerspective">

So what I had to do was replace org.rubypeople.rdt.ui.RubyBrowsingPerspective with org.eclipse.jdt.ui.JavaPerspective.

You can find the exact IDs of the available perspectives in the same file. A perspective declaration looks like:

<perspective editorAreaTrimState="2" editorAreaVisible="1" fixed="0" version="0.016">
<descriptor class="org.eclipse.jdt.internal.ui.JavaPerspectiveFactory" id="org.eclipse.jdt.ui.JavaPerspective" label="Java"/>
...

The first line opens the perspective declaration, and the <descriptor> tag contains the id attribute, which is what you want. If you search through workbench.xml you will find all available perspective IDs.