All about API throttling. Indicating overload and quota excess.

Foreword

In this article I would like to describe how throttling should be done in HTTP APIs designed to be used by third parties, which may differ from the approach taken when the API is only used by internal clients (web or mobile). More specifically, the considerations presented here are particularly relevant for applications whose APIs are used by some sort of automation or integration app like workato.com. Again, if your API is designed to be used by interactive applications (your own or third-party), the tradeoffs and priorities are different.

TLDR

Return HTTP 500 on general server overload.

Return HTTP 429 with Retry-After header set upon API usage exceeding predefined quota.

API throttling

API throttling is used to protect a system from overload by an excessive amount of otherwise legitimate requests, or to impose a usage quota. When the server decides to reject an incoming request, it responds with a special error code and expects the API client to stop sending requests and back off for some time. It does not protect you from harmful DoS/DDoS attackers, who don't respect the server's response.

Possible throttling policies

When you decide to throttle incoming requests you can do it at different scope granularities. You can have a single global limit for the whole API across all users, or you can account usage per particular user account. You can set different limits for authenticated users and guests, for read and write operations, etc. In any case, do provide clear documentation on the policy you implement. Document the limits and how a user can change them, e.g. by moving to a higher-tier paid plan, or by reaching out to support and having the limits bumped up manually. Below are some popular policies. Here we consider all API operations equal; you may, however, want to count significantly different operations separately. Good examples are the Google and Amazon AWS cloud APIs, which provide different quotas for read and write operations.

1. Application-wide limit

In this policy the application as a whole (the API provider) has a single limit on the requests per second it can handle. It can be hard, like 1000 req/sec, or soft (start bouncing requests off as soon as the server gets overloaded), or both. In general it is good practice to follow the "fail fast" pattern and have a hard limit on the API rate your servers are able to cope with, rather than trying to execute every request with growing latency and watching service performance and stability degrade. This method is easy to implement; however, throttling affects all users equally.
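To make the "fail fast" hard limit concrete, here is a minimal token-bucket sketch in Java (the class name and the numbers are mine, purely illustrative): when the bucket is empty, the request is rejected immediately instead of queueing up behind growing latency.

// Minimal token-bucket sketch of an application-wide hard limit.
// Refill happens lazily on each call; numbers are illustrative.
final class TokenBucket {
    private final double ratePerSec; // sustained rate, e.g. 1000 req/sec
    private final double capacity;   // maximum burst size
    private double tokens;
    private long lastRefill = System.nanoTime();

    TokenBucket(double ratePerSec, double capacity) {
        this.ratePerSec = ratePerSec;
        this.capacity = capacity;
        this.tokens = capacity;
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) / 1e9 * ratePerSec);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // bucket empty: caller should reject the request right away
    }
}

A request handler calls tryAcquire() and, on false, responds immediately (with a 500, per the TLDR above) rather than letting the request degrade everyone's latency.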

2. Limit per application (API key) or per traffic source (IP)

This imposes a limit on a particular application which uses your API. This sounds like a natural thing to do if you give access to your API. The problem with it, however, is that end users can do nothing about it, and in most cases have no visibility into what is happening. A user has applications A and B which are linked but not working because of some limit the user has no idea about. If you impose this limit, set it high enough and provide a channel for application developers to request a quota increase. Formally we can say that the scope of the limit applied is application_id.

3. Limit per user on whose behalf the API is being called

In this case the limit is per user account (the scope is user_id), so many people can use an integration between A and B without interference. The drawback is that if a user has two integrations of application A with, say, B and C, then applications B and C will now interfere. And when throttling kicks in, B and C may not receive an equal share of API "bandwidth", as one application may start retrying more frequently than the other and thus preempt it.


3a. Limit per authentication token (session), optionally limiting the number of tokens per account

To resolve the previous problem you may want to set a limit not per user account, but per authentication token, which will presumably be different for apps B and C. You can achieve a similar result by accounting per (user_id, api_key) tuple. This approach provides good isolation of a particular application pairing for a particular user. A careless approach would be to say it is the users' responsibility to manage access among their integrated apps; in reality users can do nothing about it.

I would recommend a combination of all three quotas at different levels. Let's look at an example to see what I mean:

Domain | Limit | Notes
Per API key (per integrated application) | 10,000 req/sec | We want to empower integrated apps and let them do real work. We may change this limit based on support requests, partner agreements, etc.
Per user account | 100 req/sec | All integrated applications belonging to the same user can do 100 requests per second in total.
Per session | 50 req/sec | Still, we want to keep any single application from consuming all of a user's quota.
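As a sketch of how these three levels could compose, reusing the TokenBucket from section 1 (the in-memory map and key format are my assumptions; a real deployment would keep counters in shared storage such as Redis):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Layered quotas from the table above. Note: this naive version still
// consumes a token from a wider scope even when a narrower scope rejects;
// a production version would check all scopes before consuming any.
final class QuotaGate {
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();

    boolean allow(String apiKey, String userId, String sessionId) {
        return bucket("key:" + apiKey, 10_000).tryAcquire()     // per API key
            && bucket("user:" + userId, 100).tryAcquire()       // per user account
            && bucket("session:" + sessionId, 50).tryAcquire(); // per session
    }

    private TokenBucket bucket(String scope, double ratePerSec) {
        // burst capacity set equal to the sustained rate, for simplicity
        return buckets.computeIfAbsent(scope, s -> new TokenBucket(ratePerSec, ratePerSec));
    }
}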

4. Subdomains or organizations (like workato.zendesk.com)

All of the logic above holds true: you want to limit the number of operations in a particular scope. If you only support integrations at the organization level (i.e. individual org members cannot set up their own links), then you will probably have the following quota levels:

  • application_id (per API key)
  • organization_id (per organization as a whole)
  • (application_id, organization_id) tuple

If, additionally, any org member has API access and can set up their own integrations, then you may also have

  • (organization_id, user_id) (per user in organization)
  • (application_id, organization_id, user_id) (per user in organization using a specific integration)

Rolling limit vs static intervals

There are two approaches to calculating API usage and detecting when to activate throttling. One is to have a rolling window, say of max 1000 requests per hour: you count how many requests you have had in the last 60 minutes and, if it is more than 1000, return the appropriate error code. This is a bit trickier to implement, but it delivers better recovery time. The easier approach is to start counting at the beginning of an hour and reset the counter at the end. This is much simpler to implement, but then the user needs to wait until the end of the hour even if the overuse was negligible.
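Here are both approaches side by side as minimal Java sketches (limits and window size are illustrative):

import java.util.ArrayDeque;

// Static interval: trivial state, but a client that burns the quota early
// in the hour stays blocked until the hour rolls over.
final class FixedWindowCounter {
    private final int limit;
    private final long windowMillis;
    private long windowStart = System.currentTimeMillis();
    private int count = 0;

    FixedWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) { // reset at the interval boundary
            windowStart = now;
            count = 0;
        }
        return ++count <= limit;
    }
}

// Rolling window: remembers a timestamp per request, so it needs O(limit)
// memory, but capacity comes back gradually as old requests age out.
final class RollingWindowCounter {
    private final int limit;
    private final long windowMillis;
    private final ArrayDeque<Long> stamps = new ArrayDeque<>();

    RollingWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // drop requests that fell out of the last window
        while (!stamps.isEmpty() && now - stamps.peekFirst() > windowMillis) {
            stamps.pollFirst();
        }
        if (stamps.size() >= limit) return false;
        stamps.addLast(now);
        return true;
    }
}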

Return codes

429 with Retry-After. 500 is OK if the global API threshold is reached, as a particular API user can do nothing about it other than retry with an increasing back-off. In fact, it may even be easier for API consumers to have different error codes for different expected behaviors: 429 with Retry-After for managed back-off, and 500 for generic back-off on global errors. This is an example of how you should treat API limits in general: clearly indicate the scope of the problem, especially if the expected handling is different. In practice you are unlikely to impose a more complex combination than policy 1 plus any one of 2-4.
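A toy end-to-end illustration using the JDK's built-in com.sun.net.httpserver (the wiring and the per-user check are hypothetical placeholders, not a production server):

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;

public class ThrottlingDemo {
    // global hard limit, reusing the TokenBucket sketch from above
    static final TokenBucket globalBucket = new TokenBucket(1000, 1000);

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/api", exchange -> {
            if (!globalBucket.tryAcquire()) {
                // global overload: the caller can do nothing about it, plain 500
                exchange.sendResponseHeaders(500, -1);
            } else if (userOverQuota(exchange)) {
                // per-user quota exceeded: 429 plus a hint when to come back
                exchange.getResponseHeaders().set("Retry-After", "30");
                exchange.sendResponseHeaders(429, -1);
            } else {
                byte[] body = "ok".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
    }

    static boolean userOverQuota(HttpExchange exchange) {
        return false; // hypothetical placeholder for a per-user quota lookup
    }
}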

Client behavior

Let's talk about clients a little bit. If the client does not obey the throttling, everything discussed above does not make much sense. Client behavior is relatively simple: back off for the time specified by the server if a 429 with Retry-After is received (or a functionally equivalent response as documented by the API provider); otherwise (a 500, or no Retry-After header) back off exponentially (or with another algorithm with an increasing retry interval), as the client should with any retryable error code.
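As a sketch, a well-behaved client loop might look like this (java.net.http, Java 11+; I assume Retry-After carries seconds, though per the HTTP spec it may also be a date, which a complete client would parse as well):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackoffClient {
    public static HttpResponse<String> getWithBackoff(String url, int maxAttempts)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        long backoffMillis = 1_000;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> resp = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 429) {
                // managed back-off: sleep exactly as long as the server asks
                long waitSec = resp.headers().firstValue("Retry-After")
                        .map(Long::parseLong).orElse(backoffMillis / 1000);
                Thread.sleep(waitSec * 1000);
            } else if (resp.statusCode() >= 500) {
                // generic back-off: exponential, no server hint available
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;
            } else {
                return resp; // success or a non-retryable error: hand it back
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts");
    }
}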

API examples

Dropbox

Dropbox limits the request rate per unique (application_id, user_id) tuple. It returns 503 for OAuth 1.0 requests and 429 for OAuth 2.0. See https://www.dropbox.com/developers-v1/core/docs and https://www.dropbox.com/developers-v1/core/bestpractices

Zendesk

Zendesk accounts per organization_id and uses the 429 response code with Retry-After set. Refer to https://developer.zendesk.com/rest_api/docs/core/introduction#rate-limits

Google Calendar

Google has multiple accounted usage scopes. It uses the 403 HTTP result code with more detailed information about the scope inside the JSON-encoded HTTP body. It gives no indication of when to retry and suggests using exponential backoff. See https://developers.google.com/google-apps/calendar/v3/errors#403_daily_limit_exceeded

Hiccups with Java double brace initialization of anonymous objects

Double brace initialization is an often-used technique to work around Java's limitations on working with anonymous objects. In my case I used a HashMap to pass tracking event parameters to the Flurry Analytics class. The code looked like this:

final HashMap<String, String> eventParms = new HashMap<String, String>() {{
    put("fileSize", Long.toString(new File(item.path).length()));
    put("mimeType", mimeType);
}};
FlurryAgent.logEvent(event, eventParms);

This looks pretty nice and avoids creating a local variable and then putting parameter pairs into it. However, I started seeing OutOfMemoryErrors in my Android application. I took an HPROF memory dump and quickly realized that the problem was with anonymous classes: they somehow were of a huge size. I googled around, found this article: http://vanillajava.blogspot.ru/2011/06/java-secret-double-brace-initialization.html, and it became obvious that this was exactly my problem: an anonymous object retains a reference to its outer class, which in my case had members of a big size (Bitmaps etc.). So while Flurry events were kept in the queue, those huge bitmaps were also occupying heap memory.
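The fix was trivial: drop the syntactic sugar and build the map the boring way, so no anonymous subclass (and no implicit reference to the enclosing class) is created:

// Plain HashMap: no anonymous subclass, no hidden reference to the outer
// class, so the enclosing Activity and its Bitmaps can be collected.
final HashMap<String, String> eventParms = new HashMap<String, String>();
eventParms.put("fileSize", Long.toString(new File(item.path).length()));
eventParms.put("mimeType", mimeType);
FlurryAgent.logEvent(event, eventParms);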

Additional note on HPROF: in order to use the standard Memory Analyzer tool you need to convert the HPROF heap dump from the Android format to the regular Java format. Use android_sdk/tools/hprof-conv.
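The invocation is simply hprof-conv <infile> <outfile>; the file names here are just examples:

android_sdk/tools/hprof-conv heap-android.hprof heap-java.hprof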

HTTP benchmark suites

A brief reference on the HTTP benchmark tools I know:

  • ab – Apache Benchmark, fairly simple
  • tsung – written in Erlang, supports multiple worker nodes
  • JMeter – written in Java, supports multiple worker nodes
  • Yandex-Tank – written in Python; good documentation on how to configure the Linux kernel: https://github.com/yandex-load/yandex-tank
  • Loader.io – cloud-based service

Note: I have yet to find my ideal tool. Creating a calibrated high load is not a trivial task. A simple hand-written load generator on Node.js delivered 10%-20% more load than ab. The OS also plays a role here: I found it very hard to generate a stable outgoing load from a Linux machine. Occasionally an outgoing TCP connection handshake would freeze for a multiple of 3s intervals (TCP retry), or the kernel would run out of ephemeral ports as thousands of sockets sit in the FIN_WAIT state. I'm speaking about big RPS here, like >1000 rps per core.
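If you hit the ephemeral-port / FIN_WAIT problem, the usual Linux knobs are the ephemeral port range and the FIN timeout (the values below are illustrative starting points, not recommendations):

# widen the ephemeral port range available for outgoing connections
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# shorten how long sockets linger in FIN_WAIT_2
sysctl -w net.ipv4.tcp_fin_timeout=15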


API versioning scheme for cloud REST services

Traditional approach

Traditionally it is considered good practice to version a REST API like "/api/v1/xxx", and when you transition to the next version of your API you introduce "/api/v2/xxx". This works well for traditional web application deployments where an HTTP reverse proxy / load balancer front end exists. In such a setup an application-level load balancer can also work as an HTTP router, and you can configure where requests go: for example, "v1" requests go to one rack, "v2" to another, and static files to a third.
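For example, with an nginx-style front end (an illustrative config of my own, not taken from any particular deployment) the routing is just a few location blocks:

# route each API version to its own upstream pool
upstream v1_backend { server 10.0.1.10; }
upstream v2_backend { server 10.0.2.10; }

server {
    listen 80;
    location /api/v1/ { proxy_pass http://v1_backend; }
    location /api/v2/ { proxy_pass http://v2_backend; }
    location /static/ { root /var/www; }
}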

Cloud services tend to simplify things, and they often lack path-based HTTP routing capabilities. For example, Microsoft Azure does load balancing at the TCP level, so any request can hit any VM, and your application needs to be prepared to handle all types of requests on each VM instance. This introduces a lot of issues:

  • No isolation. If you have an exploitable bug in the v1 code, all your instances are vulnerable;
  • Development limitations. You need to support the whole code base, making sure 'v1' and 'v2' live together nicely. Consider the situation when some dependency is used in both 'v1' and 'v2' but each requires a specific, different version;
  • Technology lock-in. It is either impossible or very hard to have 'v1' in C# and 'v2' in Python;
  • Deployment hurdle. You need to update all VMs in your cluster each time;
  • Capacity planning and monitoring. It is hard to understand how many resources are consumed by 'v1' calls vs. 'v2'.

Overall it is very risky, and can eventually become absolutely unmanageable.

Cloud aware approach

To overcome these (and probably other) difficulties, I suggest separating APIs at the domain name level: e.g. "v1.api.mysite.com", "v2.api.mysite.com". It is then fairly easy to set up DNS mapping from the names to the particular cloud clusters handling the requests. This way your clusters for "v1" and "v2" are completely independent in deployment and monitoring; they can even run on different platforms (JS, Python, .NET) or in different clouds (Amazon and Azure, for example). You can scale them independently: scale the v1 cluster down as its load reduces, and scale the v2 cluster up as its load increases. You can deploy "v2" into multiple locations and enable geo load balancing while still keeping the legacy "v1" setup compact and cheap. You can have 'v1' in house and 'v2' hosted in the cloud. Overall such a setup is much more flexible and elastic, and easier to manage and maintain.
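In DNS terms this is just a pair of records pointing at different clusters (the target hostnames below are hypothetical):

; each API version resolves to its own, independently scaled cluster
v1.api.mysite.com.  IN  CNAME  legacy-cluster.myhosting.example.
v2.api.mysite.com.  IN  CNAME  v2-lb.cloudprovider.example.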

Of course this technique applies not only to HTTP REST API services but to other kinds as well – plain HTTP, RTMP, etc.

Comparison of BLOB cloud storage services

In the table below I'm trying to compare Amazon AWS S3, Microsoft Azure Storage, Google Cloud Storage, and Rackspace Cloud Files.

Hope this is useful.

Feature | Amazon AWS S3 | MS Azure | Google Cloud Storage | Rackspace Cloud Files
Max account size | Unlimited? | 200 TB per storage account (by default you have 20 storage accounts per subscription) = 4 PB | Unlimited? | ??
Geo redundancy | +, extra price | +, extra price | ? | ?
SLA | Availability SLA: 99.9%; designed to provide 99.999999999% durability and 99.99% availability; reduced redundancy: designed to provide 99.99% durability and 99.99% availability | Availability SLA: 99.9% | Availability SLA: 99.9% | Availability SLA: 99.9%
HTTP | + | + | + | -
HTTPS | + | + | + | +
Container namespace | Global (name collisions possible) | Storage account | Global | Account
Container (bucket) levels | single | single | single | single
Root container | -, but with HTTP each bucket can have a DNS name (not possible with HTTPS) | + | -, but each bucket can have a DNS name | -
Cross-origin (CORS) support | + | - | +, experimental | +
Max number of containers (buckets) | 100 | unlimited | Unlimited? | 500,000
Container name length | 3-63; a few limitations, basic form is "my.aws.bucket"; in US region: 255 bytes, [A-Za-z0-9._-], but DNS-compliant names are recommended | 3-63 characters, valid DNS name | 3 to 63 characters, valid DNS name; if the name is dot-separated, each component <63 characters and 222 in total | 256 after URL encoding, no '/' character
Max files per container | Unlimited, no performance penalty | unlimited | unlimited | Unlimited, but API throttling begins at 50,000 files
List files, max results | 1,000 | 5,000 | ? | 10,000
List files returns metadata | - | + (configurable) | - | -
File (BLOB/object) name length | 1024 | 1024 | 1024 | 1024 bytes after URL encoding
Max file size | 5 TB | 200 GB (block blob), 1 TB (page blob) | 5 TB | Unlimited; multiple files in the same container with the same name prefix can be joined into a single virtual file
Max upload size | 5 GB | 64 MB | 5 MB? | 5 GB
HTTP chunked upload | - | - | + | +
Parallel upload | + | + | ? (resumable uploads of unknown size) | +
Upload data checksumming | + | + | + | ?
Server-side copy | + (up to 5 GB?) | +, supports external sources | + | +
File metadata | 2 KB, ASCII, no update | 8 KB | + (size limit not specified), no update | 90 pairs or 4,096 bytes in total
# data centers | 8 | 8 | ? | 2 (ORD, DFW)
CDN-enabled | + | + | ? | +
Authentication | HMAC-SHA1 signing | HMAC-SHA256 signing | OAuth 2.0, Google cookie, 1-hr JWT token | 24-hr token obtained from the Cloud Authentication Service
Time-limited shared access | + | + | + | +
File auto expiration | + | - | - | +
Server-side encryption | +, AES-256 | - | - | -

PRICING
Storage per GB a month | $0.055-$0.095 fully redundant; $0.037-$0.076 reduced redundancy | $0.055-$0.095 geo redundant; $0.037-$0.07 locally redundant | $0.054-$0.085 full redundancy; $0.042-$0.063 reduced redundancy | $0.10
Outgoing bandwidth per GB | $0.05-$0.12 (US); 1 GB/mo is free | $0.05-$0.12 (US/EU); $0.12-$0.19 (Asia) | $0.08-$0.12 (US); $0.15-$0.21 (Asia-Pacific) | $0.18
Incoming bandwidth | $0 | $0 | $0 | $0
PUT, POST, LIST | $0.005 per 1,000 | $0.01 per 100,000 storage transactions | $0.01 per 1,000 | $0
HEAD, GET, DELETE | $0.004 per 10,000 (no charge for DELETE) | $0.01 per 100,000 storage transactions | $0.01 per 10,000 (no charge for DELETE) | $0

Update (Nov 23, 2013): about 30% cost reduction by Google; up to 24% price reduction by Amazon; the same 24% reduction by Microsoft.

UPDATE (Nov 5, 2012): Azure Storage now has higher limits, a way better internal connectivity network, and higher performance targets (2,000 IOPS per partition, 20,000 IOPS per storage account); see the announcement.