Category Archives: Programming

Brief reference on cloud storage

This is very brief and shallow comparison of data model and partitioning principles in Amazon S3 and Azure Storage. Please also see my feature comparison post of various storage platforms: http://timanovsky.wordpress.com/2012/10/26/comparison-of-cloud-storage-services/
Amazon S3
Getting most out of Amazon S3: http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Their storage directory is lexigraphically-sorted, and leftmost characters used as partition key. It is not said, but looks like you need to have your prefix tree balanced in order for partition balancing to work optimally. I.e. if you prefix with 0-9A-F as suggested in the article, amount of requests going to all 16 prefixes must be roughly the same. This underneath might mean that key space is always partitioned evenly – split into fixed amount of equal key ranges. That is totally my speculation, but otherwise I can not explain why such prefixes would matter.

Microsoft Azure Storage
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
Having glanced over MS docs I’m under impression that Azure storage can split key ranges independently based on the load and size.
Update: The following quote shows that Azure is similar to S3, and I was wrong:

A downside of range partitioning is scaling out access to
sequential access patterns. For example, if a customer is writing
all of their data to the very end of a table’s key range (e.g., insert
key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then
key 2011-06:30-12:00:10), all of the writes go to the very last
RangePartition in the customer’s table. This pattern does not take
advantage of the partitioning and load balancing our system
provides. In contrast, if the customer distributes their writes
across a large number of PartitionNames, the system can quickly
split the table into multiple RangePartitions and spread them
across different servers to allow performance to scale linearly
with load (as shown in Figure 6). To address this sequential
access pattern for RangePartitions, a customer can always use
hashing or bucketing for the PartitionName, which avoids the
above sequential access pattern issue.

Fast IP->location resolution in SQL

Abstract
Many web services nowadays rely on users’ geo location information. It can be required purely for statistics purposes, or to add value to provided service itself. In most cases the only possible way to detect user location is to use IP address of originating request. IP addresses are not hierarchical in a geographical sense, large block of IP addresses is reserved to an ISP, and then broken down for smaller companies or individuals. There is no much order in the system. So a quick database is required in order to be able to make millions of lookups.

Geo location data
Surely enough companies exist which provide such IP-based geo location data. Maxmind is one of them. Fortunately they provide free version of their database under some legal constraints. There are other providers too, I’ve recently came across IpInfoDB. So far so good, now we need a system to make use of the data.

Physical database design
Data consists of (startIp, endIp, country) tuples. If IP address is in the range, it is originated from that country. There blind unassigned ranges, there are private ranges, and as said above multiple (very many) ranges map to a Country. In my table there is > 100K distinct IP ranges.

Physical design in MySQL:
CREATE TABLE LookupIpToCountry (
startIp int(10) unsigned NOT NULL,
endIp int(10) unsigned NOT NULL,
countryId tinyint(3) unsigned NOT NULL,
PRIMARY KEY (startIp),
KEY fk_countryId (countryId),
CONSTRAINT LookupIpToCountry_ibfk_1 FOREIGN KEY (countryId) REFERENCES Countries (id)
) ENGINE=InnoDB;

Other database or engine can be used, provided that startIp is clustering index, i.e. data in the table is physically sorted according to startIp in ascending order.

Fast lookup query
mysql> select * from LookupIpToCountry where startIp<=INET_ATON('74.125.19.99') and endIp>=INET_ATON('74.125.19.99') order by startIp desc limit 1;
+------------+------------+-----------+
| startIp | endIp | countryId |
+------------+------------+-----------+
| 1249693272 | 1249773023 | 230 |
+------------+------------+-----------+
1 row in set (0.00 sec)

The idea here is to find the first region into which IP of interest falls. SELECT ... WHERE startIp<= X ... order by startIp desc will navigate through B-tree index to the last row satisfying startIp criterion and setup scan iterator in descending order. (Because it can only validate second condition endIp>= X by scanning). But the iteration ends immediately because of LIMIT 1 if second condition is satisfied. So effectively whole query is reduced to B-tree lookup of one value. Of course, in case of the miss (IP address does not fall to any of known ranges), there is a performance hit:

mysql> select * from LookupIpToCountry where startIp<=INET_ATON('255.255.255.255') and endIp>=INET_ATON('255.255.255.255') order by startIp desc limit 1;
Empty set (0.39 sec)

In order to fix it you can simplify the query to select * from LookupIpToCountry where startIp<=INET_ATON('255.255.255.255') order by startIp desc limit 1; and check second condition endIp>=INET_ATON('255.255.255.255') at application level, if you think it is worth.

Solving performance problems in pub-sub erlang server

Intro
So I wrote (see some prehistory here) a kind of notification server where clients can subscribe to events and be notified when event of interest happens. Clients use HTTP long poll method to get notifications delivered, and one of the application of the server is chat room.

Symptoms
In my case there were two problems:

  1. general lack of performance (I started optimizing somewhere from 200-300 messages per second)
  2. under high load server would lock up, sometimes for extended period of time – it does not deliver any messages
  3. even under moderate load performance is not stable and would drop eventually for some period of time (sometimes to complete lock up, sometimes not, sometimes for a couple of seconds, sometimes for longer time).
  4. In-depth investigation with tcpdump has shown that server can not even accept connections.

Side note on server not accepting tcp connections
That is interesting how it happens though. I don’t know if it is specific Linux 2.6 behavior of it is a norm, but the sequence is:

  1. client sends SYN
  2. server responds with ACK, but acknowledgment number set to some arbitrary big number
  3. client drops connection by sending RST

So should you see similar behavior note that this is just a symptom, not disease. The problem is somewhere dipper. In my case increasing TCP listen backlog from mochiweb’s default 30 helped a bit, fewer timeouts were observed, but still performance sucked.

Fixing Root Cause
So how pub-sub servers in Erlang are generally built? You get a message being processed in some process, it might have been received from somewhere else or originates from this process . And then you have bunch of waiting processes representing connected subscribed clients waiting for a message be delivered. Each of these processes normally represents TCP connection or in my case HTTP long polling connection. And delivering message to this process releases it from wait state and allows message to be delivered to end user. There is of course some router module or process which determines a subset of processes (PIDs) to which the message should be delivered. How to do efficiently is very interesting topic but not for this post. Then you do something like
lists:foreach(fun(Pid) -> send_message(Pid, Message) end, PidList)

The result of this is that each of processes from the target group (selected by router) becomes immediately available for execution. And if the group size is big, chances are that current process broadcasting this notifications will be preempted. And the thing is that this actually causes context switching storm. I’m not 100% sure how Erlang runtime is implemented, but it seems that if process receives a message it gets some kind of priority and is scheduled for execution, like it happens in some OSes. So the message sending loop may take quite awhile.

Now, if the message broadcast loop is more complex, say it consists of two nested loops, and inside it you do some non trivial operations followed by sending the message for one particular PID, then the things become very bad. Context switching overhead no matter how light it is in Erlang kills your performance.

Recipe

  1. If you have complex loops to calculate to which PID send the message, and you do message sending inside that loop – rewrite the code. First prepare list of PIDs using whatever complex procedures you have, then send messages to those PIDs in one shot.
  2. Then sending messages to bunch of PIDs do temporarily boost performance of your thread so it won’t be preempted.

Example:

PidList = generate_pid_list(WhatEver),
OldPri = process_flag(priority, high), % Raise priority, save old one
lists:foreach(fun({Pid, Msg}) -> send_message(Pid, Msg) end, PidList),
process_flag(priority, OldPri)


That’s basically it. The result is that now I’m reliably achieving about 1.5k messages/sec with all stuff like HTTP/JSON parsing, ETS operations, logging, etc. I would like to pump this number few times higher, but at the moment that’s what I can get. I will come back when learn something new :)

PS. You may also find this discussion followed $1000 code challenge useful:
http://groups.google.com/group/erlang-programming/browse_thread/thread/1931368998000836/b325e869a3eea26a

Fix Eclipse Config to Start with Broken Perspective

I happened to switch to Ruby Browsing Perspective in Eclipse eventually, for whatever reason it crashed Eclipse with Out of Memory exception. When I try to restart it won’t, because it tries to load all the same Perspective. The solution is to edit one of Eclipse config files. The file of interest is "workspace/.metadata/.plugins/org.eclipse.ui.workbench/workbench.xml" (in my case on Mac it is ~Documents/workspace/.metadata/.plugins/org.eclipse.ui.workbench/workbench.xml)

The default perspective is configured by the following directive:
<perspectives activePart="org.rubypeople.rdt.ui.EditorRubyFile" activePerspective="org.rubypeople.rdt.ui.RubyBrowsingPerspective">

So what I had to do is replace org.rubypeople.rdt.ui.RubyBrowsingPerspective with org.eclipse.jdt.ui.JavaPerspective

You can find for exact class names of available perspectives in the same file. Perspective declaration looks like:

<perspective editorAreaTrimState="2" editorAreaVisible="1" fixed="0" version="0.016">
<descriptor class="org.eclipse.jdt.internal.ui.JavaPerspectiveFactory" id="org.eclipse.jdt.ui.JavaPerspective" label="Java"/>
...

First line is opening of perspective declaration, and <descriptor> tag contains id attribute which is what you want. If you search trough workbench.xml you will find all available perspective names.

Get Unix timestamp in Java, Python, Erlang, JavaScript, Go, MySQL

Mostly a note for myself. To Get Unix timestamp value in seconds

Java:

long timestamp = System.currentTimeMillis()/1000

Python:

import time
timestamp = int(time.time())

Erlang:

{Mega, Secs, _} = now(),
Timestamp = Mega*1000000 + Secs,

JavaScript:

var ts = Math.floor(Date.now()/1000); // You can also use new Date().getTime()/1000 but this one is faster

Go:

package main
import (
"time"
"fmt"
)

func main() {
// Get and print integer timestamp
timestamp := time.Now().Unix()
fmt.Printf(“unixtime: %d\n”, timestamp)
// get and print floating point timestamp with sub-second precision
timestampFloat := float64(time.Now().UnixNano()) / 1.0e9
fmt.Printf(“unixtime: %f\n”, timestampFloat)

//Convert integer timestamp back to Time:
t1 := time.Unix(timestamp, 0)
fmt.Printf(“Time: %v\n”, t1)

//Convert floating point timestamp back to Time
t2 := time.Unix(int64(timestampFloat), int64((timestampFloat – float64(int64(timestampFloat)))*1e9))
fmt.Printf(“Time: %v\n”, t2)
}

Working with timestamps in MySQL

mysql> SELECT UNIX_TIMESTAMP('1997-10-04 22:23:00');
-> 875996580
mysql> SELECT FROM_UNIXTIME(1111885200);
+---------------------------+
| FROM_UNIXTIME(1111885200) |
+---------------------------+
| 2005-03-27 03:00:00 |
+---------------------------+

Solution for Erlide hangs Eclipse problem

If you are using Erlide to develop Erlang applications you may sooner or later find yourself in a situation than IDE just hangs completely whenever you try to access any Erlide functions. You can not even go to Erlide section of Eclipse Preferences ! This seems to be a kind of mystery, at least googling didn’t reveal any helpful information except hinting that problem could be due to backend connection issues. As far as I understand, backend is erlang compiler/toolchain used by Erlide. I can imagine it is used when compiling, or when doing syntax check, but why it hangs in editor and preferences dialog?!

I was able to find Erlide logfile (~/Document/workspace_erlide.log under MacOS) And it suggested that Erlide was running in dead loop trying to invoke external Erlang. In particualr it complained about that the following command were failing with result code 126:
/opt/local/var/macports/software/erlang/R12B-3_2/opt/local/lib/erlang//bin/erl -name f0a30ba_erlide@alexey-timanovskys-macbook-pro.local -setcookie erlide
. When I tried to execute it from command line I got errors from erlang. Then I discovered that I have two installations of erlang – R12B 3.2 and 5.0, obviously I used macports to upgrade to most recent one, but it kept previous version too. And erlide was configured to use old erlang setup which didn’t work for whatever reason. So the next step was to find where it is configured, I grepped and found it in
~Documents/workspace/.metadata/.plugins/org.eclipse.core.runtime/.settings/org.erlide.core.prefs
I replaced line

runtimes/erlang/homeDir=/opt/local/var/macports/software/erlang/R12B-3_2/opt/local/lib/erlang/
with

runtimes/erlang/homeDir=/opt/local/lib/erlang/

and restarted Eclipse. It worked!

So be careful, breaking anything in erlang configuration will not allow you to use Erlide, you can’t even go to preferences and change erlang home directory to correct one.

I’m using Eclipse 3.4.1 and tried Erlide versions 0.5.0 and 0.5.1.

How to produce hex array from binary data

Suppose you have a binary file you want to include as byte array into C/C++/Java program. Here how you can generate it (just add opening and closing braces and clean some garbage at the end)

hexdump -v -e " 16/1 \"0x%02X, \" \"\n\""

Something like the following will be produced:


0x7F, 0xFE, 0xF9, 0xE7, 0x9F, 0x7F, 0xFE, 0xF9, 0xE7, 0x9F, 0x7F, 0xFE, 0xF9, 0xE7, 0x9F, 0x7F,
0xFE, 0xF9, 0xE7, 0x9F, 0x7F, 0xFE, 0x0F, 0xFD, 0x3F, 0x4F, 0xAC, 0x0A, 0xA2, 0x00, 0xB0, 0x94,
0x01, 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x , 0x ,

Disappointment about Hibernate and MySQL

One-to-one unidirectional relation implementation in Hibernate sucks. You implement it using <many-to-one ... unique="true">. This creates two indexes in the table for the same field, one for uniqueness constraint, one for foreign key constraint.

Class inheritance when using table-per-subclass is implemented in similar way, again two indexes per field (ID) in child table, one created as primary key constraint, second as foreign key constraint referencing parent class table.

And finally dropping an index in MySQL InnoDB table is very expensive operation – it creates temporary table and reindexes it with remaining indexes. So if you want to drop 4 indexes it will re-create and re-index table 4 times. This can take many hours on any reasonably large database.

I’m disappointed.

Correct implementation of fast server logging in Erlang

Making long story short: just use disk_log.

Actual history with concrete figures (I hate abstract discussions):
I first tried to implement logging as a separate singleton process (gen_server). Formatting was heavily optimized but performed in server thread, the idea was to offload main service threads as much as possible, as I was working on latency-critical application. Output was plain text using raw file output. The best performance I could get was about 300 events/sec (on MacBook pro). And I came to infamous problem that due to scheduling implementation particularities in Erlang, mailbox of logger thread was continuously growing eventually leading to hard crash of Erlang VM (with malloc exception).

I moved all formatting work to calling threads. I used buffered writing at Erlang driver level (delayed_write) and at logger process level (accumulating 1000 events before writing to disk). It worked slightly better – about 1000 events/sec but eventually got to the same problem of malloc out of memory exception due to mailbox memory overflow.

So I decided to give disk_log a try. I used internal format (binary), I also changed to semi-synchronous model (using disk_log:log and disk_log:log_terms) so that multiple processes can be served by log module, but each process have to wait till it’s message is processed. The results are great : about 100K events/sec and memory overflow is inherently impossible. There are of course implications on parsing this logs but they absolutely worth the performance gain it gives.

Profiling a Running Erlang Server

Say you want to profile a running (may be even production) Erlang server. You would do it with fprof. It is relatively easy to profile a function, just follow documentation, let’s see how to profile application.

  1. Start profiling for all processes of interest
    fprof:trace([start, {procs, [whereis(pid1), whereis(pid2)]}]).
    pid1, pid2 etc – registered processes. Fprof will profile them and all spawned processes, so depending on your architecture it is enough to include single process which listens on socket and accepts connections. Documentation states that whereis is not necessary, but it doesn’t work for me otherwise.
  2. After a while stop profiling. Note that trace files are really big, and processing them in consequent steps takes quite awhile, so the the first time you wouldn’t want to run profiling for the whole day :-) Just try 30-60 seconds to begin with. Also keep in mind that load will increase 5-10 times, so if you test in on production server, make sure you have enough resources :)
    fprof:trace([stop]).
  3. Process data. This will process raw data and save result to ‘fprof.trace’ file, or you can give it other name so you can find and load it later.
    fprof:profile().
  4. Analyse and save data to human readable text file – ‘fprof.analysis’
    fprof:analyse([totals, {dest, "fprof.analysis"}]).
  5. Clear all memory. This makes sense if you want to let server continue.
    fprof:stop().

I will not explain how to read resulting text file, they are a bit cryptic and you have to read fprof documentation, frankly speaking I don’t understand it fully :)