
Solving performance problems in a pub-sub Erlang server

Intro
So I wrote (see some prehistory here) a kind of notification server where clients can subscribe to events and be notified when an event of interest happens. Clients use the HTTP long-poll method to get notifications delivered, and one of the applications of the server is a chat room.

Symptoms
In my case there were several problems:

  1. a general lack of performance (I started optimizing somewhere around 200-300 messages per second)
  2. under high load the server would lock up, sometimes for an extended period of time, during which it does not deliver any messages
  3. even under moderate load performance was not stable and would eventually drop for some period of time (sometimes to a complete lock-up, sometimes not; sometimes for a couple of seconds, sometimes for longer).
  4. In-depth investigation with tcpdump showed that the server could not even accept connections.

Side note on the server not accepting TCP connections
It is interesting how this happens, though. I don’t know whether it is specific Linux 2.6 behavior or the norm, but the sequence is:

  1. client sends SYN
  2. server responds with ACK, but with the acknowledgment number set to some arbitrarily large number
  3. client drops the connection by sending RST

So should you see similar behavior, note that this is just a symptom, not the disease. The problem is somewhere deeper. In my case increasing the TCP listen backlog from mochiweb’s default of 30 helped a bit (fewer timeouts were observed), but performance still sucked.

Fixing Root Cause
So how are pub-sub servers in Erlang generally built? You get a message being processed in some process; it might have been received from somewhere else or originate in this process. Then you have a bunch of waiting processes representing connected, subscribed clients that are waiting for a message to be delivered. Each of these processes normally represents a TCP connection, or in my case an HTTP long-polling connection, and delivering a message to such a process releases it from its wait state and allows the message to be delivered to the end user. There is of course some router module or process which determines the subset of processes (PIDs) to which the message should be delivered. How to do that efficiently is a very interesting topic, but not for this post. Then you do something like
lists:foreach(fun(Pid) -> send_message(Pid, Message) end, PidList)
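To make the mechanics concrete, here is a minimal sketch (not the actual server code; send_message/2 and the {notify, Message} format are assumptions for illustration) of a connection process parked in receive and a publisher releasing it:

%% Subscriber side: the long-poll connection process parks here
%% until a notification arrives or the poll times out.
wait_for_message(Timeout) ->
    receive
        {notify, Message} ->
            {ok, Message}   % hand Message back to the HTTP response code
    after Timeout ->
        timeout             % the client will simply reconnect and poll again
    end.

%% Publisher side: delivering a notification is just an Erlang send,
%% which wakes the waiting connection process up.
send_message(Pid, Message) ->
    Pid ! {notify, Message}.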

The result of this is that each of the processes from the target group (selected by the router) becomes immediately available for execution. And if the group size is big, chances are that the process broadcasting these notifications will be preempted. The thing is that this actually causes a context-switching storm. I’m not 100% sure how the Erlang runtime is implemented, but it seems that when a process receives a message it gets some kind of priority and is scheduled for execution, like it happens in some OSes. So the message-sending loop may take quite a while.

Now, if the message broadcast loop is more complex, say it consists of two nested loops, and inside them you do some non-trivial operations followed by sending the message to one particular PID, then things become very bad. Context-switching overhead, no matter how light it is in Erlang, kills your performance.

Recipe

  1. If you have complex loops to calculate which PIDs to send the message to, and you do the message sending inside those loops – rewrite the code. First prepare the list of PIDs using whatever complex procedures you have, then send messages to those PIDs in one shot.
  2. When sending messages to that bunch of PIDs, temporarily boost the priority of your process so it won’t be preempted.

Example:

PidList = generate_pid_list(WhatEver),   % step 1: prepare the full target list ({Pid, Msg} pairs) up front
OldPri = process_flag(priority, high),   % step 2: raise priority, save the old one
lists:foreach(fun({Pid, Msg}) -> send_message(Pid, Msg) end, PidList),
process_flag(priority, OldPri)           % restore the original priority
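One refinement worth considering (my own note, not part of the original recipe): if send_message/2 can crash, wrapping the loop in try ... after guarantees the old priority is restored:

OldPri = process_flag(priority, high),
try
    lists:foreach(fun({Pid, Msg}) -> send_message(Pid, Msg) end, PidList)
after
    process_flag(priority, OldPri)  % runs even if a send crashes
end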


That’s basically it. The result is that now I’m reliably achieving about 1.5k messages/sec with all the stuff like HTTP/JSON parsing, ETS operations, logging, etc. I would like to pump this number a few times higher, but at the moment that’s what I can get. I will come back when I learn something new 🙂

PS. You may also find this discussion that followed the $1000 code challenge useful:
http://groups.google.com/group/erlang-programming/browse_thread/thread/1931368998000836/b325e869a3eea26a

Tuning Linux firewall connection tracker ip_conntrack

Overview
If your Linux server has to handle lots of connections, you can run into a problem with the ip_conntrack iptables module. It limits the number of simultaneous connections your system can have. The default value (in CentOS and most other distros) is 65536.

To check how many entries in the conntrack table are occupied at the moment:

cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count

Or you can dump the whole table:

cat /proc/net/ip_conntrack

The conntrack table is a hash table (hash map) of fixed size (8192 buckets by default), which is used for the primary lookup. When the slot in the table is found, it points to a list of conntrack structures, so the secondary lookup is done by list traversal. 65536/8192 gives 8 – the average list length. You may want to experiment with this value on heavily loaded systems.
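As a quick illustration (assuming the default 8192 buckets; the exact bucket count for your system is shown in the section below), you can estimate the current average list length on a live system like this:

# echo $(( $(cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count) / 8192 ))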

Modifying conntrack capacity
To see the current conntrack capacity:

cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max

You can modify it by echoing a new value there:

# echo 131072 > /proc/sys/net/ipv4/netfilter/ip_conntrack_max
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
131072

Changes are immediate but temporary – they will not survive a reboot.

Modifying number of buckets in the hash table
As mentioned above, just changing this parameter will give you some relief if your server was at the cap, but it is not an ideal setup. For 1M connections the average list becomes 1048576 / 8192 = 128, which is a bit too much.
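For example (just an illustration of the arithmetic, not a tested recommendation), to keep the average list length around the default 8 with 1M connections you would need on the order of 1048576 / 8 = 131072 buckets.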

To see the current size of the hash table:

cat /proc/sys/net/ipv4/netfilter/ip_conntrack_buckets

which is a read-only alias for the module parameter:

cat /sys/module/ip_conntrack/parameters/hashsize

You can change it on the fly as well:

# echo 32768 > /sys/module/ip_conntrack/parameters/hashsize
# cat /sys/module/ip_conntrack/parameters/hashsize
32768

Persisting the changes
Making these changes persistent is a bit tricky.
For the total number of connections just edit /etc/sysctl.conf (CentOS, Red Hat, etc.) and you are done:

# conntrack limits
net.ipv4.netfilter.ip_conntrack_max = 131072
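To apply the values from /etc/sysctl.conf immediately, without rebooting:

# sysctl -p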

It is not so easy with the hash table size. You need to pass a parameter to the kernel module at load time. Add to /etc/modprobe.conf:

options ip_conntrack hashsize=32768

Memory usage
You can find out how much kernel memory each conntrack entry occupies by grepping /var/log/messages:

ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack

So 1M connections would require roughly 304MB of kernel memory (1,000,000 × 304 bytes).

Toward a million-user long-poll HTTP application – nginx + erlang + mochiweb :)

Intro
First, this post and its title are inspired by the following great articles, which you absolutely must read if you are interested in the subject:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1/
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-2/
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3/

I’m quite far from 1M users, but I’m still getting a load which the out-of-the-box configuration cannot cope with. What I’m building is a long-poll HTTP server which implements chat and some other real-time notifications for web clients. Nginx is used as a reverse proxy (an HTTP router, basically). Let’s say we want to handle N connections simultaneously (N should be much bigger than 1000 🙂). What parameters need to be changed? Note: all numbers are approximate.

1. Nginx
First, the number of file handles (descriptors) provided by the OS. It is configured in the script which launches nginx; in my case it is /etc/init.d/nginx. So I just add

ulimit -n N*2

Mind that a proxy naturally uses two sockets per client connection (uplink and downlink).
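For example (the numbers are only an illustration), for N = 100000 expected client connections this would be:

ulimit -n 200000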

Now let’s take a look at the Nginx config file (/etc/nginx/nginx.conf for me). Here is an extract from the Nginx manual on the events module:

The worker_connections and worker_processes from the main section allow you to calculate the max_clients value:
max_clients = worker_processes * worker_connections
In a reverse proxy situation, max_clients becomes
max_clients = worker_processes * worker_connections/4

So we have to set

events {
    worker_connections N*4;
}
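As a concrete illustration of the formula above (again, the numbers are just an example), for N = 100000 and a single worker process this would be:

worker_processes 1;

events {
    worker_connections 400000;
}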

Then, as we use Nginx as a proxy not for a regular HTTP server but for a long-polling one, we have to tune the proxy parameters:

location /my_url {
    proxy_pass http://127.0.0.1:8000;
    proxy_buffering off;
    proxy_read_timeout 3600;
}

So we forward all requests to “/my_url” to the HTTP server running on localhost port 8000. We disable buffering in Nginx, and tell it that it may take a while for the server to respond, so Nginx should wait and not time out.

OK, we are done with Nginx, let’s move on to the server.

2. Erlang
Again, in the start script we set the number of available file descriptors:

ulimit -n N

Then we pass two additional parameters to the Erlang VM:

erl +K true +P N ...your params here

The +K option enables kernel polling, which saves some CPU cycles when handling a multitude of connections. +P says how many parallel processes the Erlang VM can have; by default it is 32768. My application is based on the mochiweb library and uses one process per connection (the usual case for Erlang server applications), so we need at least N processes. Note that some recommend setting it to the maximum possible value, roughly 137M, but that leads to the Erlang VM allocating about 1GB of memory. That is not a big deal per se, as it is in virtual address space, but I can imagine that the internal reference tables for this memory heap and the Erlang process tables/mailboxes/whatever are also big and can cause some overhead. So I’d prefer to be on the safe side and set it to the value really required.

3. Application
The last thing we want is to tell the mochiweb server that we want more than the default 2048 simultaneous connections. So I change the start parameters:

mochiweb_http:start([{max, N}, {name, ?MODULE}, {loop, Loop} | Options]).