Category: Research & Dev

nginx ncache performance and stability

BACKGROUND: We’ve been running nginx successfully for a long, long time now without issue.  We were really pleased with it, and migrated all of our CDN servers across to it several months ago.  While there were a few little configuration defaults that didn’t play too nicely, we ironed out the problems within a few days and it’s business as usual, serving between 700TB and 1.8PB per month (as of August 2008!).

Now we have the problem that our proprietary systems that actually cache our customers’ sites just aren’t fast enough to handle the fast-changing content our customers are demanding.  So we’ve been weighing up a few options:

1) Deploy Varnish
2) Deploy Squid
3) Deploy ncache

We actually TRIED to deploy Varnish after seeing so many recommendations, but at the end of the day it couldn’t keep up.  It really should be noted that our systems are all 32-bit, and I get the feeling Varnish would perform a lot better on 64-bit, but when you have over a hundred servers all running on 32-bit processors… upgrading them all in one hit just isn’t an option.  The problem with Varnish was that it would drop connections, seemingly because it had run out of memory (although we’re not 100% sure on this as the debugging wasn’t overly useful).
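
For anyone in a similar 32-bit boat: a single 32-bit process only gets roughly 3GB of address space, so it may be worth explicitly capping Varnish’s cache storage rather than trusting the defaults.  We can’t say whether it would have saved us, and the listen port, backend address, path and size below are placeholders, not our actual setup:

# explicitly cap the cache file so varnishd doesn't exhaust the 32-bit address space
varnishd -a :80 \
         -b 127.0.0.1:8080 \
         -s file,/var/lib/varnish/storage.bin,2G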

So…..we tried……we failed.  NEXT

Our next attempt was to look into deploying Squid.  This one proved a bit complex to integrate into our existing CDN network because of squid’s limitations.  We would have to write a bit of custom code to make this one work, so it has been made a last resort.

So, option 3 (which is the whole point of this blog entry) was to try out the ncache nginx module.  We installed the ncache 2.0 package along with nginx 0.6.32, set it all up, and things ran absolutely beautifully.  We manually tested the caches and it was working great – very fast, very efficient.
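
For anyone wanting to try the same combination, ncache builds like any other third-party nginx module – you just point nginx’s configure script at the module source.  Roughly (the source and install paths below are placeholders):

cd nginx-0.6.32
./configure --prefix=/usr/local/nginx --add-module=/usr/local/src/ncache-2.0
make
make install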

We were extremely happy until the next day, when one CDN server started reporting high load issues.  On analysing the server, it turned out nginx was trying to consume several gigabytes of memory – ouch.  So we killed it off and restarted it, and it ran fine again.  Maybe it was just a temporary glitch?

Nope – over the following week, we had several servers experience identical issues: nginx consumes all available memory to the point where it simply stops being able to serve files and requires a restart.  It looks like a case of either:

A) ncache 2.0 has a serious memory leak
B) ncache 2.0 doesn’t handle high volumes of cached data very well.

We’ve tried to work through the issues to make sure they’re not configuration problems, but no luck.  At the end of the day it’s going to be cheaper and easier to make some mods to Squid and deploy that instead.

So for anybody wondering about ncache performance and stability, I’d simply say that it’s a great product, but not really production-ready at this stage if you’re doing high volumes of data.


PHP Parse XML from mysql

Today we were looking at migrating data from one database to another and thought we’d try out MySQL’s command-line XML output format. The format is fairly simple, so we decided to run with it instead of CSV. To parse the XML we use PHP, and soon discovered there aren’t many nice code snippets out there on the net for parsing XML with PHP. Anyway, we cobbled the following code together to read MySQL XML from stdin and parse it into $rows.

Our original XML file looks something like this….

<?xml version="1.0"?>

<resultset statement="SELECT * FROM httpd_access_log_tmp WHERE pending='Y'">
  <row>
    <field name="ev_timestamp">2008-06-30 02:45:10</field>
    <field name="site">rackcorp.com</field>
    <field name="hostname">XX.XXX.XX.XX</field>
    <field name="method">GET</field>
    <field name="url">/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</field>
    <field name="authuser">-</field>
    <field name="agent">Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv</field>
    <field name="referer">http://www.mysql.com/XXXXXXX</field>
    <field name="status">200</field>
    <field name="servedfrom">XXXXXXXXX</field>
    <field name="timeused">0</field>
    <field name="bytesbody">9948</field>
    <field name="bytesin">499</field>
    <field name="bytesout">10227</field>
    <field name="httpprotocol">HTTP/1.1</field>
    <field name="httphost">XXXXXXXXXXXXX</field>
    <field name="pending">Y</field>
  </row>

  <row>
  etc…

So now the code to parse it:

<?php
// Read MySQL --xml output from stdin and parse it into $rows.
if (!($fp = fopen("php://stdin", "r"))) {
    die("Couldn't read STDIN.");
}

$rows = array();
$columnName = '';
$rowcount = 0;

if (!($xml_parser = xml_parser_create())) {
    die("Couldn't create parser.");
}

// Remember which column we're in whenever a <field> element opens.
function startElementHandler($parser, $name, $attrib) {
    global $rowcount, $rows, $columnName;
    switch ($name) {
        case "FIELD":   // element names arrive upper-cased (parser case folding)
            $columnName = $attrib["NAME"];
            break;
        default:
            $columnName = "";
    }
}

// Clear the current column when any element closes; bump the row counter on </row>.
function endElementHandler($parser, $name) {
    global $rowcount, $rows, $columnName;

    $columnName = "";   // finished with this field's data

    if ($name == "ROW") {
        $rowcount++;
    }
}

// Character data may arrive in several chunks, so append rather than overwrite.
function characterDataHandler($parser, $data) {
    global $rowcount, $rows, $columnName;

    if ($columnName == "") {
        return;
    }

    if (!isset($rows[$rowcount][$columnName])) {
        $rows[$rowcount][$columnName] = "";
    }
    $rows[$rowcount][$columnName] .= $data;
}

xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

while (!feof($fp)) {
    $data = fread($fp, 4096);
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
fclose($fp);

var_dump($rows);

$rows now looks a little something like this:

array(2) {
  [0]=>
  array(17) {
    ["ev_timestamp"]=>
    string(19) "2008-06-30 02:45:10"
    ["site"]=>
    string(15) "rackcorp.com"
    ["hostname"]=>
    string(12) "XX.XXX.XX.XX"
    ["method"]=>
    string(3) "GET"
    ["url"]=>
    string(61) "/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    ["authuser"]=>
    string(1) "-"
    ["agent"]=>
    string(50) "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv"
    ["referer"]=>
    string(46) "http://www.mysql.com/XXXXXXX"
    ["status"]=>
    string(3) "200"
    ["servedfrom"]=>
    string(12) "XXXXXXXXX"
    ["timeused"]=>
    string(1) "0"
    ["bytesbody"]=>
    string(4) "9948"
    ["bytesin"]=>
    string(3) "499"
    ["bytesout"]=>
    string(5) "10227"
    ["httpprotocol"]=>
    string(8) "HTTP/1.1"
    ["httphost"]=>
    string(17) "XXXXXXXXXXXXXXXXXXXXXXXXX"
    ["pending"]=>
    string(1) "Y"
  }
  [1]=> etc…

That should do it. Then do with $rows whatever you want – we end up throwing it back into another few databases as part of a web-hosting access-log warehousing scheme.
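
For reference, this is roughly how the script gets fed – the credentials, database name and script filename below are just placeholders:

mysql --xml -u someuser -p somedb \
      -e "SELECT * FROM httpd_access_log_tmp WHERE pending='Y'" | php parsexml.php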


csync2: Install and setup csync2 on CentOS 5

This blog post details how to build and install csync2 from source, as well as how to configure it.

Step 1) Download and install required libraries

If you haven’t already done so, install graft – it’s great. Here’s a tutorial on how to install graft:
http://blog.rackcorp.com/?p=16

Go to ftp://ftp.gnupg.org/gcrypt/libgpg-error/ and download the latest version of libgpg-error

cd /usr/local/PKG_BUILD
wget ftp://ftp.gnupg.org/gcrypt/libgpg-error/libgpg-error-1.6.tar.bz2
bzip2 -d libgpg-error-1.6.tar.bz2
tar -xvf libgpg-error-1.6.tar
cd libgpg-error-1.6
./configure --prefix=/usr/local/PACKAGES/libgpg-error-1.6
make
make install
graft -i /usr/local/PACKAGES/libgpg-error-1.6/

Go to http://www.gnupg.org/download/index.en.html, and download libgcrypt and install it:

cd /usr/local/PKG_BUILD
wget ftp://ftp.gnupg.org/gcrypt/libgcrypt/libgcrypt-1.4.1.tar.bz2
bzip2 -d libgcrypt-1.4.1.tar.bz2
tar -xvf libgcrypt-1.4.1.tar
cd libgcrypt-1.4.1
./configure --prefix=/usr/local/PACKAGES/libgcrypt-1.4.1
make
make install
graft -i /usr/local/PACKAGES/libgcrypt-1.4.1

Go to http://www.t2-project.org/packages/libtasn1.html, and download libtasn1

cd /usr/local/PKG_BUILD
wget ftp://ftp.gnutls.org/pub/gnutls/libtasn1/libtasn1-1.4.tar.gz
tar -xvzf libtasn1-1.4.tar.gz
cd libtasn1-1.4
./configure --prefix=/usr/local/PACKAGES/libtasn1-1.4
make
make install
graft -i /usr/local/PACKAGES/libtasn1-1.4

If you get a conflict, you can just remove the conflicting file and retry the graft:

rm -f /usr/local/share/info/dir
graft -i /usr/local/PACKAGES/libtasn1-1.4

Go to http://www.sqlite.org/download.html, and download the source tree file:

cd /usr/local/PKG_BUILD
wget http://www.sqlite.org/sqlite-3.5.9.tar.gz
tar -xvzf sqlite-3.5.9.tar.gz
cd sqlite-3.5.9
./configure --prefix=/usr/local/PACKAGES/sqlite-3.5.9
make
make install
graft -i /usr/local/PACKAGES/sqlite-3.5.9

Go to http://www.gnu.org/software/gnutls/releases/, and download the latest gnutls:

cd /usr/local/PKG_BUILD
wget http://www.gnu.org/software/gnutls/releases/gnutls-2.4.0.tar.bz2
bzip2 -d gnutls-2.4.0.tar.bz2
tar -xvf gnutls-2.4.0.tar
cd gnutls-2.4.0
./configure --prefix=/usr/local/PACKAGES/gnutls-2.4.0
make
make install
graft -i /usr/local/PACKAGES/gnutls-2.4.0

Once again, if you get a conflict:

rm -f /usr/local/share/info/dir
graft -i /usr/local/PACKAGES/gnutls-2.4.0

Go to http://librsync.sourceforge.net/ and download the latest librsync source:

cd /usr/local/PKG_BUILD
wget http://internode.dl.sourceforge.net/sourceforge/librsync/librsync-0.9.7.tar.gz
tar -xvzf librsync-0.9.7.tar.gz
cd librsync-0.9.7
./configure --prefix=/usr/local/PACKAGES/librsync-0.9.7
make
make install
graft -i /usr/local/PACKAGES/librsync-0.9.7

Step 2) Download and install csync2

Go to http://oss.linbit.com/csync2/ and download the latest csync2 source:

cd /usr/local/PKG_BUILD
wget http://oss.linbit.com/csync2/csync2-1.34.tar.gz
tar -xvzf csync2-1.34.tar.gz
cd csync2-1.34

Now I couldn’t get csync2 to locate libsqlite. Seems it doesn’t like the latest version (3) anyway, so we have to go back and download & install an older version:

cd /usr/local/PKG_BUILD
wget http://www.sqlite.org/sqlite-2.8.17.tar.gz
tar -xvzf sqlite-2.8.17.tar.gz
cd sqlite-2.8.17
./configure --prefix=/usr/local/PACKAGES/sqlite-2.8.17
make
make install
graft -i /usr/local/PACKAGES/sqlite-2.8.17

Now, back to csync2. Note that we also run ldconfig just to make sure all our libraries are findable:

ldconfig
cd /usr/local/PKG_BUILD/csync2-1.34
./configure --prefix=/usr/local/PACKAGES/csync2-1.34
make
make install
graft -i /usr/local/PACKAGES/csync2-1.34

And we’re done! csync2 is compiled and installed! On to step 3…

Step 3) Set up xinetd

By default my CentOS 5 server did not have xinetd installed, so let’s install it:

yum install xinetd

Create the following file as /etc/xinetd.d/csync2:

service csync2
{
        disable         = no
        protocol        = tcp
        socket_type     = stream
        wait            = no
        user            = root
        server          = /usr/local/sbin/csync2
        server_args     = -i
}

The csync2 protocol isn’t a standard one so we need to add it:

echo "csync2 30865/tcp" >> /etc/services

Then let’s restart xinetd:

service xinetd restart
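
If you want to double-check that xinetd actually picked the service up, it doesn’t hurt to confirm something is now listening on the csync2 port:

netstat -tln | grep 30865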

And xinetd is ready…..now we need to configure csync2….

Step 4) Configuring csync2

TO BE CONTINUED…
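
In the meantime, to give a rough idea of where this is heading, a minimal /etc/csync2.cfg looks something like the following – the group name, hostnames and paths here are placeholders, not our production setup:

group mycluster
{
        host cdn1.example.com cdn2.example.com;

        key /etc/csync2.key_mycluster;

        include /var/www/html;
        exclude *.log;
}

The shared key gets generated once with csync2 -k /etc/csync2.key_mycluster and copied to every host in the group, and a sync run is kicked off with csync2 -x – but more on that next time.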


nginx error – upstream sent too big header

We recently decided to test nginx with our CDN.  It seems lighttpd just isn’t up to the task of serving high connection-rate services.  It was grinding to a halt on our systems as they started processing 400–600 connections per second.  We ended up running multiple lighttpds on single servers to alleviate this problem, but in doing so we were losing all the benefits of lighttpd such as stat caching etc.

Enter nginx.  Nginx is another web server similar in nature to lighttpd.  We thought we’d give it a go with just one of our clients which processes in the vicinity of 800 connections per second.  Here is our commentary:
1) nginx seems very well designed and coded – I find the configuration absolutely brilliant and very self-explanatory to a developer.  I think a non-developer might have a bit of a hard time with it, but I’m sure they could work it out.  One of the biggest complaints in all the forums is how badly nginx is documented – it’s NOT true.  I found the core features to be very well documented (in English), albeit in various locations.

2) Speed – the improvement was immediately noticeable.  This web server software is indeed faster than lighttpd.  Anybody wishing to argue – go home.  It’s faster under both small loads and large loads.

3) High-volume of connections – Handles it smoothly.  Lighttpd really had problems here.  Nginx doesn’t seem to care.

We did get one error which didn’t seem to be documented anywhere, so we ended up going through the source code to track down the issue.  The error message we were getting was:

“upstream sent too big header while reading response header from upstream”

This would cause 502 Bad Gateway errors.  Apparently it’s an expected error with the default fastcgi configuration when the buffer size isn’t big enough to hold the upstream response headers (not sure why, considering I think the default is 4k?).  Anyway – here are the fastcgi parameters that appear to be working for us:

fastcgi_connect_timeout 60;
fastcgi_send_timeout 180;
fastcgi_read_timeout 180;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 256k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
fastcgi_intercept_errors on;
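
For context, these sit alongside the usual FastCGI directives, for example in the location block that hands requests off to the FastCGI backend – the backend address and document root below are placeholders rather than our real config:

location ~ \.php$ {
        fastcgi_pass   127.0.0.1:9000;    # placeholder FastCGI/PHP backend
        fastcgi_index  index.php;
        fastcgi_param  SCRIPT_FILENAME  /var/www/html$fastcgi_script_name;
        include        fastcgi_params;
        # ...plus the buffer/timeout parameters above...
}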

Overall, we’re going to let nginx run for a few weeks and see if we run into any more little problems.  If we don’t, we’ll look into changing our entire CDN over to nginx, as lighttpd just isn’t cutting it performance-wise, and its bug support seems almost non-existent these days.  Personally, I think lighttpd’s days are numbered and I’m expecting its adoption to start declining sometime over the next 6 months…


Spam Statistics #2

Some of you who read our old NSCorp blog will recall that we love periodically looking at spam statistics in order to best tweak our anti-spam systems.

A question was posed on a hosting forum a few days ago: just why do spammers target secondary/tertiary mail servers?  We of course suggested it comes down to two things:

1) The secondary mail server is more likely to accept the email, due to the various configurations out there that relay whole domains rather than individual accounts, thereby increasing the chance of at least getting a non-delivery bounce sent to somebody (a bounce is better than no bounce).
2) When the email reaches the primary, anti-spam software tends to look at the immediate hop rather than past hops, and because of the trust relationship with the secondary/tertiary mail servers, the message is less likely to be detected as spam.

That’s all good, reasonable logic behind why spammers would do such things.  But do they?  We decided to analyse our mail servers and check what volume of spam really does come in via the primary vs. secondary vs. tertiary mail servers over the month of April:

So there we have some raw numbers… suggesting that of the email the primary server receives, between 48% and 66% is spam, while the figures are between 80% and 94% for the secondary and tertiary mail servers.  These figures aren’t too much of a surprise, as we came up with similar numbers internally late last year.

While we’ve got the numbers open, let’s do a day-by-day analysis for April too:

Spam Statistics by Day, April 2008 – RackCorp Primary MX

So it looks like Tuesday/Wednesday are the biggest days for receiving spam by around 20%, and the weekends are the lowest.

There are a few limitations that could affect the quality of the above stats:

1) Some RackCorp customers have greylisting turned on for their accounts/aliases. The above stats are only based upon what happened AFTER greylisting – so it’s probable that the spam percentages are actually higher.
2) The per-day stats are based upon total emails – more real emails are sent on certain days, which would impact upon the stats.
3) For heavily spammed domains, RackCorp employs a tertiary MX record that accepts connections but never accepts email (see the sketch below). This means some spamming software will attempt delivery, never receive an error, and never try again. Once again this means it is probable that the spam percentages are higher.
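
To make that concrete, the MX records for a heavily spammed domain end up looking roughly like this – the hostnames are placeholders, and mx3 is the tarpit that accepts connections but never accepts mail:

example.com.    IN  MX  10  mx1.example.com.    ; primary - the real mailbox server
example.com.    IN  MX  20  mx2.example.com.    ; secondary - relays to the primary
example.com.    IN  MX  30  mx3.example.com.    ; tertiary tarpit - connects but never accepts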

So that wraps up this stats session. There’s nothing much we could use there to improve services, except perhaps increasing the score applied to all emails received via secondary/tertiary MX servers. For now….we’re fairly happy with our anti-spam services. If things get worse, we might have to fall back to this information.

– RackCorp
