Category: Research & Dev

FPM Migration from PHP 5.2 to PHP 5.3

Some installation/configuration notes we came up with while integrating PHP 5.3 into our standard deployment system:

Compile options
PHP 5.3 has FPM built in now, so there’s no longer any need to apply diffs to the source, but you DO still have to specify “--enable-fpm”.  All other FPM parameters have been deprecated now (except the run as user/group ones, but we use chrootuid anyway).  That’s all that’s required for compilation.

The FPM configuration file has changed significantly, from an XML format into an ini format.  Here’s a typical fpm.ini that we’re using:

pid = /var/log/
error_log = /var/log/php-fpm.log
log_level = notice

emergency_restart_threshold = 10
emergency_restart_interval = 1m
process_control_timeout = 5s
daemonize = yes

;this is the IP/port to listen for fastcgi requests on
listen =
listen.backlog = 1024
listen.allowed_clients =
; not sure if we need to specify user/group here. It’s indicated as required, but if we chrootuid php so it’s already running as another user, it seems to be ignored
user = nobody
group = nobody

; This stuff actually works in PHP 5.3 – and works well!!
pm = dynamic
pm.max_children = 3
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 1
pm.max_requests = 500

request_terminate_timeout = 40s
request_slowlog_timeout = 300s
slowlog = /var/log/php-fpm.log.slow
rlimit_files = 2048

rlimit_core = 0
catch_workers_output = no


The way we start PHP has changed slightly too.  We used to run:

bin/php-cgi --fpm

and php would load up the compile-time default configuration file.

With 5.3, please note that the binary to run has changed and moved: it’s in the sbin directory and is called php-fpm. We also have to pass in the fpm configuration ini file as a parameter (as there isn’t a compile time option for this anymore):

sbin/php-fpm --fpm-config=/conf/php-fpm.ini


And that’s it. We’ve also tested out the dynamic process spawning, as this was of most interest to us: on shared servers RAM is like gold! It seems to work well. It appears to gauge the need for additional processes by whether a request has to wait to be served, and it seems to drop off unused processes after 10 seconds or so. I couldn’t find anything documented about this, so I guess it’s just some magical internal algorithm, but from my tests it looks to work and work well.
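
For anyone adapting the pm.* values above, here is a minimal Python sketch of a sanity check to run before a restart. It assumes the flat "key = value" layout shown in our fpm.ini, and the ordering rules it checks (min_spare_servers <= start_servers <= max_spare_servers <= max_children) are my understanding of what FPM validates at startup, so treat them as an assumption rather than the definitive rule set. The function names are hypothetical helpers, not part of PHP.

```python
def parse_fpm_ini(text):
    """Parse simple 'key = value' lines, skipping blanks, ';' comments and [sections]."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_dynamic_pm(settings):
    """Return a list of problems with the pm.* values (an empty list means OK)."""
    if settings.get("pm") != "dynamic":
        return []  # the ordering rules below only matter for pm = dynamic
    try:
        max_children = int(settings["pm.max_children"])
        start = int(settings["pm.start_servers"])
        min_spare = int(settings["pm.min_spare_servers"])
        max_spare = int(settings["pm.max_spare_servers"])
    except (KeyError, ValueError) as exc:
        return [f"missing or non-numeric pm setting: {exc}"]

    problems = []
    if min_spare < 1:
        problems.append("pm.min_spare_servers must be at least 1")
    if max_spare < min_spare:
        problems.append("pm.max_spare_servers is below pm.min_spare_servers")
    if max_spare > max_children:
        problems.append("pm.max_spare_servers exceeds pm.max_children")
    if not (min_spare <= start <= max_spare):
        problems.append("pm.start_servers is outside the min/max spare range")
    return problems


if __name__ == "__main__":
    # The pm.* values from the fpm.ini above.
    example = """
    pm = dynamic
    pm.max_children = 3
    pm.start_servers = 1
    pm.min_spare_servers = 1
    pm.max_spare_servers = 1
    """
    print(check_dynamic_pm(parse_fpm_ini(example)) or "pm settings look consistent")
```

With the values from our config the check reports no problems; raising pm.start_servers above pm.max_spare_servers is the kind of slip it is meant to catch.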


sysctl.conf and other settings for high performance webservers


There are a couple of key settings on CentOS servers that significantly help high performance web servers, and that we always put in by default across all of our managed machines:

  • net.ipv4.tcp_syncookies = 1
    While it’s more commonly seen by people wanting to prevent denial of service attacks from taking down their websites, some people don’t realise that a heavy traffic site is not much different from one that is under a constant denial of service!
  • net.ipv4.netfilter.ip_conntrack_max = 300000
    Netfilter under Linux does a great job, but it can sometimes be artificially restricted by some OS limitations that try to prevent some traffic from taking up too many system resources.  This is one of those settings that I feel is often set too low.  We ramp it up to 300,000, which means Netfilter can track up to 300,000 “sessions” (such as an HTTP connection) at one time.  If you’ve got 10,000 people on your website at once, you’ll definitely want to adjust this one!
  • net.ipv4.tcp_max_syn_backlog = 10240
    An application such as Nginx is very capable of serving as many TCP connections as an operating system and hardware can handle.  With that said, there will be a backlog of TCP connections in a pending state before the user-space application such as Nginx gets to call accept().  The key here is to make sure the backlog of unaccepted TCP connections never exceeds the above number, otherwise there will be the equivalent of packet loss on the connection packets, and some clients will experience delays, if not a complete outage.  We find 10240 is a high enough number for this on current servers.
  • net.core.netdev_max_backlog = 4000
    This one is important, particularly for servers that operate past 100MBit/s.  It governs how many incoming packets can be queued when the interface receives them faster than the kernel can process them.  At gigabit speeds on busy servers, seeing the queue exceed the default of 1000 is pretty common.  We usually put this up to 4,000 for web servers.
  • kernel.panic = 10
    While unrelated to performance, there’s nothing worse on a busy web server than seeing a kernel panic.  While this isn’t common, when you do push a server to its limits you can certainly come across kernel panics more often than you might otherwise.  This setting makes the kernel reboot automatically 10 seconds after a panic instead of sitting there until someone intervenes, which helps reduce downtime on production servers.
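
To see how a given machine compares against the list above, here is a minimal Python sketch that reads the live values out of /proc/sys. The helper names (`sysctl_path`, `audit`) are hypothetical, and it assumes each key maps to a file by swapping dots for slashes, which holds for the keys used here. A key that cannot be read (for example, the conntrack key when the module isn’t loaded) is reported as None.

```python
from pathlib import Path

# The values recommended in the list above.
RECOMMENDED = {
    "net.ipv4.tcp_syncookies": 1,
    "net.ipv4.netfilter.ip_conntrack_max": 300000,
    "net.ipv4.tcp_max_syn_backlog": 10240,
    "net.core.netdev_max_backlog": 4000,
    "kernel.panic": 10,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'kernel.panic' to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def audit(recommended=RECOMMENDED, root="/proc/sys"):
    """Return {key: (current, wanted)} for every key that differs from the list.

    'current' is None when the key is missing or not a plain integer.
    """
    differences = {}
    for key, wanted in recommended.items():
        try:
            current = int(sysctl_path(key, root).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            current = None
        if current != wanted:
            differences[key] = (current, wanted)
    return differences


if __name__ == "__main__":
    for key, (current, wanted) in audit().items():
        print(f"{key}: current={current} recommended={wanted}")
```

This only reports differences; it deliberately changes nothing, so the values still go into sysctl.conf by hand.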

We usually change the TCP congestion control algorithm too, by adding the following to rc.local:

  • /sbin/modprobe tcp_htcp

You will also want to increase the send queue on your interface by adding the following to your rc.local (you’ll want to change eth0 to your interface name):

  • /sbin/ifconfig eth0 txqueuelen 10000

There’s a lot of commentary online about changing tcp memory buffers and sizes.  Personally I haven’t found them to make much difference on a suitably spec’d server.  One day I might get around to having a look at how these affect performance, but for now, the above settings are known to achieve gigabit HTTP serving speeds for our webservers so that’s good enough for me!



New RackCorp option in the ongoing fight against spam

We have now added a new option in the ongoing fight against unwanted spam.  As of early this morning, all RackCorp mail servers in Australia, US, and Canada have been updated to RackCorpMailServices-1.14.  In doing so, we have included a new option in our online portal to help manage spam.

You can find the option here when managing accounts (and similarly for managing aliases):

Spam Defer on RBL

With this option, you can now effectively defer ALL inbound email that matches the realtime blacklists.  Up until now, you were only able to greylist (defer for 10 minutes) any inbound email matching these blacklists.  By permanently deferring the email, you ensure that you do NOT receive any email that is coming from a blacklisted source, AND that the sender will eventually receive notification that you did not receive that email (explaining that it is because they are blacklisted).

It’s not all good though – the downside to doing this is that if someone IS blacklisted and is sending you something urgent, then they might not find out about it for several days.  Exactly how long until they do find out varies between 4 hours and 10 days, and is dependent on the sender’s ISP / mail infrastructure (not ours!).

When do we recommend using this option?  If you’re receiving so much spam that you’re finding it hard to do business, then activate this option – it’ll help a lot.


Choosing a “Critical Services” provider – checklist

I’ve been itching to tackle this subject for so long, but time is hard to find these days!  This isn’t purely a marketing blog: RackCorp offers international services in LOTS of countries (20+ now), and quite often it’s not cost-effective for our customers to have a fully decked out presence in some locations, so we too have to choose our providers carefully.

– If you’re serving speed-critical videos, files, game services, or telephony solutions, then you should try to choose someone who has equipment close to your customers.
– If your customers will be uploading / downloading LOTS of data, try to utilise peering networks / centres that your customers may be connected to as much as possible, as it will save your customers money.
– If you’ve got a small budget, and your service is not speed critical then consider going with equipment in the US or UK.  It may not be the fastest to your customer’s locations (unless they are in the US or UK!), but you’ll find it gives the best return for the money.

– Does the provider perform regular maintenance on their equipment?
– Does the provider replace hardware regularly?
– What versions of firmware/software is your provider running – have they been superseded?
– When was the last time the provider ran without mains power for a test?
– Does the provider notify customers of software updates in advance, and do they have alternatives if your system is unable to upgrade?

Okay, so things go wrong.  Hardware fails, things screw up.  It happens.  Now what?
– Does the provider have at least N+1 hardware on standby – and what’s the turnaround time in getting +1 operational?
– Does the provider have network redundancy that will result in no service degradation even if a primary link fails?
– Does the provider have the systems in place to automatically detect failures and respond to them?

Your site goes down – you don’t know why.  It might be your provider’s fault, it might be your fault.  This is where many people might panic….but you shouldn’t if you have addressed the following:
– Does your provider actively respond to outages, or do you have to notify them first?
– Do you have a phone number for your provider?  Do they answer or provide voicemail services to which they respond in a reasonable timeframe?
– Does your provider have a “support ticket” system where issues can be tracked, or is it all verbal / email based?  Support tickets are a requisite when dealing with anything more than a few hundred customers.
– Does your supplier communicate with you so that you understand what is going on?  They need to speak on your level, else there is a risk of miscommunication.
– How many staff does your provider have?  Can they survive at a critical moment without key persons?  (Murphy’s law applies to hosting in some extreme ways….)

There’s a problem.  Your customers are complaining, but you don’t know what it could be.  This is where you need help!
– Does the provider publish issues, large and small?
– Does the provider accept blame for issues related to them, or do they try to conceal things?
– Does the provider have a technical team able to troubleshoot hardware, network, and software issues?

So now you should just go and take the above list of questions and give them to your prospective service provider to fill in the blanks.  WRONG!
Most large providers will at best send you a services overview PDF, or at worst stick your request in their trash can.  There are just too many ‘shopping’ customers in this industry who demand way too much for what they’re willing to pay.  So what you NEED to do is browse their website and answer as many questions as you can FIRST.  Then, if you find you still have questions, sure, email a few of them to get clarification.

It’s amazing how many lies are throughout the hosting industry.  Some are hidden, some are blatant.  Some are ‘industry expected’, some are astonishing.  So let’s make a checklist of things you can check yourself:

  1. Do you see the term “UNLIMITED” used on their website? Is your use of that service governed by anything such as bandwidth restrictions (if you’ve got a 10Mbit connection with unlimited traffic, then chances are you’re not going to do much over 3TB of data a month).  If you’re being offered unlimited disk space and you think you’re actually going to use more than average, then look elsewhere.
  2. Fair-use policies. I like to think of these as “This is what we’ll offer you, but don’t expect us to actually provide it” policies.  If you’re expecting to use anything more than an average ‘service’ would use – then look elsewhere.
  3. SLAs. Does the provider state what happens if they fail to meet their 99.9999999% SLA?  No?  Look elsewhere because chances are they don’t know what happens either.  Does the provider offer more than 99.99% SLA?  If so, look elsewhere – it’s obvious their marketing team hasn’t spoken to their finance / legal team, or that their SLAs are ultimately meaningless to you as a customer.
  4. Backups. What is the company’s backup policy?  How frequently do they back up?  Do they charge to provide you with access to your backup?
  5. Head over to a DNS checking service such as intodns and enter in your provider’s domain name.  Some things to check:
    – “NS records from your nameservers” section should show at LEAST 2 nameservers.  The IP addresses that show up should NOT be very close to each other (i.e. X.X.X.1 and X.X.X.2).  Preferably one or more of those X’s will be different.  This indicates the provider has their own nameservers on redundant networks.
    – “Glue for NS records” section should indicate good things.  While this won’t break anything, it does indicate a provider’s ability to keep their systems running at their best performance.
    – “MX Records” section should have at least 2 mail servers listed there – once again, look for them to be somewhat different IPs, not close together, as per before.
  6. ADVANCED LOOKUP: Software version check – fire up telnet and enter their website in as the hostname, and specify port 80 for the port.  Once it is connected, type:
    GET / HTTP/1.1
    Host: HOSTNAME  (where HOSTNAME is their website’s hostname, followed by pressing Enter twice)
    You should get a bunch of information up the top of the page which may include Apache / IIS / lighttpd version numbers, PHP version numbers, or other versions.  Use these to look up on the net to see just how old these versions are – you might be surprised at the number of hosting companies running on software 5 or 6 years out of date.  If they don’t maintain their own website, then they certainly won’t maintain yours.
    I should point out here, that less information is better information from a security perspective.  Many audits will frown upon servers that give you version information, so if you don’t get any versions, or don’t recognise anything then it’s probably a GOOD thing.
  7. Google for their name. Do you find more bad reviews than good reviews?  Just remember that complainers are usually a lot louder than praisers, and even the most well run company can NOT satisfy everyone.  Remember that some (many!) hosting companies are into the dodgy practice of posting fake reviews about themselves.  Don’t believe any review unless you can see a customer URL alongside it – and if it is there, check it still exists and isn’t “under maintenance” or simply non-existent.
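
The numbers in items 1 and 3 are easy to check for yourself. Here is a minimal Python sketch of the arithmetic, assuming a 30-day month, a 365-day year, decimal terabytes, and a link running flat out the whole time; the function names are just illustrative.

```python
SECONDS_PER_DAY = 86_400


def max_monthly_transfer_tb(link_mbit, days=30):
    """Upper bound on data moved in a month by a saturated link, in decimal TB."""
    bytes_per_second = link_mbit * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12


def allowed_downtime_seconds(sla_percent, days=365):
    """Downtime permitted over the period by an availability SLA, in seconds."""
    return (1 - sla_percent / 100) * days * SECONDS_PER_DAY


if __name__ == "__main__":
    # "Unlimited" traffic on a 10Mbit port tops out at roughly 3TB a month.
    print(f"{max_monthly_transfer_tb(10):.2f} TB")                # 3.24 TB
    # 99.99% allows a little under an hour of downtime per year...
    print(f"{allowed_downtime_seconds(99.99) / 60:.1f} minutes")  # 52.6 minutes
    # ...while 99.9999999% allows about three hundredths of a second.
    print(f"{allowed_downtime_seconds(99.9999999):.3f} seconds")  # 0.032 seconds
```

Seeing a nine-nines SLA expressed as a fraction of a second per year makes it fairly clear why nobody can state what happens when they miss it.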

So that’s it.  Not really how I wanted to put all this information, but it’s a start.  Now here comes the marketing piece for RackCorp 🙂

  • RackCorp has multiple DNS servers in multiple countries including US, UK, Germany, Canada, and Australia.  We try to localise these where possible so domains from those countries primarily use nameservers in those countries.  Our DNS services have never had a complete failure EVER (or even come close)
  • RackCorp has multiple mail servers running in HOT-HOT redundancy mode in multiple datacenters in multiple countries.  This means if a whole country goes offline (for whatever reason), our customers will STILL be able to access POP/IMAP/SMTP/Webmail services without even realising.  Our email services have NEVER had an outage for more than a few minutes – we have NEVER lost a single customer email due to an outage.
  • RackCorp server monitoring is closely tied in with our DNS system and is configured to automatically change announcements depending upon service availability / performance.  This lets us AUTOMATICALLY switch between webservers, mail servers, CDN networks, and even more depending upon whether those services are available.
  • RackCorp focuses on critical website hosting in multiple countries.  We employ geo-serving technology to protect against localised DDoS attacks, and to better speed up systems.
  • In 2008, our primary datacentre for US-based services (including DNS, email and our own website) was the H1 datacentre with The Planet.  An explosion occurred at the datacentre, rendering it completely offline.  While most of our competitors crossed their fingers and hoped for the datacentre to come back up swiftly, our services, and hundreds of our customer services, were back up and running within 5-15 minutes from alternative locations.  The datacentre remained offline for 3 days due to the incident, with many end-customers of our competitors left offline because suppliers had no offsite redundancy, offsite backups, email redundancy, or anything of the sort.

We don’t get praise much here at RackCorp – because customers tend not to notice even the most disastrous events that we live through.  I see so many hosting companies have a whinge that it’s not their fault when a datacenter loses power, or when their network provider accidentally stops announcing their routes.  That’s part of this business – it’s about how you prepare for the worst and deal with it that makes you a good provider for critical services.


nginx ncache performance and stability

BACKGROUND: We’ve been running nginx successfully for a long long time now without issue.  We were really pleased with it, and migrated all of our CDN servers across to using it several months ago.  While there were a few little configuration defaults that didn’t play too nicely, we ironed out the problems within a few days and it’s business as usual, serving between 700TB and 1.8PB (In August 2008!) per month!

Now we have the problem that our proprietary systems that actually cache our customers’ sites just aren’t fast enough to handle the fast-changing content that our customers are demanding.  So we’ve been weighing up a few options:

1) Deploy Varnish
2) Deploy Squid
3) Deploy ncache

We actually TRIED to deploy varnish after seeing so many recommendations, but at the end of the day it couldn’t keep up.  It really should be noted that our systems are all 32bit, and I get the feeling varnish would perform a lot better on 64bit, but when you have over a hundred servers all running on 32bit processors…..upgrading them all in one hit just isn’t an option.  The problem with varnish is that it would drop connections, seemingly because it had run out of memory (although we’re not 100% sure on this as the debugging output wasn’t overly useful).

So…..we tried……we failed.  NEXT

Our next attempt was to look into deploying Squid.  This one proved a bit complex to integrate into our existing CDN network because of squid’s limitations.  We would have to write a bit of custom code to make this one work, so it has been made a last resort.

So, option 3 (which is the whole point of this blog entry), was to try out the ncache nginx module.  So we installed the ncache 2.0 package along with nginx 0.6.32.  We set them all up and things ran absolutely beautifully.  We manually tested caches and it was working great, very fast, very efficient, and well, it was great!

We were extremely happy until the next day, when one CDN server started reporting high load issues.  In analysing the server, it seems nginx was trying to consume several gigabytes of memory – ouch.  So we killed it off and restarted it, and it ran fine again.  Maybe it was just a temporary glitch?

Nope – over the following week, we had several servers experience identical issues – that is, where nginx consumes all available memory until such a point that it simply stops being able to serve files and requires a restart.  Looks like a case of either:

A) ncache 2.0 has a serious memory leak
B) ncache 2.0 doesn’t handle high volumes of cached data very well.

We’ve tried to work through the issues to make sure they’re not configuration issues, but no luck.  At the end of the day it’s going to be cheaper and easier to make some mods to squid and deploy that instead.

So for anybody wondering about ncache performance and stability, I’d simply say that it’s a great product, but not really production-ready at this stage if you’re doing high volumes of data.
