Tales of an IT Nobody

devbox:~$ iptables -A OUTPUT -j DROP

Certbot on Amazon Linux without using Yum – Fix [Errno 2] No such file or directory May 18, 2019

So let’s say you’re running an aging version of Amazon Linux and don’t want to blow up your system by wedging in yum repos from distributions that aren’t quite in line with the CentOS derived Amazon Linux.
Instructions on the web call on users to use Fedora or RHEL yum repos for CentOS users; but on Amazon Linux, you’re kind of twice-removed.

So long-story short, here’s some fodder for those who want the benefits of LetsEncrypt without the fluff of a repo.

My instructions will be for Apache/HTTPD, but you’ll see the key linch-pin item below.

First, start by downloading Certbot by hand:

Second, back up your Apache HTTPD configuration:

Third, test certbot-auto and let it ‘bootstrap’ dependencies:
** An error is likely here**

Error from certbot – “creating virtual environment” gives an error: “No such file or directory”:
After running the command above – you may see this error after it installs the dependencies for certbot:

After some searching, I’ve found that this is really easy to solve!

To fix, upgrade pip and REMOVE virtualenv:

Now you’ll see that certbot works like a champ!

Once you’ve established a working test configuration with certbot – you should see a LetsEncrypt test certificate on your site, it’s time to run the real command without the --test flag.

certbot-auto --debug --apache

If all goes well, you’ll have a completely valid and proper SSL certificate for free via LetsEncrypt!

I won’t cover the automation aspect as there are already endless write-ups on how to do that.

No Comments on Certbot on Amazon Linux without using Yum – Fix [Errno 2] No such file or directory
Categories: apache servers

Commando style: triage dashboard May 20, 2015

If you’re working on a foreign system, or one that doesn’t have the bells and whistles that make you feel at home, sometimes you need to improvise tools on the spot by chaining together commands, etc.

This little one-snip serves as a “dashboard” approach for quickly assessing consumption,



No Comments on Commando style: triage dashboard
Categories: servers tools

iptables list – a helpful ~/.bashrc alias May 10, 2015

I grow tired of asking iptables to give me my line numbers for insert/deletes, and sometimes, I just want it to “cut to the chase” and give me numbers.

Toss this into your ~/.bashrc for making life easier:

then run source ~/.bashrc  to reload.

Output sample:

Voila! Now you’ve got counters (helpful for debugging btw), numeric IP’s and line numbers at your fingertips!

No Comments on iptables list – a helpful ~/.bashrc alias
Categories: servers

Securing Elasticsearch – Part 1 May 28, 2014

The most frequently asked question for ElasticSearch and security is “how do I require login”?

Once you’ve answered and implemented the answer to that question; a larger, truly more troublesome issue looms. The same principals used to secure ElasticSearch; typically a proxy fronted by Apache/nginx use various auth techniques. If you know what you’re doing, you have different endpoints in that proxy for controlling who can do GET/POST/DELETE requests, possibly pre-determining the index and type.

If you REALLY know what you’re doing, you’ll be far more concerned over the payload.

While reading through the documentation, I was surprised; no, SHOCKED to see that ElasticSearch ships with a security flaw as severe as remote code execution as an intentional feature through the dynamic scripting component of a body payload for ES.

If you’re responsible for running ElasticSearch servers…

You must examine how queries are sent into the server. If your web developers are sending near-verbatim DSL queries to ElasticSearch without any further filtration except auth and index constriction, please read this.

A malicious person could modify the payload directly to read files directly off of your filesystem and serve them up in ES.

The article contains a proof of concept (POC) link – simply download and modify the file and point it to your ES server and see if you’re vulnerable.

I think in most cases, dynamic remote scripting isn’t a big deal to turn off. So I’d strongly suggest following the advice on this page:  script.disable_dynamic: true


Stay tuned for Part 2 of a more obsessive approach to securing ElasticSearch for use on public search interfaces.

2 Comments on Securing Elasticsearch – Part 1
Categories: nosql security servers

Skeletons: Older versions of ntp and not using DNS March 17, 2014

A while ago (years), I reluctantly set up ntp on some servers using an IP address for the source server; at the time, using a DNS name in ntp.conf was incompatible with the ntp/ntpd version and I didn’t want to go out of my way to compile it from scratch.

Today, I realized that I’ve been slowly getting bit in the butt, several years later.
Back then, the IP addresses were supposed to be rock solid ntp references. But over the years, and finally about a month ago, they all came offline.

I would not have caught the drift if it wasn’t for my use of pt-heartbeat (mk-heartbeat) and doing a review of our cacti graphs.

NTP not synchronized

This is what pt-heartbeat will show when your NTP service isn’t working


Usually I check them every monday as a routine, but I’ve been so busy for the last several months I haven’t had time to do that. I figured if it hits the fan, our alerts/thresholds will let me know. Which on a few occasions worked as needed for an Apache server.

pt-heartbeat, a tool of the Percona Toolkit has a common table across replicated servers that each one updates a record in the table with a datetime value.

The time difference between the value for server A, replicated to server B is the ‘lag’. The lag can/should be due to temporary spikes in traffic (or intentional delaying). Needless to say my gut sank when I saw that something weird as going on that was causing a small, yet unrecoverable, and linearly increasing lag time. After quickly confirming that SHOW SLAVE STATUS confirmed that everything’s up to date, it was quickly apparent that the mechanisms involved with the graph generation were at fault. Upon a quick side-by-side examination of server A and server B’s ‘heartbeat’ table, it stood out that the times were off by a few seconds from each other.

I restarted the pt-heartbeat daemons and the issue persisted – the next culprit was simple to identify: ntp.

Output of ntpq -p quickly showed that the ntp server hadn’t synchronized for over a month.

Over the years through periodic apt upgrades; the newer version of ntp came through the pipeline and all that was needed was an update of ntp.conf to use a new host (I opted for stratum 1 ‘time.nist.gov’)




No Comments on Skeletons: Older versions of ntp and not using DNS
Categories: linux servers

MySQL’s max_connect_errors … 6 years later… August 2, 2013

Having recently been bitten by the awful default value (10) for max_connect_errors on a production server – I’m having a very hard time coming to terms with who the heck thought this would be a good way to do it.

This type of “feature” allows you to effecitvely DOS yourself quickly with just one misconfigured script – or Debian’s stupid debian-sys-maint account not replicating.

I’ve been thinking about how I could avoid this scenario in the future – upping the limit was a no brainer. But another item of curiosity: How do I know what hosts are on the list?


Have a look at this, 6 years later: http://bugs.mysql.com/bug.php?id=24906


So up until 5.6 (still brand new in the DBMS world) – there was no practical way to find out who was on this list.

The default needs to be changed, and this feature should be able to be turned off…


No Comments on MySQL’s max_connect_errors … 6 years later…
Categories: linux mysql rant servers

Using Logstash to log SMTP/email bounces like a boss July 26, 2013

I’ve recently worked on a customized emailing suite for a client that involves bulk email (shutter) and thought I’d do a write up on a few things that I thought were slick.

Originally we decided to use AWS SES but were quickly kicked off of the service because my client doesn’t have clean enough email lists (a story for another day on that).

The requirement from the email suite was pretty much the same as what you’d expect from SendGrid except the solution was a bit more catered toward my client. Dare I say I came up with an AWESOME way of dealing with creating templates? Trade secret.

Anyways – when the dust settled after the initial trials of the system and we were without a way to deliver bulk emails and track the SMTP/email bounces. After scouring for pre-canned solutions there wasn’t a whole lot to pick from. There were some convoluted modifications for postfix and other knick knacks that didn’t lend well to tracking the bounces effectively (or implementable in a sane amount of time).


Getting at the bounces…

At this point in the game, knowing what bounced can come from only one place: the maillog from postfix. Postfix is kind enough to record the Delivery Status Notification (‘DSN’) in an easily parsable way.


Pairing a bounce with database entries…

The application architecture called for very atomic analytics. So every email that’s sent is recorded in a database with a generated ID. This ID links click events and open reports on a per-email basis (run of the mill email tracking).  To make the log entries sent from the application distinguishable, I changed the message-id header to the generated SHA1 ID – this lets me pick out which message ID’s are from the application, and which ones are from other sources in the log.

There’s one big problem though:

Postfix uses it’s own queue ID’s to track emails – this is that first 10 hex digits of the log entry. So we have to perform a lookup as such:

  1. Deterimine the queue ID from the application generated message-id log entry
  2. Find the queue ID log entry that contains the DSN


(Those of you know where this is going – skip this next paragraph).

This is a problem because we can’t do two at the same time. The time when a DSN comes in is variable – this would lead to a LOT of grepping and file processing – one to get the queue ID. Another to find the DSN – if it’s been recorded yet.


The plan…

The plan is as follows with the above in mind:

  • Continuously trail the log file
  • Any entries matching message-id=<hash regex> – RECORD: messageId, queueId to a database table(‘holding tank’).
  • Any entries with a DSN – RECORD: queueId, dsn to holding tank table
  • Periodically run a script that will:
    • Process all rows that have a DSN and messageId that is NOT NULL. (This will give us all messages that were sent from our application that have DSN’s recording)
    • Delete all rows that do NOT match that criteria AND are older than 30 days old (they’re a lost cause and hogging up space in the table).

Our database table looks like this:

Making the plan stupid simple, enter: Logstash

We use Logstash where I work for log delivery from our web applications. In my experience with Logstash I learned that it is a tool with so much potential for what I was looking for. Progressive tailing of logs and so many built in inputs, outputs and filters for this kind of work it was a no brainer.

So I set up Logstash and three stupid-simple scripts to implement the plan.

Hopefully this is self explanatory to what the scripts take for input – and where they’re putting that input (see holding tank table above)

Setting up logstash, logical flow:

Logstash has a handy notion of ‘tags’ – where you can have the system’s elements (input, output, filter) enact on data fragments tagged when they match a criteria.

So I created the following configuration logic:



  • bounce – matches log entries containing a status of ‘bounced’ – (not interested in success or defer)
  • send – matches log enries matching the message-id regex for the application generated ID format


  • EXEC: LogDSN.php – for ‘bounce’ tags only
  • EXEC: LogOutbound.php – for ‘send’ tags only 


Full config file (it’s up to you to look at the logstash docs to determine what directives like grep, kv and grok are doing)


Then the third component – the result recorder runs on a cron schedule and simply queries records where the message ID and DSN are not null. The mere presence of the message ID indicates the email was sent from the application. the DSN means there’s something to enact on by the result recorder script.


Simple overview

* The way I implemented this would change depending on the scale – but for this particular environment, executing scripts instead of putting things into an AMQP channel or the likes was acceptable. 

1 Comment on Using Logstash to log SMTP/email bounces like a boss

MySQL command line – zebra stripe admin tool February 7, 2013

I came up with a cool usage for the zebra stripe admin tool.  In MySQL you can set a custom pager for your MySQL CLI output; so one can simply set it to the zebra stripe tool and get the benefit of alternated rows for better visual clarity.

Something like ‘PAGER /path/to/zebra’ should yield you results like the image below.

Zebra stripe tool used as a MySQL pager

Zebra stripe tool used as a MySQL pager


You can always adjust the script to skip more lines before highlighting; you can also modify it if you’re savvy to the color codes to just set the font color instead of the entire background (which may be preferable but not a ‘global’ solution so the script doesn’t do it).

1 Comment on MySQL command line – zebra stripe admin tool
Categories: linux mysql servers tools

pt-online-schema-change and partitions – a word of caution October 30, 2012

Just a quick word of wisdom to those seeking to cleverly change the partitioning on a live table using a tool like pt-online-schema-change from the Percona Toolkit.

You will lose data if you don’t account for ALL of your data’s ranges upfront. (E.g: MAXVALUE oriented partition). The reason being is how online schema changing tools work; they create a ‘shadow copy’ of the table, set up a trigger and re-fill the table contents in chunks.


If you try to manually apply partitions to a table that has data from years ranging from 2005-2012 (And up), say something like:


And it has data from 2012 or above, MySQL will give you this safety error: (errno 1526): Table has no partition for value 2012 – this is a safeguard!


Now, if you use a hot schema tool, the safeguard isn’t in place; the copying process will happily rebuild the data with the new partitions and completely discard everything not covered by the partition range!

Meaning, this command will successfully finish, and discard all data from 2012+:


Just a heads up for those wanting to do something similar, you’ll definitely want to ALWAYS have something like this too fall back on as your final partition:

No Comments on pt-online-schema-change and partitions – a word of caution
Categories: databases mysql servers

Why Rackspace is bad! July 10, 2012

Fanatical support != Customer service, at all!

Recently I’ve migrated a customer that’s been on Rackspace for 6 years, and paying a handsome penny for it at that. The migration was to Amazon Web Services (AWS) and I sent a friendly reminder to the client to cancel the RS account (9 days in advance to the renewal date).

Here’s how things went down:

RS: “We require a 30-day written notice to cancel your account”.

This is on a host that is on a month-to-month basis and the costs have been on an incline. In fact, the cost was to go up $10/month next month. (Suffice it to say, it’s not much compared to the overall monthly bill).

So I’m thinking to myself, well that’s a crappy “policy”. I give them a ring on behalf of my client and see if we can leverage some flexibility. I simply ask to waiver the 30 day ‘penalty’. Not even a pro-rate for the days unused for the month.

The RS rep is quick to tell me how many people call that dislike that policy and try to get somewhere with it, but they stick to it. At this point I’m thinking, wow – this will be a little challenge.

I explained that the cost was going up and is something that wasn’t agreeable and therefore there is good cause to waive that type of penalty. No go.

We go circular a bit on customer service – I blab a bit about how the competition (Linnode, AWS, etc) allow me to do more than they offer, for cheaper and not have a penalty. I also say that it’s odd that in most circumstances you’ll get a counter offer from a retention specialist (you know, we’ll knock off 10% on your hosting). Still kind of a nod-n-smile go screw yourself attitude from this rep.

Then he says “We value your feedback and it helps us become better”.
I respond: “OH REALLY? You start the call by telling me how many pissed off people are calling about this policy, try to stick it to me as well and then give me a line like that?

Enough is enough – I ask for a manager and exclaim I understand if he “has to say that” but this is an unacceptable situation.

The manger’s response: They won’t deal with me. They’ll only talk with my client. (Who in turn, told the client to hose off in the same manner).

All I can say is this:

  1. Rackspace prices aren’t that hot. Look elsewhere.
  2. The fanatical support thing is cute, but the customer service is pure garbage in the above context. I’ve never been treated like that. I’ve had better luck with credit card companies and land lords than this.
  3. There’s this 30-day thing (Beware!)
  4. If they give you support, they’ll want the root password for your server.
  5. Their SLA is a lie, it reports on the 30 minute interval. Which means they can be down for 29 minutes every hour and not record it as downtime.
  6. Their backup system on dedicated hosting is a bloated, un-tamed mess if you let them manage it, they let the ‘rack’ account on my client’s server exceed 60GB of crap that should be cleaned up. E.g.: backup software updates, provisioning/monitoring tools
  7. They ask for root before sudo configuration (See #4)
6 Comments on Why Rackspace is bad!
Categories: rant servers

Why are we spending so much time refuting? July 5, 2012

There’s a nice juicy war going on in the ‘data / web’ sector, that seems more heated than I can remember.

It essentially boils down to sensationalist claims from the likes of MongoDB and MemSQL, which in turn draw refuting remarks from industry professionals that are typically embedded with RDBMS technologies.

The typical responses to these new ‘hipster’ systems are usually transaction/consistency centric – as that’s where the RDBMS systems shine – they can perform wonderfully while being ACID compliant.

Or in the case of Node, refuting the ‘Apache doesn’t have concurrency, node is better’ arguments. I have a hunch 99% of the Node fanboys have a damn clue how capable Apache itself is.

There’s also things like Node.js that rub the seasoned people the wrong way, perhaps it’s the sensationalism without actually proving anything? (Check the first few comments) Or the utter lack of security focus? (That’s what bugs me) – I also think it has to do with their approach to enter the market: guns blazing, criticizing other solutions and hoisting their own as THE single option with more tenacity than appropriate for such an immature project. Guys in the trenches can’t stand that crap, we know it’s just another tool to get the/(‘a’) job done in a particular scenario.

But really, I think about how much time is wasted on these subjects going back and forth, so let’s stop wasting time. Be open minded to the new technologies as tools for a particular job and stop making all or nothing stories out of future tech, like it or not – we all have to share the same space.

No Comments on Why are we spending so much time refuting?

MySQL CPU maxing out due to leap second, and AWS US-E1 outage June 30, 2012

Wow, US-EAST-1 has the worst luck doesn’t it?

I had CPU consumption alerts fire off for ALL of my AWS instances running Percona Server (MySQL).

I couldn’t for the life of me figure it out – I re-mounted all but the root EBS volumes, restarted the services and ensured there was no legitimate load on all of them, yet MySQL stuck to 100% consumption on a single core on all of the machines. I tried everything, lsof, netstat, show processlist, re-mounting the data partitions.

WIERD. It turns out however, AWS was not be the cause of this one, even in light of the recent AZ issue in EAST-1.

My issues started sharply at 00:00 UTC. Hmm, odd since we’re having a leap second today!

A quick touch to NTP and we’re as good as gold.

This server had many cores so it was still running fine (and it was the weekend) – Seems like bad things happen at the least opportune times, like when you’re out on a date with the wife and the baby is with grandma…

At any rate, it’s amusing to watch the haters hate that clearly don’t understand the concept of AWS and the AZ (Availability Zones). Funny to watch them all scoff and huff about how they run a better data center out of their camper than AWS.

If anything, I want to see a good root cause analysis and maybe some restitution for the trouble.

No Comments on MySQL CPU maxing out due to leap second, and AWS US-E1 outage
Categories: aws servers

MySQL COALESCE(), UNION behavior on zerofilled columns June 28, 2012

I’ve just filed a (potential) bug report, depending on your views for MySQL’s COALESCE() behavior:


If you have a zerofill column and perform the COALESCE() function on it, the leading zeros are truncated.

As I mention in the bug report, this may not matter to most – but it does change the output one would expect and is worthy of notice or a mention in the official documentation in my book.

Test scenario 1:

INSERT INTO coalesceTest (paddedNumber) VALUES (5), (10), (200), (3000);

SELECT COALESCE(paddedNumber), paddedNumber from coalesceTest;


Test scenario 2:



No Comments on MySQL COALESCE(), UNION behavior on zerofilled columns
Categories: mysql servers

On MySQL: The latest, far-reaching password circumvention June 11, 2012

By now, everyone has, or will be hearing about this issue.

While it’s an extremely simple hack and covers (dare I say the majority) of MySQL installation version. Let’s not forget to finish reading the entire disclosure:

From the disclosure:

But practically it’s better than it looks – many MySQL/MariaDB builds are not affected by this bug.

Whether a particular build of MySQL or MariaDB is vulnerable, depends on
how and where it was built. A prerequisite is a memcmp() that can return
an arbitrary integer (outside of -128..127 range). To my knowledge gcc builtin memcmp is safe, BSD libc memcmp is safe. Linux glibc sse-optimized memcmp is not safe, but gcc usually uses the inlined builtin version.

As far as I know, official vendor MySQL and MariaDB binaries are not

Regardless, it’s a stupid simple test to see if you’re vulnerable or not so fire one up!

I just tested 5 gcc compiled hosts (mostly pre-5.5.23) and none of them were vulnerable. But regardless, maybe it’s time to re-compile ;)

No Comments on On MySQL: The latest, far-reaching password circumvention
Categories: mysql security servers

Mercurial (hg) checkstyle hook, at last! May 7, 2012

As far as I can tell, there’s not much in the lane of check style hooks for Mercurial.

There’s a lot of hits for git and SVN, but not much for Mercurial.

Check it out in my ‘hg-checkstyle-hook‘ bitbucket repo.

I thought I’d share my (imperfect) rendition of a Mercurial checkstyle hook. It’s meant to be setup for a pretxnchangegroup event.

Basically it does this:

  • Find what files have changed from the beginning of the changegroup to the tip
  • Copy those files to a staging directory in /tmp
  • Run PHPCS ( PHP_CodeSniffer, a PHP checkstyle command) on those files specifically
  • Provide a report on any violations resulting in a non 0 exit code.
  • The script should be configurable for any checkstyle command, as long as it takes a space delimited list of files at the end of it’s arguments.
No Comments on Mercurial (hg) checkstyle hook, at last!

MySQL 5.6 – InnoDB (innodb_file_per_table) and recovery May 1, 2012

All I can say is rejoice!.

There’s a lot of fluff out there that beat around the bush or contain a regurgitated process for recovery using the 5.6 LAB edition of MySQL.

So instead, here’s the info straight from the horses mouth: http://blogs.innodb.com/wp/2012/04/innodb-transportable-tablespaces/  . This will make a huge difference in the world of MySQL and InnoDB users.

No Comments on MySQL 5.6 – InnoDB (innodb_file_per_table) and recovery
Categories: mysql servers

Disable PHP 5.4’s built-in web server, while keeping CLI … February 6, 2012

Administrators: Don’t get blind-sided by PHP 5.4’s CLI web server!

I’ve gone over a similar issue like this before regarding the likes of git/hg. While those are developer tools and are less likely to be present on a production machine.

PHP 5.4 is jumping on the bandwagon to include a ‘cute’ little internal server – which is enabled by default.
The ‘everything needs a standalone server’ thing is starting to get on my security nerves feel silly.

It has limited use, and most developers will have limited use for it due to it’s lack of mod_rewrite (and equiv.) behavior … The worse part is: You can’t disable it if you want to keep cli (e.g.: no pear!)
Wish I spoke up on the list!

Anywho, here’s a hob-knobbed patch (for PHP 5.4.0RC6) that will change that for you.
(GNU/*nix only!) The patch adds a new configure option ‘–disable-cli-server’.

Download the patch here: patch-php5.4.0RC6-no-cli-server.diff
Place it in the PHP source base directory.

In the future I’ll plan on formalizing this patch and propose it in php.internals when I get a chance to make the windows part of the patch.


2 Comments on Disable PHP 5.4’s built-in web server, while keeping CLI …
Categories: linux php security servers

PHP Vulnerability – DJBX33A – Hash table collisions January 14, 2012

Trickling through my RSS feeds this morning was an article with quite the topic “PHP Vulnerability May Halt Millions of Servers“.

In a nutshell: A modest size POST to almost all PHP versions in the wild (Sans 5.3.9+) are in danger of an extremely simple DoS.

The vulnerability exploits the PHP internal hash table function (responsible for managing data structures) – more specifically: the technique used to ‘hash’ (generate a key for the hash table) the key for a key=>value relationship.

Here’s the informative part regarding PHP’s problem in the security advisory for this:

Apache has a built in limit of 8K max request length (that is, maximum size in request URL) by default.
Can the damage from an 8k request (this affects GET) – really cause the mentioned DDoS attack on reasonable hardware?

Additionally – PHP has a limiter on POST data too: max_post_size.
It’s this configuration variable in particular I think should be put in the limelight.

max_post_size is a run-time/htaccess configurable directive that maybe we don’t respect like we should.
Often, administrators (myself included) just tell php.ini to accept a large POST size to allow form based file uploads – It’s not uncommon to see:

– in almost any respectable setup.

Perhaps we should evaluate the underlying effects of this setting; maybe it should be something stupidly low by default (enough to allow a large WYSIWYG CMS article’s HTML and a bit more? 32K?) – and then delegate a higher limit using Apache configuration.

Caveat: these settings are PER DIR meaning:

  • .htaccess use is limited, you can’t set the php_value in a .htaccess with a URL match – you’re stuck using a context sensitive .htaccess (within a dir) or use thedirective – this won’t work for people using front controllers through a single file on their websites/apps.
  • Modifying the actual vhost/host configuration is a sound bet – you can do Location/File matching and set these at will; for situated web apps, this may be a feasible decision to take whitelist or blacklist approach on uploader destinations.

More resources:

  • Here’s the video that thoroughly covers the vulnerability – I’ve shortcut it to their recommended mitigation (outside of polymorphic hashing):
  • A full blown rundown, including proof of concept (USE AT YOUR OWN RISK!)
  • A string of hash collisions targeting DJBX33A for vuln testing (PS: Firefox seems to struggle with this in a GET format, Chrome doesn’t, odd!)
2 Comments on PHP Vulnerability – DJBX33A – Hash table collisions
Categories: php security servers

If you’re not off of Godaddy yet … December 23, 2011

You should be. The Godaddy girls are stupid. The commercials are worse. Bob Parsons is kinda creepy (not just the elephant thing). The ads are terrible. The site is terrible.

Do you need another excuse to move your registrar needs to another company such as Gandi or Namecheap?

You need another excuse? here it is.

You should know what SOPA  * is about between the lines. (Job growth? Puh-leese, the job growth from the .com boom didn’t need SOPA thank you very much!)

Over the last few months I’ve moved a dozen domains off of Godaddy on to others (client’s discretion).
If you’re still on the fence, there’s a pretty good run down of good alternative registrars on this blog post.

ICANN also has a full (but impersonal) list of accredited registrars as well.

PS: Namecheap has a coupon code for a little bit off “SOPASUCKS”.

2 Comments on If you’re not off of Godaddy yet …

ab – Apache Bench, understanding and getting tangible results. December 3, 2011

Apache Bench (AB) is a very powerful tool when used right. I use it as a guideline for how to set up my apache2/httpd.conf files.
All too often I see people boasting that they can get an outrageous number of RPS in AB (the Apace Bench tool).

“OMG, I totally get 3,000 rps in an AWS micro instance!!!” (I’ve seen this on the likes of Serverfault)

Debunking misunderstandings:
Concurrency != users on site at same time
Requests per second != users on site at same time

Apache Bench is meant to give a ‘feel of pants’ diagnostic for the page/s you point it to. 

Every page on a website is different; and may require a different number of requests to load (resources: css, js, images, html, etc).  

Aspects of effective Apache benchmarking:

  1. Concurrent users
  2. Response time
  3. Requests

“Concurrent users” – Have you ever stopped to ask yourself: What the hell is a user? (in the Apache sense) We don’t stop to think about them individually, we just think about them as a ‘request’ or the ‘concurrent’ part of benchmarking.

When a user loads a page, their browser may start X connections at the same time to your server to load resources. This is a complex dynamic, because browsers have a built in limit of how many concurrent requests to make to a host at a given time (a hackable aspect). 

So at any given time, let’s say a user = 6 concurrent connections/requests.

“Response time” – What is an acceptable response time? What is a response? In the context of this article, it’s the round trip process involved with the transfer of a resource. This summarizes the intricacies of database queries and programming logic, etc into a measurable aspect for consideration in your benchmarking.

Is 300ms to load the HTML output of a Drupal website acceptable? 400? 500? 600? 700?

How fast does a static asset load? What is the typical size of an asset for your webpages? 10KiB? 20?

“Requests”Requests happen at the sub-second level, measured in milliseconds.
Let’s say the average page has 15 resources (images, js, css files, html data, etc) – aka: 15 requests per page

This means if a single user comes to load a page, there’s a good chance his/her browser will make 15 requests total.

Another part of this aspect is to be aware that the browser will perform these requests in a ‘trickle’ fashion, meaning one request to get the HTML, then an instant 6 requests (browser concurrency) but the next 8 will happen one at a time once concurrent connections free up.

Putting aspects together:
We have to draw an understanding of how these aspects all tie together to determine the start-to-finish load of a typical page.

Let’s say a page is 15 requests (14 images/js/css files and HTML) with an average payload of 10KB of data each.150KB of data.

A user/browser makes 6 simultaneous requests (All completing at slightly different times, ideally, at the same time).

Response time is the metric we’re interested in.

The questions we ask of ab are – given the current configuration and environmental conditions:

  • What’s the highest level of concurrency I can support loading a given asset in less than X milliseconds
  • How many requests per second can I support at that given level of concurrency

By attempting to answer these questions, we’ll derive the tipping point of the server – and what it’s bottlenecks are. (This article will not cover determining bottlenecks – just how to get meaning from ab)

Naturally these seat of pants numbers are generated on a machine on the same network with plenty of elbow room – making them best case scenarios – in today’s world it’s close enough with how widespread HSI is. It also assumes the network can handle the throughput of the transfers involved with a page. It also assumes all assets are hosted on the same host, nothing cross domain, e.g.: Google Analytics, Web Fonts, etc.

How to test 
First, we need to classify the load. As mentioned a few times, there’s two types of site data: static files, and generated files.

Static files have extremely fast response times,  generated files take longer because they’re performing logic, database work and other things.

A browser doesn’t know what to load until it retrieves the HTML file containing the references to other resources – this changes how we look at timings, and most cases, the HTML document is generated.

Let’s simulate a single user/browser loading a page from an idle server.
First, we must get the HTML data…

So you can see, this took 29ms, at 299KB/s to generate the HTML from Drupal and send it across the pipe.

Now, let’s simulate a browser at 6 concurrent connections loading 15 assets at 10KiB each.

So here you can see, each request completed at 35ms or less – The whole test took about 560 ms.

This means that under pristine conditions, the user would have the entire webpage loaded in 589ms.

Finding the limits…
So, looking at our numbers – it’s clear that 1 user is a piece of cake. 
It’s also clear that there’s two big considerations into acceptable timings:

  1. Time to get HTML (our 1 generated request)
  2. Time to get references from that HTML (our 15 static requests)

We’re going to multiply our numbers and start pushing the server harder – let’s say by 30 times:

So once again, let’s emulate 30 users making 30 requests to our Drupal page:

Every user got the HTML file in ~298ms or less.

Next up, static files – 30 users = 180 concurrent connections, 450 requests.

So, here’s where things start getting interesting – 2% of the requests were slowed down to 600+ms per request. The exact cause of that is out of the scope of this article – could be IO – regardless, the numbers are still good and it’s clear that this is indicating the start of a tipping point for the host.

Take the HTML load time – 298ms
Do some fuzzy math here: 265ms* 2.5 = 662ms (mean total)  
957ms average for entire page to load.

This is ‘fuzzy math’ because I’m not simulating the procedural/trickle loading effect of browsers mentioned above (initial burst of 6 connections, and making them busy as they become available at different times to complete the 15 requests). But instead treating it as 6 requests in sets. Someone more mathematically inspired might calculate better numbers – I use this method because it’s my “buffer factor” that I use for variables (bursts of activity, latency changes, etc).

So from this data, we can say the server can sustain 30 constant users given our assumptions.

Phase two:
So now we have a ballpark figure of where our server will tip over. We’re going to perform two simultaneous ab sessions to put this to the test, this will simulate both worlds of content: generating content and loading the assets. This is the ‘honing’ phase where we dial down/up our configuration by smaller increments. 

  • 5 minutes of load at our proposed 30 users
  • 30 concurrent connections for our generated content page.
  • 180 concurrent connections for static data.

Ding ding ding! We’ve got a tipper!
Ok, so as you can see, for the most part, everything was fairly responsive. If the system could not keep up, most of the percentiles will have the awful numbers like the ones shown in the highest 2%. However, overall these numbers should be improved upon to deal with sporadic spikes above our estimates, and other environmental factors that may have a swaying factor in our throughput. 

How to interpret:
Discard the 100% – longest request; at these volumes it’s irrelevant.
However, the 98-99% matter – your goal should be to make the 98% fall under your target response time for the request. The 99% shows us that we’ve found our happy limit. 

(Remember, at these levels, theres many many variables involved – and the server should never reach this limit, that’s why we’re finding it out!)

Let’s tone our bench down to supplement 25 users and see what happens …

Wrap up
25 simultaneous users may not sound like a lot, but imagine a classroom of 25 students – and they all click to your webpage at the exact same moment – this is the kind of load we’re talking about; to have every one of those machines (under correct conditions) load up the website within 1 second

To turn that into a real requests per second: 375. (25 users @ 15 requests).

The configuration – workload (website code, images) and hardware (2 1Ghz CPU’s…) are capable of performing this at a constant rate – these benchmarks indicate that the hardware (or configuration) should be changed before it gets to this point to supplement growth. These benchmarks indicate that ~430 pageloads out of 21,492 will take longer than a second to load. In reality, the ebbs and flows of request peaks and valleys make these less likely to happen.

As you can see, the static files are served very fast in comparison to the generated content from Drupal.
If this Apache instance was backed by the likes of Varnish, the host would be revitalized to handle quite a bit more (depending on cache retention).

Testing hardware

  • 1x AWS EC2 Large instance on EBS – Apache host
    • 2 virtual cores with 2 EC2 Compute Units each
      • AKA:  (2) 2Ghz Xeon equiv
    • 7GB RAM
  • 1x AWS EC2 4X Large CPU – AB runner
    • 8 virtual cores with 3.25 EC2 Compute Units each 
      • AKA: (8) 3.25Ghz Xeon Equiv
    •  68GB RAM
5 Comments on ab – Apache Bench, understanding and getting tangible results.