Three kinds of outages and what we can do about them

by Michael December 16th, 2013

From your desktop to our servers there are a lot of technical issues that can potentially have a negative impact on your web site. What we would like to talk about in this article are interruptions or outages related to our servers or network. The things that happen on our end of the wire.

Most web hosts avoid discussing these kinds of things, which is understandable. No one wants to draw attention to an unpleasant aspect of something they are trying to sell to you. But whenever an outage occurs many of you will ask us, “What are you going to do to make sure this never happens again?” We think that’s a reasonable question, and one that deserves an honest and detailed answer.

There are essentially three different forms these outages take: maintenance and upgrades, server-specific problems, and provider problems or malicious attacks. What we can do to alleviate or prevent the problems varies depending on which type of outage we’re talking about.

Maintenance and upgrades

Since the Winhost platform is primarily made up of servers running the Windows operating system, a certain amount of downtime for maintenance and upgrades is unavoidable in order to maintain security and provide you with current technology.

We do planned maintenance (when necessary) on Wednesdays, and a general Windows update every month. There is also occasional unplanned maintenance, which is usually an update or fix for a security issue or a problem that is having an immediate negative effect on a group of servers, so the fix is made outside the normal maintenance window.

Server-specific problems

Our servers are consistent across the entire network. For example, all mail servers have the same configuration, all SQL 2008 servers are the same, all SQL 2012 servers are the same, etc.

However, all of the users on the servers are different, and the number of users per server varies, so even though all Windows 2012 web servers have the same configuration, they can experience different problems.

Additionally, all servers run on hardware, and all hardware is susceptible to component failure. We use only top-of-the-line Dell servers, but no matter how much you pay for them, electronic and mechanical parts still fail. Using virtual servers (which we do in some cases) reduces the likelihood of mechanical failure to a certain extent, but virtual servers still run on physical machines.

So while server-specific problems are bound to happen occasionally, we do a few things to prevent unnecessary issues. We monitor the live servers extensively (via giant monitors in the support and system administration offices, and immediate text messaging to all of the system administrators’ telephones), and we control and balance density, so that the servers always have roughly similar numbers of users and loads.
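To give a rough idea of what balancing density means, here is a minimal sketch in Python. It assumes the simplest possible rule: each new account goes to whichever server currently has the fewest users. The server names and user counts are made up for illustration, and this is not a description of our actual provisioning tools.

```python
def pick_server(user_counts):
    """Return the name of the server that currently has the fewest users."""
    return min(user_counts, key=user_counts.get)


def place_users(user_counts, new_users):
    """Place new users one at a time, always on the least-populated server.

    Returns a new dict and leaves the input untouched.
    """
    counts = dict(user_counts)
    for _ in range(new_users):
        counts[pick_server(counts)] += 1
    return counts


# Hypothetical starting point: three web servers with uneven density.
before = {"web01": 120, "web02": 95, "web03": 110}
after = place_users(before, 30)
print(after)
```

A real system would also weigh load (CPU, memory, disk) and not just head count, but the principle is the same: new accounts go where there is the most room.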

We also retire all hardware that is past a certain age. So your site, database or email will never be running on a 10-year-old server that everyone expects to fail at any minute.

When a specific server fails or is showing signs of stress, our administrators take action immediately, and make things right as quickly as they can.

I should take this opportunity to let you know that every member of the Winhost staff is local, and we all work out of offices in the same building here in Los Angeles. We do not employ remote staff or offshore third-party staff. Communication is easy and efficient in the event of a problem.

Provider problems or malicious attacks

These are usually the most severe of the issues we’re discussing, having an impact on the largest number of users. Unfortunately they are also typically the most difficult to deal with.

Provider problems

A provider problem usually means an issue with one of our Internet backbone connection providers. These are the companies (Internap and Savvis) that provide the connection from the routers in our data center to the Internet backbone.

It is not technically necessary to have multiple connections, but because these giant providers are not perfect (and they need to do maintenance and repairs sometimes too), we have multiple connections to prevent a complete network blackout due to provider issues. We split and balance traffic between the two, and most of the time, all is well.

Then there are the other times.

When one of our two connections goes down unexpectedly, we recalibrate and direct traffic to the connection that is working and do everything we can to ensure that all incoming traffic can be accommodated.
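As a simplified illustration of that split-and-failover idea, here is a small Python sketch. The provider names, the even weights and the `healthy` flag are assumptions made for the example. Real routers make these decisions with routing protocols, not a script like this, and the sketch only models what happens on our side of the connection.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    weight: int           # relative share of traffic when healthy (assumed value)
    healthy: bool = True


def traffic_shares(uplinks):
    """Return each uplink's fraction of traffic, counting only healthy uplinks."""
    total = sum(u.weight for u in uplinks if u.healthy)
    if total == 0:
        # Every uplink is down: nothing can carry traffic.
        return {u.name: 0.0 for u in uplinks}
    return {u.name: (u.weight / total if u.healthy else 0.0) for u in uplinks}


uplinks = [Uplink("provider_a", 1), Uplink("provider_b", 1)]
print(traffic_shares(uplinks))   # traffic split evenly across both

uplinks[0].healthy = False       # one connection goes down
print(traffic_shares(uplinks))   # the surviving connection carries everything
```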

The problem is, when a backbone connection goes down, some traffic is not going to route around it properly. It’s certainly supposed to. The Internet was built on the theory that traffic would easily and automatically route itself around unresponsive nodes. But in actual practice, things don’t always work the way they do in theory.

Long story short, if your connection doesn’t route around the dead backbone connection, there won’t be anything we can do on our end to remedy the situation.

That doesn’t happen often and it doesn’t affect everyone. But it will inevitably affect some of you, and if it does happen, from your perspective everything will be down. Everything may in fact be up, but anyone whose request is stopping at the dead backbone connection won’t be able to access our network.

I should also mention that it is theoretically possible for both providers to go down at the same time. The odds of that happening are very slim, but it is possible.

You might wonder, “Well, if that’s a possibility, why not have a dozen backbone connections?” and that’s a reasonable thing to wonder. But the cost of even one additional backbone connection would far outweigh the potential “freak-occurrence” benefit. Not to mention the fact that it wouldn’t improve your service on a day-to-day basis. We would just be sitting on top of (and paying for) a lot of idle, unused bandwidth.

Malicious attacks

Speaking of bandwidth, I saved DDoS for last.

DDoS stands for distributed denial of service. These are brute force attacks that send so much data to a site or server that they effectively “knock it off line.”

The truth of the matter is we cannot prevent a large DDoS attack because we can never know what might trigger a large DDoS attack.

In the old days, a very large DDoS would throw hundreds of megabits of data at a site every second – or at the most, a gigabit (Gbit/s) – and that was usually enough to take it down. But providers got wise and started using DDoS mitigation services (like we do) that temporarily provide huge amounts of bandwidth which make a one or two Gbit/s DDoS ineffective.

But now with exponentially larger and higher-bandwidth botnets, the attacks can be so large that we can’t even measure how much traffic the DDoS is sending. 10 Gbit/s hardware switches are saturated and paralyzed, and even larger switches that are used by the backbone providers are swamped and slow to a crawl (which is how a recent DDoS on a Winhost customer slowed down traffic for part of Yahoo!).

Since the methods we used to deal with DDoS in the past are no longer effective, what we do now is try to determine the target site or sites and remove them from our network (by null routing the IP address), then wait for the DDoS to taper off after the site disappears.

In order to locate the target site, our system administrators coordinate with our upstream providers to get the necessary target IP information. Once they have that, they start going through that IP manually, site by site, looking for something that might attract a DDoS.

That is as unscientific as it sounds, and as a result, locating the target can take anywhere from 30 minutes to several hours. And unfortunately, even after we have identified the target and the DDoS has ended, chances are a number of sites will still be affected, because the IP address those sites live on has temporarily been removed from our network.
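To show why null routing one IP address affects more than just the target, here is a small Python sketch. The addresses, the site names and the `SITES_BY_IP` table are all hypothetical, and an actual null route is configured on routers rather than in a script. The point is only that every site sharing the address disappears along with the target.

```python
import ipaddress

# Hypothetical shared-hosting layout: several sites live on one IP address.
SITES_BY_IP = {
    "203.0.113.10": ["site-a.example", "site-b.example", "target.example"],
    "203.0.113.11": ["site-c.example"],
}

# Addresses whose traffic is currently being discarded.
null_routes = set()


def null_route(ip):
    """Start discarding all traffic addressed to this IP."""
    null_routes.add(ipaddress.ip_address(ip))


def is_reachable(ip):
    """True if traffic to this IP is not being discarded."""
    return ipaddress.ip_address(ip) not in null_routes


def affected_sites():
    """List every site that lives on a null-routed IP, target or not."""
    return sorted(
        site
        for ip, sites in SITES_BY_IP.items()
        if not is_reachable(ip)
        for site in sites
    )


null_route("203.0.113.10")   # remove the targeted address from the network
print(affected_sites())      # the target and its neighbours on the same IP
```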

Keeping you informed

I’ll wrap this up by saying that while it isn’t possible for us to prevent outages completely, we are always working on improving our communication when they do occur. Every outage is different (if they were all the same, this would be easy) and they each teach us something new.

But if you need quick information, and the outage is server-specific (and not generally crippling our network), your best bet is to check the forum for updates. The forum is the first line of public communication for the tech support staff, so it is typically the first place you’ll see information.

For outages that do affect our network, we post information on Twitter, Google+ and Facebook. We may not be able to respond to your questions or comments on social media while the outage is in progress, but we do our best to keep those sources updated.

We do not have a “social media team,” as some larger companies do, so despite our best efforts, you may see spotty updates on any given service during any given outage. That doesn’t mean we don’t care. It usually means the outage is happening at a time of day when we are not all in the office, so the people who are here are dealing with the outage and answering helpdesk tickets (and haven’t called to wake me up yet).

We’ll always do our best to keep you in the loop so you know what’s happening. If you have any suggestions you’d like us to consider, we encourage you to comment on this post and let us know.
