Between your desktop and our servers, there are a lot of technical issues that can have a negative impact on your web site. What we would like to talk about in this article are interruptions or outages related to our servers or network: the things that happen on our end of the wire.

Most web hosts avoid discussing these kinds of things, which is understandable. No one wants to draw attention to an unpleasant aspect of something they are trying to sell to you. But whenever an outage occurs many of you will ask us, “What are you going to do to make sure this never happens again?” We think that’s a reasonable question, and one that deserves an honest and detailed answer.

There are essentially three different forms these outages take: maintenance and upgrades, server-specific problems, and provider problems or malicious attacks. What we can do to alleviate or prevent the problems varies depending on which type of outage we’re talking about.

Maintenance and upgrades

Since the WinHost platform is primarily made up of servers running the Windows operating system, a certain amount of downtime for maintenance and upgrades is unavoidable in order to maintain security and provide you with current technology.

We do planned maintenance (when necessary) on Wednesdays, and a general Windows update every month. There is also occasional unplanned maintenance, which is usually an update or fix for a security issue or a problem that is having an immediate negative effect on a group of servers, so the fix is made outside the normal maintenance window.

Server-specific problems

Our servers are consistent across the entire network, so for example, all mail servers have the same configuration, all SQL 2008 servers are the same, all SQL 2012 servers are the same, etc.

However, all of the users on the servers are different, and the number of users per server varies, so even though all Windows 2012 web servers have the same configuration, they can experience different problems.

Additionally, all servers run on hardware, and all hardware is susceptible to component failure. We use only top-of-the-line Dell servers, but no matter how much you pay for them, electronic and mechanical parts still fail. Using virtual servers (which we do in some cases) reduces the likelihood of mechanical failure to a certain extent, but virtual servers still run on physical machines.

So while server-specific problems are bound to happen occasionally, we do a few things to prevent unnecessary issues, such as extensive monitoring of the live servers (via giant monitors in the support and system administration offices, and immediate text messaging to all of the system administrators’ telephones), and controlling and balancing density, so that the servers always have roughly similar numbers of users and loads.

We also retire all hardware that is past a certain age, so your site, database or email will never be running on a 10-year-old server that everyone expects to fail at any minute.

When a specific server fails or is showing signs of stress, our administrators take action immediately, and make things right as quickly as they can.

I should take this opportunity to let you know that every member of the WinHost staff is local, and we all work out of offices in the same building here in Los Angeles. We do not employ remote staff or offshore third-party staff. Communication is easy and efficient in the event of a problem.

Provider problems or malicious attacks

These are usually the most severe of the issues we’re discussing, having an impact on the largest number of users. Unfortunately they are also typically the most difficult to deal with.

Provider problems

A provider problem usually means an issue with one of our Internet backbone connection providers. These are the companies (Internap and Savvis) that provide the connection from the routers in our data center to the Internet backbone.

It is not technically necessary to have multiple connections, but because these giant providers are not perfect (and they need to do maintenance and repairs sometimes too), we have multiple connections to prevent a complete network blackout due to provider issues. We split and balance traffic between the two, and most of the time, all is well.

Then there are the other times.

When one of our two connections goes down unexpectedly, we recalibrate and direct traffic to the connection that is working and do everything we can to ensure that all incoming traffic can be accommodated.

The problem is, when a backbone connection goes down, some traffic is not going to route around it properly. It’s certainly supposed to. The Internet was built on the theory that traffic would easily and automatically route itself around unresponsive nodes. But in actual practice, things don’t always work the way they do in theory.

Long story short, if your connection doesn’t route around the dead backbone connection, there won’t be anything we can do on our end to remedy the situation.

That doesn’t happen often and it doesn’t affect everyone. But it will inevitably affect some of you, and if it does happen, from your perspective everything will be down. Everything may in fact be up, but anyone whose request is stopping at the dead backbone connection won’t be able to access our network.

I should also mention that it is theoretically possible for both providers to go down at the same time. The odds of that happening are very slim, but it is possible.

You might wonder, “Well, if that’s a possibility, why not have a dozen backbone connections?” and that’s a reasonable thing to wonder. But the cost of even one additional backbone connection would far outweigh the potential “freak-occurrence” benefit. Not to mention the fact that it wouldn’t improve your service on a day-to-day basis. We would just be sitting on top of (and paying for) a lot of idle, unused bandwidth.

Malicious attacks

Speaking of bandwidth, I saved DDoS for last.

DDoS stands for distributed denial of service. DDoS attacks are brute-force attacks that send so much data to a site or server that they effectively “knock it off line.”

The truth of the matter is we cannot prevent a large DDoS attack, because we can never know what might trigger one.

In the old days, a very large DDoS would throw hundreds of megabits of data at a site every second – or at the most, a gigabit (Gbit/s) – and that was usually enough to take it down. But providers got wise and started using DDoS mitigation services (like we do) that temporarily provide huge amounts of bandwidth which make a one or two Gbit/s DDoS ineffective.

But now, with exponentially larger and higher-bandwidth botnets, the attacks can be so large that we can’t even measure how much traffic the DDoS is sending. 10 Gbit/s hardware switches are saturated and paralyzed, and even the larger switches used by the backbone providers are swamped and slow to a crawl (which is how a recent DDoS on a WinHost customer slowed down traffic for part of Yahoo!).

Since the methods we used to deal with DDoS in the past are no longer effective, what we do now is try to determine the target site or sites and remove them from our network (by null routing the IP address), then wait for the DDoS to taper off after the site disappears.

In order to locate the target site, our system administrators coordinate with our upstream providers to get the necessary target IP information. Once they have that, they start going through that IP manually, site by site, looking for something that might attract a DDoS.

That is as unscientific as it sounds, and as a result, locating the target can take anywhere from 30 minutes to several hours. And unfortunately, even after we have identified the target, the outage will likely continue to affect a number of sites after the DDoS has ended, because the IP address those sites live on has temporarily been removed from our network.

Keeping you informed

I’ll wrap this up by saying that while it isn’t possible for us to prevent outages completely, we are always working on improving our communication when they do occur. Every outage is different (if they were all the same, this would be easy) and they each teach us something new.

But if you need quick information, and the outage is server-specific (and not generally crippling our network), your best bet is to check the forum for updates. The forum is the first line of public communication for the tech support staff, so it is typically the first place you’ll see information.

For outages that do affect our network, we post information on Twitter, Google+ and Facebook. We may not be able to respond to your questions or comments on social media while the outage is in progress, but we do our best to keep those sources updated.

We do not have a “social media team,” as some larger companies do, so despite our best efforts, you may see spotty updates on any given service during any given outage. That doesn’t mean we don’t care. It usually means the outage is happening at a time of day when we are not all in the office, so the people who are here are dealing with the outage and answering helpdesk tickets (and haven’t called to wake me up yet).

We’ll always do our best to keep you in the loop so you know what’s happening. If you have any suggestions you’d like us to consider, we encourage you to comment on this post and let us know.

 

We’ve made a few changes to the Control Panel that should make life easier for some of you power users out there.

First, if you have a large number of sites in an account, we have significantly sped up the Order New Site and Order New Domain Name functions. Now there is no need to go make a sandwich while those pages load. Though, if you think about it, it’s almost always a good time for a sandwich.

Next, if you’ve ever entered a DNS TXT record in Control Panel, you may have run into a 128-character limit. We have increased the TXT record limit to 512 characters (in line with the traditional 512-byte limit on DNS messages sent over UDP).

Finally, if you like to mess around with DNS records in general (and really, who doesn’t?), there may have been a time when you thought, “Well, that was fun, but I wish I could just dump all these cool experiments that have made my site redirect to altavista.com and somehow caused my email forward to the White House and just start over with a clean slate…” Well, now there is a Reset DNS button that does just what it claims to do – resets the DNS record for the site to our default settings. It’s cool, it’s powerful, and it will completely remove any customizations you’ve ever made, so use it carefully.

That’s it for now, but we’re always hard at work over here making the world a better place, so let us know if there is something we can do just for you.

 
 

Give thanks for updated applications! Here are the App Installer updates for November:

  • Acquia Drupal 7.23.25
  • DotNetNuke 7.1.2 Community Edition
  • Gallery Server Pro 3.0.3
  • MediaWiki 1.21.2
  • mojoPortal 2.3.9.9
  • Moodle 2.6
  • phpBB 3.0.12
  • SilverStripe CMS 3.1.1
  • Umbraco CMS 6.1.6
  • WordPress 3.7.1
 

Prompted by a forum post, here are instructions on how to install Elmah, an application-wide Error Logging Module and Handler for ASP.NET, on your hosting account here at WinHost.  First, you’ll need to download Elmah at this link.  For this tutorial, I downloaded ELMAH-1.2-sp2-bin-x64.zip.

After you have downloaded the .zip file, extract its contents.  Open the /bin directory and find the .NET Framework libraries you want to use.  Most likely you will be using the assemblies in the net-2.0 -> Release folder.  Upload only the Elmah assemblies (i.e. Elmah.dll, Elmah.pdb, and Elmah.xml) to the /bin folder of your web application.

You can configure Elmah to store the exception information in different types of databases, but for this tutorial, I will only be showing you how to set it up with Microsoft SQL Server.  If you don’t have a database set up already, follow these instructions to create one:

1) Log into the WinHost Control Panel at https://cp.winhost.com
2) Click on the Sites tab.
3) Click on the Manage link next to the site you want to manage.
4) Click on the MS SQL Manager button.
5) Click on the Add button.
6) Select the database version in the drop down list, name the database, set the quota, and then click on the Create button.

Then log into your database using SQL Server Management Studio.  Select File -> Open -> File… (or hit CTRL-O) and navigate to the /db directory of your Elmah extracted files.  Select the SQLServer.sql file and click on Open.

Hit F5 to execute the script.  This will create the error logging database objects in your database.  The final step is to configure your web.config file.  The configuration will depend on which Application Pool pipeline mode you use.  For Classic mode, add the following XML markup to your web.config file to enable Elmah, replacing WinHost_Database_Connection_String with your actual database connection string.

<configuration>
  <configSections>
    <sectionGroup name="elmah">
      <section name="security" requirePermission="false" type="Elmah.SecuritySectionHandler, Elmah" />
      <section name="errorLog" requirePermission="false" type="Elmah.ErrorLogSectionHandler, Elmah" />
      <section name="errorMail" requirePermission="false" type="Elmah.ErrorMailSectionHandler, Elmah" />
      <section name="errorFilter" requirePermission="false" type="Elmah.ErrorFilterSectionHandler, Elmah" />
    </sectionGroup>
  </configSections>
  <elmah>
    <security allowRemoteAccess="yes" />
    <errorLog type="Elmah.SqlErrorLog, Elmah" connectionStringName="Elmah" />
  </elmah>
  <connectionStrings>
    <clear />
    <add name="Elmah" connectionString="WinHost_Database_Connection_String" />
  </connectionStrings>
  <system.web>
    <httpHandlers>
      <add verb="POST,GET,HEAD" path="elmah.axd" type="Elmah.ErrorLogPageFactory, Elmah" />
    </httpHandlers>
    <httpModules>
      <add name="ErrorLog" type="Elmah.ErrorLogModule, Elmah" />
    </httpModules>
  </system.web>
</configuration>

For Integrated mode, use the following XML markup:

<configuration>
  <configSections>
    <sectionGroup name="elmah">
      <section name="security" requirePermission="false" type="Elmah.SecuritySectionHandler, Elmah" />
      <section name="errorLog" requirePermission="false" type="Elmah.ErrorLogSectionHandler, Elmah" />
      <section name="errorMail" requirePermission="false" type="Elmah.ErrorMailSectionHandler, Elmah" />
      <section name="errorFilter" requirePermission="false" type="Elmah.ErrorFilterSectionHandler, Elmah" />
    </sectionGroup>
  </configSections>
  <elmah>
    <security allowRemoteAccess="yes" />
    <errorLog type="Elmah.SqlErrorLog, Elmah" connectionStringName="Elmah" />
  </elmah>
  <connectionStrings>
    <clear />
    <add name="Elmah" connectionString="WinHost_Database_Connection_String" />
  </connectionStrings>
  <system.webServer>
    <handlers>
      <add name="Elmah" verb="POST,GET,HEAD" path="elmah.axd" type="Elmah.ErrorLogPageFactory, Elmah"/>
    </handlers>
    <modules>
      <add name="ErrorLog" type="Elmah.ErrorLogModule, Elmah" />
      <add name="ErrorMail" type="Elmah.ErrorMailModule, Elmah" />
    </modules>
  </system.webServer>
</configuration>

You can find your WinHost database connection string by following these steps:

1) Log into the WinHost Control Panel at https://cp.winhost.com
2) Click on the Sites tab.
3) Click on the Manage link next to the site you want to manage.
4) Click on the MS SQL Manager button.
5) Click on the Manage link next to the database you want to manage.

The connection string information will appear at the bottom.  Remember to replace the ****** in the password section with your actual database password.  If you have forgotten it, use the Edit link next to Database Password to change it.  If you need more details on other Elmah settings, refer to the sample web.config file in the /samples directory.
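
As a rough sketch, once the placeholder is filled in, the connectionStrings element might look something like the following.  The server, database, user and password values here are made-up placeholders (not real WinHost values), so copy the actual string from Control Panel rather than from this example:

```xml
<connectionStrings>
  <clear />
  <!-- Hypothetical values for illustration only; use the string shown in Control Panel -->
  <add name="Elmah" connectionString="Data Source=tcp:your_sql_server;Initial Catalog=your_database;User ID=your_db_user;Password=your_password;Integrated Security=False;" />
</connectionStrings>
```

The name attribute has to stay "Elmah", because that is what the connectionStringName attribute on the errorLog element refers to.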

That’s it!  Elmah is now configured to trap exceptions from your web application.  It also comes with some sample reporting pages in the /samples/Demo directory which you can upload to review the errors trapped.
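
One thing to keep in mind: with allowRemoteAccess set to "yes", the error log page (elmah.axd) can be requested from outside the server.  If your application uses ASP.NET authentication, a location element along these lines, placed inside the existing configuration element, is one common way to restrict who can view it.  This is only a sketch and not part of the setup above, and the role name "admin" is an assumption; substitute a role or user that actually exists in your application:

```xml
<!-- Sketch: goes inside the existing <configuration> element -->
<location path="elmah.axd">
  <system.web>
    <authorization>
      <!-- "admin" is a hypothetical role name; replace it with your own -->
      <allow roles="admin" />
      <!-- Everyone else is denied access to the error log page -->
      <deny users="*" />
    </authorization>
  </system.web>
</location>
```

The order matters here: ASP.NET evaluates authorization rules from top to bottom, so the allow rule has to come before the deny rule.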
