
View Full Version : Server outages this weekend -- What happened and what the plan is to prevent them in the future



Tned
11-10-2008, 10:12 PM
Ok, as most of you have found, we have been very unstable since late night Friday. Here's the deal.

First some tech talk (skip to next Bolded section if you don't care about the tech talk ;)):

BroncosForums.com has been hosted on a high-end VPS. A VPS, or Virtual Private Server, is a virtual server. This means the hosting company puts together very high-end machines, such as dual quad-core CPUs, 32 GB of RAM, and 15,000 RPM SAS drives in a RAID 10 array. In other words, very FAST, high-end servers. Then they run virtualization software (similar to the VMware some of you might be familiar with in the corporate world) that lets them create virtual servers with dedicated amounts of RAM and CPU allocation based on how much RAM you purchase.

A high-end VPS, like the one BroncosForums runs on, costs about the same as a low- to mid-tier dedicated server, but when properly set up it can provide faster performance than an equivalently priced dedicated server, because of the underlying fast disk arrays and high-performance hardware.

The one flaw in the current VPS hosting world is that I/O (disk access) is not allocated like CPU time, so one server that gets hacked, or runs a bad script/program, can flood the disk array and slow down disk access for everyone. That was not the problem this time, but back in late May we were having some problems in this regard, and the host worked with me and moved me to a slightly slower but 'protected' hardware node shared only by this VPS, one other customer (that never abused I/O), and the hosting company's corporate and support sites.

Since that switch, most SQL errors (usually caused by I/O overloads from another VPS) disappeared, and slowdowns were rare, except at night when I was running backups or another node was. For instance, you might have noticed that at 12:15am, 6:15am, 12:15pm and 6:15pm the site can appear to stop responding for anywhere from 30 seconds to 2 minutes. That is because I back up the SQL data every six hours, then do a full site backup every night (around 2:00am) and move it to a second, backup VPS, before finally copying those backups to my PC at home, so that I always have current backups in three locations.
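For the curious, that kind of schedule boils down to a few crontab entries. This is only an illustrative sketch; the script names and paths are hypothetical, not the actual setup:

```shell
# m  h          dom mon dow  command
15   0,6,12,18  *   *   *    /root/bin/backup_sql.sh     # SQL dump every six hours at :15
0    2          *   *   *    /root/bin/backup_full.sh    # nightly full-site backup
30   2          *   *   *    rsync -az /backups/ backupvps:/backups/  # push to the backup VPS
```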

So, what went wrong (still tech talk, but less techie, so if you want you can skip to next bolded section):

I posted a message mid-week about how there could be a brief server outage shortly after the conclusion of the Thursday night game, because the host was performing a hardware upgrade on all of their servers. We actually had no downtime on Thursday, but one of the main servers, on which the host was running 30+ VPSs, failed to come up after the hardware upgrade. So they decided to take the 'protected' node we were located on, one other VPS, and their own sites, and move them to a different production server; then they would upgrade the 'protected' node with two faster quad-core processors, 32 GB of RAM and more hard drives, so that they could migrate customers from the downed server onto the hardware we had been on.

I was assured that moving us off of the original server, and then eventually back to it, would be seamless, and that we would probably never even know it happened: at worst, a minute or so of downtime. Obviously, that wasn't the case. They made the initial move Friday night, and we were down for 30 minutes or so. Then, overnight (Friday night/Saturday morning), the server restarted a few times and we started routinely getting database errors. Saturday was spotty, and then Sunday we were down most of the day (thankfully, the Broncos played Thursday night). Today, we have been down several times, MySQL has failed numerous times, and I have had to manually restart the VPS/server.

Where do we go from here:

While 'theoretically' this problem is now over, the server that failed is back up, and they are moving the customers back to it, I am done. The plan was to move us back to the 'protected' hardware node, similar to how it was before, but now also use it as a 'backup' machine for when they have problems. Meaning that while things might go back to normal, we could see these bumpy stretches again.

Therefore, I had three choices:


Continue with the same host, with the same high end VPS.
Move to another host providing high-end VPSs, but then we could be in the boat we were in back in the first half of the year: sometimes the server runs great, but if another VPS is hacked or floods the disk arrays with I/O, we get DB errors or other slowdowns.
Move to a mid-level dedicated server, switching to a non-managed (less support for server administration) solution in order to get more hardware for the same dollars (most VPS providers are managed, meaning they will complete most server administration tasks).


After spending a lot of time thinking about it Friday night (I stayed up until 3:30am reading reviews of various dedicated server providers and checking prices) and continuing to mull the options on Saturday, when Sunday's extended outages came I placed an order for a dedicated server.

Currently, I have the primary VPS plus a slightly less powerful 'backup' VPS, which I transfer backups of the forum to, use to host little sites like Totalbroncos.com, and keep standing by to switch over to in case there is ever an extended outage (days straight). Combined, these two VPSs cost around $140 a month (plus other costs like server monitoring to page me if the site goes down, plus annual costs like the chat room, the vBulletin license, etc.).

In order to move to a mid-level dedicated server, I will have to eliminate the main and backup VPSs, and the new dedicated server will cost just under $200 a month. So, a large jump, but no other VPS/site/server outage should impact us; only problems that crop up on our own server.

As we speak, the new dedicated server is being built, and hopefully it will be online sometime tonight. Once that occurs, I need to configure it, load a backup copy of the forums, and then run a stress-testing program against it to simulate hundreds of users hitting it at once, to make sure it is at least as fast as the current VPS when the VPS is running well. This will also give me a feel for how many concurrent users the server will support, so I know whether I bought the right level of server, or have too much or too little. I expect it to easily handle our current number of active users, and hopefully double our size, but the stress test will allow me to confirm that.

The program I have can simulate up to 2,000 users hitting the server at one time, so I will be able to get a good feel for its capabilities.
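The core idea of that kind of load tester can be sketched in a few lines of Python. This is only an illustration, not the actual tool; the dummy 1 ms function stands in for a real HTTP GET against the forum:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request_fn, concurrency, total_requests):
    """Run total_requests calls of request_fn across `concurrency` worker
    threads; return total elapsed seconds and per-request latencies."""
    def timed(_):
        t0 = time.perf_counter()
        request_fn()  # one simulated page hit
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, range(total_requests)))
    return time.perf_counter() - start, latencies

# Dummy 1 ms "request" standing in for fetching a forum page:
elapsed, lats = load_test(lambda: time.sleep(0.001),
                          concurrency=50, total_requests=200)
print(f"{len(lats)} requests in {elapsed:.2f}s, "
      f"worst latency {max(lats) * 1000:.1f} ms")
```

A real run would swap the lambda for an actual page fetch against the new server's IP, then ramp concurrency up toward the 2,000-user level to find where response times start to degrade.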

The migration process

Last night, I adjusted the DNS TTL (time-to-live) setting, which tells your ISP how often to check back for the current server IP address; I lowered it to 15 minutes. So each time you connect to BF right now, if it has been over 15 minutes, your ISP should check whether the BroncosForums.com server IP address has changed. This setting will take a day or two to reach all of your ISPs, but once it does, it should make the move to the new server much more seamless.
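In a BIND-style zone file, that change is a single line; this is just a hypothetical sketch (the host's control panel does the same thing, and the IP is a placeholder):

```
$TTL 900                      ; time-to-live: 900 seconds = 15 minutes
www  IN  A  xxx.xxx.xxx.xxx   ; the server's IP (placeholder)
```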

Sometime Wednesday or Thursday, I will shut down this message board, and you will see a message about the transfer in progress. I would expect this to last 30-60 minutes or so. Then I will bring the message board online on the new server and change all the DNS information. At that point, I will change the message to say that BroncosForums.com has moved to a new server, that if you are seeing this message your ISP does not have the new server information, and that you can reach the board on a temporary basis by going to http://xxx.xxx.xxx.xxx (that will be the new IP address).

At some point, from 15 minutes to 24 hours or so in the US, and from 1-5 days in Europe (where ISPs don't update DNS info as quickly), you will no longer get the message or have to use the temporary IP address to access the message board. You will know everything is back to normal for you when typing www.broncosforums.com brings you to the new server, rather than to the message telling you the server has moved.

Bottom line: assuming no more outages between now and the move, the message board should only be offline for 30-60 minutes. Many of you will never even know the site changed servers; some of you might have to use the IP address on a temporary basis until your ISP's DNS information is updated and BroncosForums.com starts pointing to the new server again.

Sorry for the downtime, but I am working hard to have things stabilized and be on the new server well before the game on Sunday.

T

WhoDey
11-10-2008, 10:19 PM

I went to the temporary IP address, but I got a bunch of porn! :eek:

underrated29
11-10-2008, 10:27 PM
Sooo.

What exactly did you do?



Sorry i am a lazy ass and didnt want to read through all vps stuff.

WhoDey
11-10-2008, 10:27 PM
They did some stuff, and then some things happened. The end.

(Thanks, Tned!)

Reidman
11-11-2008, 12:24 AM
Shouldn't have to worry about 2000 hits at one time this season...:D


Thanks Tned, you ROCK!

Tned
11-11-2008, 12:29 AM
Shouldn't have to worry about 2000 hits at one time this season...:D


Thanks Tned, you ROCK!

lol, probably not. I doubt the server would support 2,000 concurrent users, but the testing app I am using can stress test it to about that level. So, I'll run it up to that level and see where it chokes, which will also stress test the CPU/RAM/HDDs.

Lonestar
11-11-2008, 02:59 AM
lol, probably not. I doubt the server would support 2,000 concurrent users, but the testing app I am using can stress test it to about that level. So, I'll run it up to that level and see where it chokes, which will also stress test the CPU/RAM/HDDs.

make sure you know CPR before you do it, or have an EMT crew nearby.. :laugh:

Tned
11-11-2008, 05:45 AM
make sure you know CPR before you do it, or have an EMT crew nearby.. :laugh:

Yea, that's for sure:

UPDATE:

I received a page a little while ago from the monitoring company, because BroncosForums was once again briefly down. While I was up checking on the message board, I saw that the new server is built. So, I'm going through some initial setup before going back to sleep for a couple hours.

I'll keep you up to date as things progress.

OB
11-11-2008, 10:34 AM
Damn it im outraged - what kind of service is this - i pay good money to have access to this site 24/7 and if i tried i couldnt get on - appalling - just appalling :coffee:









Oh wait thats right - its free :D and I didnt even notice that it went down - tehehehehe

Seriously did anyone complain :confused:

God i hope not :tsk:

eessydo
11-11-2008, 11:06 AM
I would recommend Openfiler (http://www.openfiler.com) (a free open-source NAS or SAN) for the home backups instead of your PC. It runs on Conary-based Linux, is very stable, and you can chunk together a bunch of old hardware to make it run. It supports rsync and can run on VMware ESXi (now free) if you feel the need to virtualize your home environment. It also supports remote replication over a WAN.

I bought four 1 TB "green" SATA drives in a hardware RAID 5 configuration with hot swap (it supports a software RAID 5 config if you want). 2 TB of redundant storage for pretty much the cost of the drives ($150 apiece).
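A side note on the arithmetic: RAID 5 usable capacity is (number of drives - 1) x drive size, so four active 1 TB drives would give 3 TB; the 2 TB figure works out if one of the four is kept as a hot spare, which is an assumption baked into this quick sketch:

```python
def raid5_usable_tb(drives, size_tb, hot_spares=0):
    """Usable RAID 5 capacity: one active drive's worth of space goes to
    parity, and hot spares sit idle until a drive fails."""
    active = drives - hot_spares
    return (active - 1) * size_tb

print(raid5_usable_tb(4, 1, hot_spares=1))  # three active drives -> 2 TB usable
print(raid5_usable_tb(4, 1))                # all four active -> 3 TB usable
```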

Used an old P4 board and box. Stable as hell, and it only costs about $5.36/month on average to run 24/7. It has all the tools you need and supports both file-level and block-level storage.