Tned
10-15-2010, 09:35 AM
I was working late last night and slept in. When I woke up, I found that the server was down. I'm still not sure why the server hung, but I know we had failures on multiple fronts that explain why it was down for so long.
I pay a support company quite a bit of money to immediately (within 10 minutes) log in and deal with any problems that crop up. Within a few minutes of the server going down, they found they couldn't connect to it, so they attempted to reboot it through the datacenter's control panel/software.
About three weeks ago, the datacenter where the server is located moved to new support software. I had sent the new login details to the support company, and had them try logging in and creating a support ticket with the datacenter, and that all worked. However, someone at the support company 'forgot' to overwrite the old login details with the new ones, and this morning the tech working the case didn't know about the new login info. So, they couldn't log in to reboot the server or put a ticket in with the hardware/datacenter people. They also didn't call me to alert me to the problem.
I have another company monitor the server, and if it's down for more than about 5 minutes (to eliminate false alarms from temporary network outages, router reboots, etc.) it sends me a text message. For some reason, my phone never received the "down" text message. This has never happened before.
So, while the server crashed and apparently needed a reboot (not sure exactly what happened yet), the only reason it was down for as long as it was is that there was a series of major failures on the support/alerting side.
Sorry for the problems; hopefully this is an isolated event.