The System Is Down? An Example of How to Handle Failure
Our online project management service (Basecamp) had a hosting issue recently and went down for about two hours. This is how they handled it.
37signals System Status
Basecamp, Backpack, Highrise, Campfire, Writeboard, Ta-da List, and our blogs are all offline.
UPDATE: The load balancer has been swapped and is currently being configured. We should be in the home stretch now. Again, we're incredibly sorry for this disruption. This is not how Fridays are supposed to be.
— 11:18am CST (17:18 GMT) on January 18, 2008
UPDATE: The technicians at our service provider are still working on installing the new load balancer. We're breathing down their neck as heavily as we can. And we profusely apologize for this unacceptable interruption of service.
— 10:56am CST (16:56 GMT) on January 18, 2008
UPDATE: We have located the problem to be with the load balancer setup. A new unit is being installed. We should be back shortly. Again, we're terribly sorry for this disruption of service.
— 10:28am CST (16:28 GMT) on January 18, 2008
All systems are currently offline as we're experiencing network outage from our provider. We're working on it right now. No data has been lost, all our machines are still working, but they're not accessible from the internet. Sorry for the inconvenience.
— 10:03am CST (16:03 GMT) on January 18, 2008
There are great lessons in this -- the first is humor and reality. Reading the updates, I realize these are real people who care about the issue and are trying to fix it.
The second is information -- because they let us know what's going on, we're able to calm down about the inconvenience.
The third is humility -- they don't try to blame anyone; they recognize they failed, and they're sorry for it.