Why was the site down today?

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,927
Location
USA
Here is a follow-up on the event that occurred:

Reason for Outage Follow-up (8/10/11)


Dear Colo4 Customers,

Thank you for your patience and understanding with our equipment failure this week. We apologize for the disruption to your business and the stress and frustration that you experienced. As promised, we have compiled this Reason for Outage report as part of our after-action assessment.

What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the Colo4 facility at 3000 Irving Boulevard experienced an equipment failure with one of the automatic transfer switches (ATS) at service entrance #2, which supports some of our long-term customers. The damaged ATS would not pass either commercial or generator power, whether automatically or in bypass mode. To restore the power connection, a temporary replacement ATS had to be placed into service.

Colo4’s standard redundant power offering has commercial power backed up by diesel generator and UPS. Each of our six ATSs is tied to its own generator and service entrance. The five other ATSs and service entrances at the facility were unaffected.

The ATS failure at service entrance #2 affected customers with single-circuit connectivity (one power supply). Customers with redundant circuits (A/B dual power supplies) draw power through two different ATSs, so the B circuit automatically picked up the load. (A few customers with A/B power experienced initial downtime because a separate switch was connected to two PDUs on the same service entrance; their power was quickly restored.)
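For illustration, here is a minimal sketch of that failover logic, assuming a simplified model in which each power feed runs through a single ATS. The ATS names, customer labels, and topology are hypothetical placeholders, not Colo4's actual configuration.

```python
# Simplified model of the A/B power failover described above. All names and
# values here are hypothetical, not Colo4's actual topology.

FAILED_ATS = {"ATS-2"}  # stand-in for the failed ATS at service entrance #2

customers = {
    "single-circuit customer": ["ATS-2"],           # one power supply, one feed
    "dual-circuit customer":   ["ATS-2", "ATS-5"],  # A/B feeds on two different ATSs
}

def has_power(feeds, failed=FAILED_ATS):
    """A customer keeps power if at least one feed passes through a healthy ATS."""
    return any(ats not in failed for ats in feeds)

for name, feeds in customers.items():
    print(f"{name}: {'stays up' if has_power(feeds) else 'loses power'}")
# single-circuit customer: loses power
# dual-circuit customer: stays up
```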

Response Actions: As soon as the incident occurred, we mobilized the appropriate professionals from our facility and extended team. Our on-site electrical contractors and technical team worked quickly with the general contractors and UPS contractors to assess the situation and determine the fastest course of action to bring customers back online.

As part of our protocol, we first conducted a thorough check of the affected ATS as well as the supporting PDU, UPS, transformer, generator, service entrance, HVAC, and electrical distribution. We determined that all other equipment was functioning properly and that the failure was limited to the ATS device. This step was important to ensure that the problem did not affect other equipment and would not replicate at other service entrances.

We further determined that the ATS would need extensive repairs and that the best course for our customers was to install a temporary ATS. Because the ATS changeover involved high-voltage power, it was important that we move cautiously and deliberately to ensure the safety of our employees, contractors and customers in the building, as well as our customers’ equipment. Safely bringing the new unit online was our top priority.

After the temporary ATS was installed and tested, the team brought up the HVAC, UPS and PDU individually to ensure that there was no damage to those devices. Then, the team restored power to customer equipment. Power was restored as of 6:31PM CDT.

The UPSs were placed in bypass mode on the diesel generator to allow the batteries to fully charge. The transition from diesel generator to commercial power occurred at 9:00PM CDT with no customer impact.

Colo4 technicians worked with customers to bring back online any equipment that did not come up when power was restored and to help reset devices whose breakers tripped during the restoration. This process continued throughout the evening.

Assessment: As part of our after-action assessment, the Colo4 management team has debriefed with the entire on-site technical team and the electrical contractors, as well as the equipment manufacturer, UPS contractors and general contractors, to assess the ATS failure. While an ATS failure is rare, it is even rarer for an ATS to fail in a way that also prevents it from going into bypass mode.

While the ATS could be repaired, we decided to order a new replacement ATS. This is certainly the more expensive option, but it is the option that provides the best long-term stability for our customers.

Lessons Learned: Thankfully, we have experienced few issues during our 11 years in business, though any issue is one too many. As part of our after-action review, we have made additional improvements to our existing emergency/disaster recovery plans.

Our technical team and our HVAC, electrical and general contractors brought exceptionally fast, sophisticated thinking and action to getting our customers back in business as quickly as possible. Working with power of that size and scale is complex at any time, and especially under pressure; handling it safely reflects the knowledge and resolve these individuals have. Thank you to the technical team and all our contractors for a job well done in safely restoring power for our customers.

As part of the debrief, all Colo4 network gear in both facilities was checked to confirm that all equipment is on redundant power and connected properly.

Unfortunately, we weren’t well prepared on the customer service side. Our customers were stressed and needed more frequent updates from us along the way. We very much wanted to provide you with an ETA earlier. Due to the extent and complexity of the failure, we were unable to provide a proper ETA quickly and did not want to send out false information or set the wrong expectation.

For any future scenarios, we plan to provide process updates along the way, even if we are unable to provide an exact ETA at that moment. We hope this will give you insight into the assessment work that is underway during those periods.

We will continue to send direct emails to affected customers and post status updates on our website. Because the website received heavy traffic during the incident, we are upgrading the web server to better handle requests. Based on our web server stats for the past year, the server had excellent capacity, but in this case we experienced a much heavier load from our customers and our customers’ customers. We will also move some equipment to secondary offsite locations.

We’ve also set up a Twitter account @colo4 to post future updates and more timely responses. As you may have noticed, we began using Twitter during that afternoon.

Next Steps: Once we receive and test the new ATS, we will schedule a maintenance window to replace the equipment. We will provide at least three days’ advance notice, along with timelines, to minimize any disruption.


Thank you again for your patience and understanding. We take our relationships very seriously and realize that you rely on us to keep your business online. We’re sorry that our equipment failure caused challenges this week.

Please let us know if you have any questions or need assistance.

Sincerely,


Paul Estes, CEO
Paul VanMeter, CTO
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,729
Location
Horsens, Denmark
The switch to bypass on ours is mechanical (the generator/commercial transfer is electronic and automatic). Ours is obviously several orders of magnitude smaller, but a mechanical bypass struck me as the most reliable way to do it.
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,729
Location
Horsens, Denmark
We have a fairly large line-interactive UPS that sits between the generator/commercial power and our equipment. Typically it just runs the server room and a single outlet in each office dedicated to the computers, but it is capable of running the whole building (except the elevator, provided everyone turns off their damned space heaters) for at least 60 minutes. I've managed to reduce the server load so much through virtualization that the runtime on just the servers is colossal, even though the generator is programmed to start after just 30 seconds of downtime.
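To put rough numbers on that, a back-of-the-envelope runtime estimate looks like the sketch below. The battery capacity, load figures, and efficiency are made-up placeholders, not the actual installation; the point is just how strongly a lighter server load stretches runtime.

```python
# Rough UPS runtime estimate: usable battery energy divided by load,
# derated for inverter losses. All figures are hypothetical placeholders.

battery_wh = 10_000        # hypothetical usable battery energy in watt-hours
servers_w = 1_500          # hypothetical server-room load after virtualization
whole_building_w = 9_000   # hypothetical whole-building load (minus elevator)

def runtime_minutes(load_w, capacity_wh=battery_wh, efficiency=0.9):
    """Approximate runtime in minutes for a given steady load in watts."""
    return capacity_wh * efficiency / load_w * 60

print(f"servers only:   {runtime_minutes(servers_w):.0f} min")
print(f"whole building: {runtime_minutes(whole_building_w):.0f} min")
# Either way, the runtime is far longer than the ~30 seconds the
# generator needs before it starts and picks up the load.
```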
 