The criticality of data centres continues to rise, leading to more industry-wide discussion about the meaning of ‘mission critical’ and how data centre owners can minimise the risk of downtime in their facilities.
“Mission critical is a broad term that refers to any activity, device, service or system whose failure or disruption will cause a failure in business operations. The impact of mission critical failures on business can have huge financial repercussions,” says Greaves.
In 2009, PayPal was hit by a network hardware failure in one of its data centres, which resulted in approximately an hour of downtime during which millions of merchants were unable to process online transactions. According to website monitoring company Pingdom, the outage cost PayPal’s customers between USD 7 million and USD 32 million. In 2012, a data centre outage crashed Virgin Airlines, Tiger Airlines and Jetstar Airlines’ check-in system, resulting in major delays at Australian airports and unhappy customers taking to microblogging service Twitter to voice their complaints about the affected airlines.
According to research by Emerson Network Power and Ponemon Institute, the average cost of a data centre outage is USD 7 900 per minute and the average length of an outage is 86 minutes, making the average cost of a single event USD 679 400. Almost half (48%) of data centre outages are caused by human error.
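The per-event figure follows directly from the two averages quoted above. As a minimal sketch, the arithmetic can be reproduced in a few lines of Python; the function and constant names are illustrative and are not taken from the research itself.

```python
# Averages quoted in the article (Emerson Network Power / Ponemon Institute).
COST_PER_MINUTE_USD = 7_900
AVERAGE_OUTAGE_MINUTES = 86


def outage_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Estimate the cost of an outage as duration multiplied by cost per minute."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    # 86 minutes x USD 7 900 per minute = USD 679 400
    print(outage_cost(AVERAGE_OUTAGE_MINUTES))
```

The same function gives a rough estimate for any other outage length, on the assumption that the cost per minute stays constant over the event.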
“As you can see from the above statistics and examples, the financial implications and the discomfort caused to companies and their customers are huge, which has led data centre users, owners and operators to ask what can be done. While failure and loss are sometimes inevitable in such operations, there are many actions that companies can take to minimise risks to their mission critical data centres,” says Greaves.
In order to build a highly efficient data centre, a specialist build team needs to be appointed to determine design criteria that have to be customised for the specific data centre.
“Different data centres require different designs. A colocation facility, for example, will require a vastly different design than a data centre for a telecommunications company. Appointing a specialist build team can ensure that everything from equipment to floor layout is planned for so that the risk of downtime is minimised,” says Greaves.
Generators, uninterruptible power supplies (UPSs), cooling systems, servers and other equipment have to be maintained periodically and therefore need to be installed in such a way that maintenance can be carried out with ease. Any electrical or mechanical component that forms part of the facility’s infrastructure can fail, so building for concurrent maintainability helps prevent failures by giving easy access for ongoing maintenance.
“This fits in with hiring a specialist build team who, for instance, can ensure that there are backup versions of generators and UPSs, with the result that these components can be maintained or even replaced should either maintenance be necessary or a failure occur,” says Greaves.
Several design strategies can be implemented to minimise the risk of downtime and failures. A typical example is following recognised standards such as the Uptime Institute Tier ratings or TIA-942, and incorporating sustainability rating tools such as Leadership in Energy and Environmental Design (LEED) or the National Australian Built Environment Rating System (NABERS). Whichever standards are followed, all designs should be modular and scalable so that data centres can be easily modified to meet ever-changing business requirements.
Intelligent monitoring systems can give data centre operators the insights they need to make strategic, time-sensitive decisions. These systems can monitor rack conditions, power, cooling systems and batteries in such a way that alerts are created before a failure occurs.
“A specialist build team can strategically place sensors so that the monitoring system is able to collect the information that is needed to make the data centre more efficient and reduce failures,” says Greaves.
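To illustrate how a monitoring system can raise an alert before a failure occurs, the sketch below compares sensor readings against a warning limit that sits ahead of the critical limit. It is a simplified example only: the metric names and limit values are hypothetical and are not taken from any particular monitoring product or from Aurecon’s practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    warning: float                 # early-warning level, reached before failure
    critical: float                # level at which failure is likely
    higher_is_worse: bool = True   # False for metrics such as battery charge


# Hypothetical limits, for illustration only.
LIMITS = {
    "rack_inlet_temp_c": Limit(warning=27.0, critical=32.0),
    "ups_load_pct": Limit(warning=80.0, critical=95.0),
    "battery_charge_pct": Limit(warning=50.0, critical=20.0, higher_is_worse=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for a single reading."""
    limit = LIMITS[metric]
    if limit.higher_is_worse:
        if value >= limit.critical:
            return "critical"
        if value >= limit.warning:
            return "warning"
    else:
        if value <= limit.critical:
            return "critical"
        if value <= limit.warning:
            return "warning"
    return "ok"


def alerts(readings: dict[str, float]) -> dict[str, str]:
    """Return only the readings that need an operator's attention."""
    result = {}
    for metric, value in readings.items():
        status = classify(metric, value)
        if status != "ok":
            result[metric] = status
    return result


if __name__ == "__main__":
    sample = {"rack_inlet_temp_c": 28.5, "ups_load_pct": 60.0, "battery_charge_pct": 15.0}
    print(alerts(sample))
```

The gap between the warning and critical limits is what gives operators time to act, which is why sensor placement and sensible limits matter as much as the monitoring software itself.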
As mentioned above, a large proportion (48%) of data centre outages are caused by human error, not equipment failure. For example, accidental shutdowns are still a leading cause of data centre outages.
“A number of measures can be put in place to reduce human error. Besides properly training staff, data centre operators can also enforce stricter food/beverage policies by ensuring people don’t drink and eat near equipment, shield emergency ‘off’ buttons, and document maintenance procedures. Everyone working in a data centre should have knowledge of the IT equipment within the facility,” says Greaves.
Scenario planning can be complex as it has to address a wide range of possible disruptions to the data centre. Detailed scenario planning needs to address everything from the physical infrastructure and building location to power generation, critical systems and network infrastructure.
“When working with a client, we often go through the planned operations of the data centre highlighting typical disruptions in order for the operators to understand which systems can be impacted by a specific event. We also define potential planned disruptions, such as capacity expansion, scheduled maintenance or end-of-life replacement and create an action plan for each event. This will be followed by a walk-through and rehearsal of this action plan, which is then refined and improved,” says Greaves.
“BIM modelling can also be used to simulate the ‘What if’ scenarios for clients. This type of virtualisation can give clients insights into the strategies they need to implement to avoid system failures,” he says.
Aurecon has developed a Work Method Statement that is being used successfully by data centres to minimise failures.
“We’ve found that a combination of engineering specifications, design documents, operational matrices and implementation plans needs to be developed to help clients understand how their facility is meant to operate, how the individual systems can fail, how this failure impacts the entire data centre and how they will identify these failures. Understanding all the various aspects is a critical part of avoiding mission critical outages,” asserts Greaves.