The criticality of data centres continues to rise, leading to more industry-wide discussions regarding the meaning of ‘mission critical’ and how data centre owners can minimise the risk of downtime in their facilities.
‘Mission critical’ is a broad term that refers to any activity, device, service or system whose failure or disruption would cause a failure in business operations. Mission critical failures can have huge financial repercussions for a business.
In July 2019, Mark Zuckerberg’s three online platforms – Facebook, Instagram and WhatsApp – experienced simultaneous outages due to ‘routine maintenance’, causing havoc for their users across the globe. In February 2020, more than 100 flights to and from one of the world’s busiest airports – Heathrow – were disrupted due to a technical issue.
According to research by Emerson Network Power and the Ponemon Institute, the average cost of a data centre outage is USD9000 per minute, and the average cost of a single event is USD740 000. Uninterruptible power supply (UPS) system failure is the number one cause of outages, and cybercrime is the fastest-growing cause.
The financial implications and the discomfort caused to companies and their customers are huge, which has led data centre users, owners and operators to ask what can be done. While failure and loss are sometimes inevitable in such operations, there are many actions that companies can take to minimise risks to their mission critical data centres.
To build a highly efficient data centre, a specialist build team needs to be appointed to determine design criteria that need to be customised for the specific data centre.
Different data centres require different designs. A co-location facility, for example, will require a vastly different design than a data centre for a telecommunications company. Appointing a specialist build team can ensure that everything from equipment to floor layout is planned, so the risk of downtime is minimised.
Generators, UPSs, cooling systems, servers and other equipment need periodic maintenance and should therefore be installed so that maintenance can be carried out with ease. Any electrical or mechanical component of the facility’s infrastructure can fail, so building for concurrent maintainability helps prevent failures by giving technicians easy access for ongoing maintenance.
This is another reason to hire a specialist build team, which can, for instance, ensure that there are backup generators and UPSs, so that a component can be maintained or even replaced, whether for scheduled maintenance or after a failure.
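The idea of keeping backup units so that one can be serviced at any time can be sketched as a simple capacity check. This is a minimal illustration only: the `Unit` class, the unit names and the kW figures are hypothetical, and a real assessment would follow the chosen design standard rather than this single rule.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str            # hypothetical label for a generator or UPS
    capacity_kw: float   # illustrative rated capacity


def is_concurrently_maintainable(units: list[Unit], load_kw: float) -> bool:
    """Return True if any single unit can be taken offline while the
    remaining units still cover the load."""
    if len(units) < 2:
        # With fewer than two units there is nothing left to carry the load.
        return False
    total = sum(u.capacity_kw for u in units)
    # The worst case is taking the largest unit out of service.
    largest = max(u.capacity_kw for u in units)
    return total - largest >= load_kw


# Assumed example: three 500 kW UPS units.
ups_units = [Unit("UPS-A", 500.0), Unit("UPS-B", 500.0), Unit("UPS-C", 500.0)]

print(is_concurrently_maintainable(ups_units, load_kw=900.0))   # True
print(is_concurrently_maintainable(ups_units, load_kw=1200.0))  # False
```

With a 900 kW load, any one unit can be removed and the other two still supply 1000 kW; with a 1200 kW load, they cannot.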
There are a number of design strategies that can be implemented to minimise the risk of downtime and failures. A typical example is following recognised standards such as the Uptime Institute Tier ratings or TIA-942, and incorporating sustainability rating tools such as Leadership in Energy and Environmental Design (LEED) or the National Australian Built Environment Rating System (NABERS).
Whichever standards are followed, all designs should be modular and scalable so that data centres can be easily modified to meet ever-changing business requirements.
Intelligent monitoring systems can give data centre operators the insights they need to make strategic, time-sensitive decisions. These systems can monitor rack conditions, power, cooling systems and batteries in such a way that alerts are created before a failure occurs.
A specialist build team can strategically place sensors so that the monitoring system is able to collect the information that is needed to make the data centre more efficient and reduce failures.
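As a rough sketch of how a monitoring system might raise an alert before a failure occurs, the example below compares sensor readings against warning and critical limits. The metric names and threshold values are assumptions chosen for illustration, not figures from any standard or vendor product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # level at which operators should be notified
    critical: float  # level at which immediate action is needed


# Hypothetical limits; real values depend on the equipment and the site.
THRESHOLDS = {
    "rack_inlet_temp_c": Threshold(warn=27.0, critical=32.0),
    "ups_load_pct": Threshold(warn=80.0, critical=95.0),
}


def evaluate(metric: str, value: float) -> str:
    """Classify a reading as 'ok', 'warning' or 'critical'."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warn:
        return "warning"
    return "ok"


# Assumed sample readings from rack and UPS sensors.
readings = {"rack_inlet_temp_c": 28.5, "ups_load_pct": 62.0}

# Keep only the readings that need attention.
alerts = {}
for metric, value in readings.items():
    status = evaluate(metric, value)
    if status != "ok":
        alerts[metric] = status

print(alerts)  # {'rack_inlet_temp_c': 'warning'}
```

The warning level is what gives operators time to act: the rack inlet temperature is flagged while it is still below the critical limit.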
Data centre outages can be caused by human error rather than equipment failure. For example, accidental shutdowns are still a leading cause of data centre outages.
A number of measures can be put in place to reduce human error. Besides properly training staff, data centre operators can also enforce stricter food/beverage policies by ensuring people don’t drink and eat near equipment, shield emergency ‘off’ buttons, and document maintenance procedures. Everyone working in a data centre should have knowledge of the IT equipment within the facility.
Scenario planning can be complex as it needs to address a wide range of possible disruptions to the data centre. Detailed scenario planning should address everything from the physical infrastructure and building location to power generation, critical systems and network infrastructure.
Building information modelling (BIM) can also be used to simulate ‘what if’ scenarios. This type of virtualisation can give insights into the strategies needed to avoid system failures.
Aurecon has developed a Work Method Statement that is being used successfully by data centres to minimise failures.
A combination of engineering specifications, design documents, operational matrices and implementation plans needs to be developed to help data centre operators understand how their facility is meant to operate, how the individual systems can fail, how such a failure affects the entire data centre and how they will identify these failures. Understanding all of these aspects is a critical part of avoiding mission critical outages.
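One way to reason about how an individual failure affects the whole facility is to write the dependencies down and trace a ‘what if’ scenario through them. The sketch below is a simplified illustration, not Aurecon’s method: the system names and the `REQUIRES` map are hypothetical, and each system is assumed to stay up as long as at least one alternative in every group it requires is still running.

```python
# Hypothetical dependency map. Each system lists groups of alternatives;
# it stays up only if every group has at least one member still running.
REQUIRES = {
    "utility_feed": [],
    "generator": [],
    "ups": [{"utility_feed", "generator"}],      # either power source will do
    "chiller": [{"utility_feed", "generator"}],
    "racks": [{"ups"}, {"chiller"}],             # needs both power and cooling
}


def impacted(failed: set[str]) -> set[str]:
    """Return every system that is down once the failures have propagated."""
    down = set(failed)
    changed = True
    while changed:
        changed = False
        for system, groups in REQUIRES.items():
            if system in down:
                continue
            # A group is unsatisfied when all of its alternatives are down.
            if any(group <= down for group in groups):
                down.add(system)
                changed = True
    return down


print(sorted(impacted({"utility_feed"})))
# ['utility_feed']  (the generator carries the load)

print(sorted(impacted({"utility_feed", "generator"})))
# ['chiller', 'generator', 'racks', 'ups', 'utility_feed']

print(sorted(impacted({"chiller"})))
# ['chiller', 'racks']
```

In this toy model, losing the utility feed alone is contained, while losing the single chiller takes the racks down, which points to where redundancy is missing.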
Shayne Parkin is a Technical Director and Aurecon’s Buildings Electrical Practice Leader in Victoria, focused on high reliability and complex engineering projects.