How Thermal Camera Surveys Protect Critical Data Centre Operations

Data centres and server rooms provide a managed and protected environment for
server operations. Within these environments are complex infrastructure systems
that must be inspected and maintained regularly, but how do you spot problems
before they affect operational resilience? Regular maintenance and consumable
replacement are part of the answer, but even these can miss potential issues.
A thermal camera survey can find what they miss.

Thermal Camera Surveys

A thermal camera, also known as an infrared (IR) camera, captures what cannot be
seen with the naked eye, i.e. an image of the infrared radiation emitted by the
different infrastructure components within a building. Images can be taken at
the building incomer, substation transformer and LV switchboard, and all the way
along the critical power path to uninterruptible power supplies, batteries (VRLA
and lithium), power distribution units, electrical sub-distribution and wiring.
HVAC (heating, ventilation and air conditioning) systems can also be imaged,
including chillers and cooling units, as can server racks and containment
arrangements.

For some organisations a thermal camera survey may be mandatory, e.g. as part of
an annual insurance review or to maintain a certification status such as the
Uptime Institute’s Tier-rating system. For others, adding a thermal camera
survey to a preventative maintenance visit can provide more peace of mind,
especially where older systems are deployed that may be approaching the end of
their design or useful life.

It is important to
follow a set and documented procedure when carrying out a thermal camera
survey. The survey itself can be a standalone service or one coupled to another
such as a preventative maintenance visit, data centre audit or risk assessment
project.

Critical Infrastructure ‘Hot-Spots’

In the data centre world the term ‘hot-spots’ is often used to refer to
high-temperature areas within a server rack. These can be caused by several
factors, including poor air flow management and the arrangement of servers and
UPS systems within the rack.

In the electrical world the term is used in the same way, but the cause is not
necessarily poor equipment layout or air flow. ‘Hot-spots’ in a valve regulated
lead acid (VRLA) battery or a set of AC or DC capacitors indicate ageing or poor
manufacture. The heat arises from areas of high internal resistance which, if
not dealt with, could become a fire risk. Within electrical switchgear, high
temperatures can indicate load imbalances, underrated devices and electrical
harmonics. Again, these are issues that, if not tackled, can lead to fire risks
and system breakdowns.

Thermal Image Records

Thermal imagers, such as a Fluke camera, measure surface temperatures and can
store two-dimensional images of an object for comparative purposes. Captured
images can then be used to identify temperature anomalies, i.e. areas that are
hotter or colder than those around them or than expected.
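
The idea of comparing a component with those around it can be sketched in a few
lines of Python. This is only an illustration: the function name, the component
names, the readings and the 10 degree threshold are all assumptions made for the
example, not values taken from any standard or camera software.

```python
from statistics import median


def flag_anomalies(readings_c, threshold_c=10.0):
    """Return components whose surface temperature differs from the group
    median by more than threshold_c (an assumed, illustrative limit)."""
    reference = median(readings_c.values())
    return {
        name: round(temp - reference, 1)
        for name, temp in readings_c.items()
        if abs(temp - reference) > threshold_c
    }


# Hypothetical surface temperatures (degrees C) for similar breakers in one panel.
readings = {
    "breaker_1": 41.0,
    "breaker_2": 42.0,
    "breaker_3": 58.0,
    "breaker_4": 40.5,
}

# Maps each flagged component to its deviation from the group median.
print(flag_anomalies(readings))
```

Comparing against the median of similar components, rather than a fixed
temperature, means a single hot item does not distort the reference. A negative
deviation marks a component that is colder than its neighbours.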

As well as identifying ‘hot-spots’, the images can be stored digitally in a
Cloud service and/or submitted with a visit report. The benefit of retaining the
images is that they provide a thermal audit record of changes in temperature
over the life of an asset or component within the building’s infrastructure.
Changes and anomalies can identify a need for investigation, maintenance, or a
system upgrade or swap-out.
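
As a rough sketch of how retained readings could form an audit trail, the
Python below compares the temperature rise above ambient (a Delta-T) between the
earliest and latest survey of one asset. The record layout, the asset name and
all of the readings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SurveyReading:
    """One stored reading for an asset (hypothetical record layout)."""
    asset: str
    taken_on: date
    surface_c: float
    ambient_c: float

    @property
    def rise_c(self) -> float:
        # Rise above ambient, so surveys taken on warmer or cooler days compare fairly.
        return self.surface_c - self.ambient_c


def drift_since_baseline(history):
    """Change in rise above ambient between the earliest and latest survey."""
    ordered = sorted(history, key=lambda reading: reading.taken_on)
    return ordered[-1].rise_c - ordered[0].rise_c


# Illustrative annual readings for a single battery string.
history = [
    SurveyReading("ups_battery_string_a", date(2021, 6, 1), 29.0, 22.0),
    SurveyReading("ups_battery_string_a", date(2022, 6, 1), 31.0, 22.0),
    SurveyReading("ups_battery_string_a", date(2023, 6, 1), 36.0, 23.0),
]

print(drift_since_baseline(history))
```

What counts as an acceptable drift depends on the asset and the manufacturer's
guidance, so the sketch only reports the change and leaves the decision to
investigate with the engineer.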

A Data Centre Thermal Survey Checklist

Any survey must be
carried out by a suitably qualified engineer and to a set survey procedure. For
any data centre or server room, the survey must be comprehensive in order to
ensure that no critical infrastructure component that could prove to be a
single point of failure is missed.

  1. Substation Transformers: the transformer may be outside the building and the property of the local electricity distribution network operator (DNO), or it may be on the plant asset register of the data centre. It is important to monitor changes in temperature and Delta-Ts annually with respect to windings and lug connections.
  2. HV/LV Switchboards: LV switchboards, like substation transformers, can have a design life measured in decades, but the switchgear will require some maintenance. Active harmonic filters, for example, will have capacitors that require replacement around every 7-8 years. Temperature rises can indicate ageing components that require swap-out.
  3. Electrical Wiring and Sub-distribution Panels: from the LV switchgear, power is provided on sub-circuits via sub-distribution panels. These will have circuit breakers and connecting cables that must be sized and rated correctly for the calculated voltages and currents. Higher than normal temperatures can indicate overheating and underrating, both of which can undermine discrimination and fault paths in the event of a downstream short-circuit.
  4. Backup Power, AMF Panels and Static Transfer Switches (STS): most facilities will have some form of back-up power and may have a static transfer switch arrangement (A and B supplies). For accurate temperature measurement the back-up generators will need to be ‘imaged’ during both their standby and power-on operations, as will the static transfer switches.
  5. UPS Systems and Batteries: the facility may have a large centralised uninterruptible power supply or a decentralised power protection plan. Each UPS and its battery set (lead acid or lithium-ion) must be surveyed under load conditions.
  6. Energy Storage Systems: there is an increasing trend for larger operators to store power locally (from renewable power sources) or to generate revenue with demand side response (DSR) programmes. An energy storage system is similar to a UPS system and will have a lithium-ion type battery. Lithium batteries have more complex battery management systems than lead acid but will still require thermal camera inspection at least annually to help identify potential issues.
  7. Power Distribution Units: within the server racks, PDUs provide the final point of connection of the server and IT loads to the critical power paths. PDUs may experience thermal overloads that could lower their reliability when operated within server rack ‘hot-spots’, or when overloaded but still within their thermal trip settings. Poor connections and faulty wiring can also be exposed by a thermal survey.
  8. HVAC Systems: the cooling system should be surveyed with the same comprehensive approach as the critical power path. The survey should include each sub-component, external chillers and heat exchangers to expose any potential problems and failure points.
  9. Server Racks and Containment: thermal imaging can provide a quick and accurate assessment of the efficiency of cooling within server racks and containment systems. What must be considered here are the air intake and exhaust areas in relation to the power densities, and therefore the heat generated, by the servers themselves. HPC and blade type servers in highly dense deployments generate significant amounts of heat. The survey should help to identify whether the rack air flow or containment arrangement is operating as expected in terms of preventing the cold and hot-aisle air flows from mixing and potentially weakening the overall cooling design. Coupled with air flow measurements, the survey can help to map out the overall air flow in and around these areas.
  10. Raised Access Floors and Ceiling Voids: the underfloor void may hide a plethora of issues, which can be masked if the void is used for cooling and air flow. The survey should take images where possible to identify any thermal issues which could indicate cable damage, poor connections or simply poor layout of the areas.
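
One simple way to make sure no area on the checklist is missed is to track
coverage against it. The Python below is a minimal sketch of that idea. The area
names follow the list above, while the function name and the example visit are
hypothetical.

```python
# Survey areas, in the order they appear in the checklist.
CHECKLIST = (
    "Substation Transformers",
    "HV/LV Switchboards",
    "Electrical Wiring and Sub-distribution Panels",
    "Backup Power, AMF Panels and Static Transfer Switches",
    "UPS Systems and Batteries",
    "Energy Storage Systems",
    "Power Distribution Units",
    "HVAC Systems",
    "Server Racks and Containment",
    "Raised Access Floors and Ceiling Voids",
)


def missing_areas(surveyed):
    """Return checklist areas with no recorded images, in checklist order."""
    done = set(surveyed)
    return [area for area in CHECKLIST if area not in done]


# Hypothetical record of the areas imaged during one visit.
surveyed_today = [
    "Substation Transformers",
    "HV/LV Switchboards",
    "UPS Systems and Batteries",
    "HVAC Systems",
]

# Lists the areas still to be imaged before the survey can be signed off.
for area in missing_areas(surveyed_today):
    print(area)
```

Not every site will have every area, e.g. an energy storage system, so in
practice the checklist would be tailored to the facility before the visit.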

Summary

Air flow design and thermal management are becoming increasingly complex within
data centre and server room environments. Air flow and temperature issues can
arise from changes in the design concept as new technologies are deployed, as
well as from ageing components within the electrical infrastructure. Thermal
camera surveys are becoming more widely accepted, either as separate thermal
audits or as additions to preventative maintenance and fault-finding visits.
Whilst the cameras are relatively low-cost devices, their use and application
require formal training to ensure the survey is comprehensive and does not miss
the single point of failure that could fail catastrophically and interrupt data
centre operations.

Most surveys start from the incoming point to the building and then follow the critical power and cooling route into the server room or data hall. Timing is important, as the most pronounced heat signatures will be captured during peak operational and workload times. There is little point carrying out a thermal survey during off-peak or maintenance periods unless there are suspect and aged systems, such as old transformer-based UPS and battery sets, in operation.

The post How Thermal Camera Surveys Protect Critical Data Centre Operations appeared first on Server Room Environments.