Development and implementation of a temperature monitoring system for HPC systems
In high-performance computing (HPC), removing the heat released by the hardware is a challenging problem because the density of computing power keeps increasing. A temperature monitoring system provides insight into the heat development of an HPC cluster. Its effectiveness depends directly on the number of sensors, their placement, and the accuracy of the temperature measurements. Monitoring is important not only for investigating the efficiency of the cooling system in order to detect defective operation of the HPC system, but also for improving the cooling of the servers and thereby the achievable performance. The main purpose of fine-grained, unified temperature monitoring is to make it possible to optimize applications and their execution with respect to the temperature distribution on HPC systems. To this end, we present a monitoring infrastructure that is highly flexible and scalable in terms of cable length and number of sensors, and at the same time budget-friendly. It is based on low-cost components: a Raspberry Pi as the monitoring client and the DS18B20 digital thermometer as the temperature sensor. We focus on the selection of adequate temperature sensors and explain in detail how the sensors are assembled and how quality assurance is performed before they are used in the monitoring setup.
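To illustrate how a Raspberry Pi client might poll DS18B20 sensors, the following is a minimal sketch rather than the implementation described in this work. It assumes the sensors are exposed through the Linux kernel's 1-Wire sysfs interface (the w1_therm driver), where each DS18B20 appears as a directory with family code 28 containing a w1_slave file. The function names parse_w1_slave and read_all_sensors are hypothetical and chosen only for this example.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: default location of the Linux 1-Wire sysfs tree (w1_therm driver).
W1_DEVICES = Path("/sys/bus/w1/devices")


def parse_w1_slave(text: str) -> Optional[float]:
    """Return the temperature in degrees Celsius, or None if the reading is invalid.

    A w1_slave file holds two lines: the first ends in YES or NO (CRC check),
    the second ends in t=<temperature in millidegrees Celsius>.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].strip().endswith("YES"):
        return None  # missing data or failed CRC check
    _, sep, raw = lines[1].partition("t=")
    if not sep:
        return None  # no temperature field found
    try:
        return int(raw.strip()) / 1000.0
    except ValueError:
        return None


def read_all_sensors(base: Path = W1_DEVICES) -> Dict[str, Optional[float]]:
    """Read every DS18B20 (family code 28) found below `base`."""
    readings: Dict[str, Optional[float]] = {}
    for dev in sorted(base.glob("28-*")):
        try:
            readings[dev.name] = parse_w1_slave((dev / "w1_slave").read_text())
        except OSError:
            readings[dev.name] = None  # sensor vanished or bus error
    return readings


if __name__ == "__main__":
    for sensor_id, temp in read_all_sensors().items():
        print(sensor_id, "read error" if temp is None else f"{temp:.3f} C")
```

Keeping the parsing separate from the file access allows the CRC handling to be checked without hardware. A reading whose CRC line ends in NO is reported as missing rather than as a temperature, so a faulty sensor or a disturbed bus does not feed wrong values into the monitoring data.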