A short note on an incident with an LSI RAID controller overheating in a server in a cold data center

TL;DR: Setting the cooling mode of a Supermicro server to Optimal does not guarantee stable operation of an LSI MegaRAID 9361-8i controller in a cold data center.

We try not to use hardware RAID controllers, but we have one customer who prefers LSI MegaRAID configurations. Today we ran into an overheating MegaRAID 9361-8i card: the platform did not register any overheating, but its RAID controller did.

The view of the platform with a RAID card is shown in the figures below:

[Two photos of the platform with the RAID card installed]

A few important points related to this server and operating environment:

The engineer who assembled the platform deliberately placed two fans in front of the card, because he knows that LSI controllers run very hot. Note the motherboard: it barely extends under the controller, ending about 3 cm past the PCI-E slot.

As you can see, all fans are connected properly to the Supermicro motherboard, and in Optimal mode they spin according to the readings of the sensors on the board and the CPU temperature.

This platform has a Xeon E-2236, a CPU that runs very cool and that apparently did not get very hot under the client's workload.

The data center where this server is located is very cold: the cold aisle supplies air at 18-20 °C.

The combination of these factors led to a very interesting phenomenon - overheating of the RAID controller.

A possible chain of events

  1. The cool CPU and motherboard told the fans that low airflow was enough.
  2. There was no motherboard under the RAID card, so there were no sensors there that could detect the overheating.
  3. The fans, set to Optimal mode, ran slowly, according to the needs of the motherboard and CPU.
  4. The controller, not receiving enough airflow, overheated.
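
The chain above can be illustrated with a toy model. This is only a sketch: the fan curve, its thresholds, and the temperatures are made-up values for illustration, not Supermicro's actual Optimal-mode logic. The point it shows is that a control loop which reads only the CPU temperature gives the same low fan speed however hot the RAID card gets.

```python
from dataclasses import dataclass


@dataclass
class Readings:
    cpu_temp_c: float         # visible to the motherboard's fan control
    controller_temp_c: float  # RAID card temperature, not visible to it


def fan_duty_percent(r: Readings) -> int:
    """Hypothetical fan curve that consults only the CPU temperature."""
    if r.cpu_temp_c < 50:  # illustrative thresholds, not Supermicro's
        return 30
    if r.cpu_temp_c < 70:
        return 60
    return 100


# Illustrative readings: same cool CPU, very different card temperatures.
cool_card = Readings(cpu_temp_c=35, controller_temp_c=55)
hot_card = Readings(cpu_temp_c=35, controller_temp_c=95)

# The CPU temperature is identical, so the fans get the same low duty
# cycle in both cases, regardless of the controller temperature.
print(fan_duty_percent(cool_card), fan_duty_percent(hot_card))
```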

What we did

We switched the fans to "Standard" mode; if necessary, we will move to a higher-performance mode.
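
The post does not say how the mode was switched (the BMC web interface works for this). As a rough sketch, the same change is often scripted through `ipmitool` with a Supermicro-specific raw command. The raw bytes and mode codes below are values commonly reported for Supermicro BMCs, not something confirmed by this incident, so treat them as assumptions and check them against your board before use. The helper names are hypothetical.

```python
import subprocess
from typing import List

# Mode codes commonly reported for Supermicro BMCs (assumption: verify
# against your board's documentation before relying on them).
FAN_MODES = {"standard": 0x00, "full": 0x01, "optimal": 0x02, "heavy_io": 0x04}


def fan_mode_command(mode: str) -> List[str]:
    """Build the ipmitool argv that would request the given fan mode."""
    code = FAN_MODES[mode]  # raises KeyError for an unknown mode name
    return ["ipmitool", "raw", "0x30", "0x45", "0x01", f"0x{code:02x}"]


def set_fan_mode(mode: str) -> None:
    """Run the command locally; needs ipmitool and access to the BMC."""
    subprocess.run(fan_mode_command(mode), check=True)


# Only print the command here; call set_fan_mode("standard") to apply it.
print(" ".join(fan_mode_command("standard")))
```

Building the argument list separately from running it makes it easy to review the exact command before it touches the BMC.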

Conclusions

Chances are that if the data center's cold aisle were not as cold, or if the client were loading the CPU heavily, this issue would not have occurred, because the fans would have been running faster.

For ourselves, we decided to always switch the fan mode on servers with RAID controllers from Optimal to a mode with a higher fan speed.
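
Since the motherboard cannot see the card's temperature, one way to check that a faster fan mode really helps is to poll the controller's own sensor. The sketch below assumes the `storcli64` utility is installed and that `storcli64 /c0 show temperature` prints a "ROC temperature(Degree Celsius)" line. The sample output is an assumed format, not captured from this server, and the function names are hypothetical. No threshold is built in: the caller has to supply a limit taken from the controller's documentation.

```python
import re
import subprocess
from typing import Optional

# Assumed shape of the tool's output; the real text may differ by version.
SAMPLE_OUTPUT = """\
Ctrl_Prop                       Value
--------------------------------------
ROC temperature(Degree Celsius) 58
--------------------------------------
"""

_ROC_TEMP = re.compile(
    r"ROC temperature\s*\(Degree Celsius\)\s*=?\s*(\d+)", re.IGNORECASE
)


def parse_roc_temperature(text: str) -> Optional[int]:
    """Return the controller chip temperature in Celsius, or None if absent."""
    match = _ROC_TEMP.search(text)
    return int(match.group(1)) if match else None


def read_roc_temperature(controller: int = 0) -> Optional[int]:
    """Query the controller through storcli64 (must be installed locally)."""
    result = subprocess.run(
        ["storcli64", f"/c{controller}", "show", "temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_roc_temperature(result.stdout)


def is_too_hot(temp_c: int, limit_c: int) -> bool:
    """Compare a reading against a limit chosen by the operator."""
    return temp_c >= limit_c


print(parse_roc_temperature(SAMPLE_OUTPUT))
```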

Source: habr.com
