Bug in AMD EPYC 7002 CPUs causes a hang after 1044 days of operation

The AMD EPYC 7002 ("Rome") series of server processors, based on the "Zen 2" microarchitecture and shipped since 2018, has a bug that causes the processor to hang after 1044 days of operation without a state reset (system reboot). As a workaround, it is recommended to disable support for the CC6 power-saving mode or to restart the server at least once every 1044 days (approximately 2 years 10 months).
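The reboot workaround can be monitored with a small script. The following is a minimal sketch in Python, assuming a Linux host where /proc/uptime is available and assuming that time since boot is a reasonable stand-in for time since the last CPU state reset. The function names and the use of the 1044-day figure as a fixed limit are illustrative assumptions, not something AMD provides.

```python
from pathlib import Path

LIMIT_DAYS = 1044          # figure quoted in the article; may vary with REFCLK
SECONDS_PER_DAY = 86_400


def days_until_limit(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Return the days left before uptime reaches the limit (negative if past it)."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot from the first field of Linux's /proc/uptime."""
    return float(Path(path).read_text().split()[0])
```

On an affected server, days_until_limit(read_uptime_seconds()) gives a rough number of days left in which to schedule a reboot. Since the article notes that the actual time depends on the REFCLK frequency, it is safer to leave a generous margin.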

According to AMD, the hang is caused by a failure that occurs when a processor core tries to wake up from the CC6 power-saving mode (core C6, which lowers the voltage when the core is idle) once the timer reaches 1044 days since the last CPU state reset (the exact time may vary depending on the REFCLK frequency).

AMD has not given a more detailed explanation of the cause of the failure. According to a hypothesis published on Reddit, the hang occurs when the counter in the TSC (Time Stamp Counter) register, which counts cycles since reset, reaches the value 0x380000000000000 at a frequency of 2800 MHz (2800 * 10**6 Hz * 86400 s/day * 1042.5 days, i.e. after 1042 days and 12 hours).
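The arithmetic behind this hypothesis can be checked directly. The sketch below converts the quoted TSC value into days, assuming the counter ticks at a constant 2800 MHz as in the Reddit estimate. The constant and function names are illustrative.

```python
TSC_VALUE = 0x380000000000000   # counter value from the Reddit hypothesis
TSC_HZ = 2800 * 10**6           # assumed TSC frequency: 2800 MHz
SECONDS_PER_DAY = 86_400


def tsc_to_days(ticks: int, hz: int = TSC_HZ) -> float:
    """Convert a TSC tick count to elapsed days at a constant frequency."""
    return ticks / hz / SECONDS_PER_DAY


# Split the result into whole days and leftover hours.
total_seconds = TSC_VALUE // TSC_HZ
whole_days, remainder = divmod(total_seconds, SECONDS_PER_DAY)
hours = remainder / 3600
print(f"{whole_days} days and {hours:.1f} hours")
```

The value 0x380000000000000 equals 56 * 2**52, which at 2800 MHz comes to just under 1042.5 days. A different TSC frequency would shift the result, which is consistent with AMD's remark that the time depends on the REFCLK frequency.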

AMD does not plan to publish a fix for the bug. The problem went unnoticed for a long time because multi-year uptimes are not typical for servers, which have to be restarted periodically to install kernel updates or to move to a new release of the operating system. However, the kernel update methods in Linux distributions that do not require a reboot, together with long maintenance cycles (Ubuntu, RHEL, and SUSE are supported for 10 years), can leave servers running for a long time without a reboot.

Source: opennet.ru
