H610S node goes offline and enters a boot loop due to an uncorrectable NVDIMM error
Applies to
- NetApp SolidFire H610S (with BIOS 3B06)
- NetApp Element software 12.3.x and earlier
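Issue
- One or more nodes are offline and stuck in a boot loop
- The node attempts to boot but fails before Element loads
- The node reboots immediately after the NetApp splash screen
- The BMC System Event Log (SEL) shows the following: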
[CATERR] Machine Check Exception (MCERR)
[MCERR] Uncorrectable Error - Machine Check Error
[Memory Error] Uncorrectable ECC(CPU0_<xx>)
- Messages about offline or degraded volumes might be displayed
Example: Active IQ error alerts when multiple nodes are affected
The following volumes are offline. [X, X, X, X, X, X]
The SolidFire Application cannot communicate with Storage node having node ID 11.
Cluster Block Data is in a degraded state, and the auto-heal process to restore full block data redundancy cannot proceed. Either too many nodes or block services are offline, or the cluster block services are too full.
Example: SEL from the BMC Web GUI
1160 Sep/8/2022 20:16:41 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Deasserted
1159 Sep/8/2022 20:16:36 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1158 Sep/8/2022 20:16:36 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Asserted
1157 Sep/8/2022 20:16:35 [Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted
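The BMC Web GUI entries above follow a regular pattern (record ID, date, time, then three bracketed fields and a message), so they can be filtered by script when a log is long. The following is a minimal sketch that assumes exactly that layout. The names `SelEntry`, `parse_sel_line`, and `critical_entries`, and the field labels `sensor` and `category`, are hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class SelEntry(NamedTuple):
    record_id: int
    timestamp: str
    severity: str
    sensor: str      # label assumed for the second bracketed field
    category: str    # label assumed for the third bracketed field
    message: str


# Layout assumed from the sample log:
# "<id> <Mon/D/YYYY> <HH:MM:SS> [Severity] [Sensor] [Category] message"
ENTRY_RE = re.compile(
    r"^(\d+)\s+(\w{3}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"\[([^\]]+)\]\s+\[([^\]]+)\]\s+\[([^\]]+)\]\s+(.*)$"
)


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    record_id, timestamp, severity, sensor, category, message = match.groups()
    return SelEntry(int(record_id), timestamp, severity, sensor, category, message)


def critical_entries(lines: Iterable[str]) -> List[SelEntry]:
    """Return only the entries whose severity field is 'Critical'."""
    parsed = (parse_sel_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.severity == "Critical"]


sample = [
    "1157 Sep/8/2022 20:16:35 [Warning] [Additional MCE Error] [OEM Record C2] "
    "ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134",
    "1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] "
    "Machine Check Exception (MCERR) - Asserted",
    "1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] "
    "Uncorrectable ECC(CPU0_F1) - Asserted",
]

for entry in critical_entries(sample):
    print(entry.record_id, entry.sensor, entry.message)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so a partially copied log can still be filtered.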
Note: NVDIMMs are installed in specific slots on H610S models:
- H610S1 / S2: CPU0_C0 and CPU0_F0
- H610S4: CPU0_C1 and CPU0_F1
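The slot mapping in the note above can be used to check whether the slot named in an "Uncorrectable ECC" event holds an NVDIMM. The following is a rough sketch of that check. The dictionary keys (including "H610S2", expanded here from "S2") and the function names `failing_slot` and `is_nvdimm_slot` are assumptions made for illustration; the slot names come from the note.

```python
import re
from typing import Optional

# Slot mapping taken from the note above; model keys are an assumed spelling.
NVDIMM_SLOTS = {
    "H610S1": {"CPU0_C0", "CPU0_F0"},
    "H610S2": {"CPU0_C0", "CPU0_F0"},
    "H610S4": {"CPU0_C1", "CPU0_F1"},
}

# Matches the slot inside messages such as "Uncorrectable ECC(CPU0_F1) - Asserted".
SLOT_RE = re.compile(r"Uncorrectable ECC\((CPU\d+_[A-Z]\d+)\)")


def failing_slot(message: str) -> Optional[str]:
    """Extract the DIMM slot from an Uncorrectable ECC message, if present."""
    match = SLOT_RE.search(message)
    return match.group(1) if match else None


def is_nvdimm_slot(model: str, slot: str) -> bool:
    """Return True if the slot is listed as an NVDIMM slot for the given model."""
    return slot in NVDIMM_SLOTS.get(model, set())


slot = failing_slot("Uncorrectable ECC(CPU0_F1) - Asserted")
print(slot, is_nvdimm_slot("H610S4", slot))
```

For the sample SEL entry, the extracted slot is CPU0_F1, which the note lists as an NVDIMM slot on the H610S4.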
Example: SEL from ipmitool output
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 87
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : a1ff29
Description : Uncorrectable ECC

SEL Record ID : 0483
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Processor
Sensor Number : a8
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : ab0102
Description : Uncorrectable machine check exception

SEL Record ID : 0484
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0485
Record Type : c2 (OEM timestamped)
Timestamp : 09/08/2022 20:16:35
Manufactacturer ID : 001c4c
OEM Defined : 000010003401
[......]

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 00ffff
Description : Power off/down

SEL Record ID : 0487
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0488
Record Type : 02
Timestamp : 09/08/2022 20:16:41
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Deassertion Event
Event Data : 00ffff
Description : Power off/down
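The ipmitool records above are "field : value" pairs, with each record starting at "SEL Record ID". As an illustrative sketch, the records can be grouped into dictionaries and filtered for uncorrectable events. This assumes the saved output has one field per line; the function name `parse_sel_records` is hypothetical.

```python
from typing import Dict, List


def parse_sel_records(text: str) -> List[Dict[str, str]]:
    """Group 'field : value' lines into one dict per SEL record.

    Assumes one field per line and that each record begins with
    'SEL Record ID'. Lines without a colon (such as '[......]') are ignored.
    """
    records: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


sample = """\
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Sensor Type : Memory
Event Direction : Assertion Event
Description : Uncorrectable ECC

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Sensor Type : Power Unit
Event Direction : Assertion Event
Description : Power off/down
"""

records = parse_sel_records(sample)
uncorrectable = [r for r in records if r.get("Description", "").startswith("Uncorrectable")]
for record in uncorrectable:
    print(record["SEL Record ID"], record["Sensor Type"], record["Description"])
```

Keeping each record as a plain dictionary avoids hard-coding the field list, which differs between standard records and OEM records such as record type c2.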