H610S node is offline and in a boot loop due to an uncorrectable NVDIMM error

Category:
element-software
Specialty:
solidfire

Applies to

  • NetApp SolidFire H610S (with BIOS 3B06)
  • NetApp Element software 12.3.X and earlier

Issue

  • One or more nodes are offline and in a boot loop
    • The node attempts to boot but fails before Element loads
    • The reboot occurs immediately after the NetApp splash screen
  • The BMC System Event Log (SEL) shows the following entries:
    • [CATERR] Machine Check Exception (MCERR)
    • [MCERR] Uncorrectable Error - Machine Check Error
    • [Memory Error] Uncorrectable ECC(CPU0_<xx>)
  • Messages about offline or degraded volumes may be displayed
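
The three SEL signatures listed above can be checked for mechanically once the SEL has been saved as text. The following is a minimal sketch, not a NetApp tool: the function name `match_signatures`, the `SIGNATURES` table, and the assumption that the SEL is available as one entry per line in the BMC Web GUI layout are all illustrative.

```python
import re

# Patterns built from the three SEL entries listed in this article.
# The dictionary keys are hypothetical labels used only in this sketch.
SIGNATURES = {
    "caterr": re.compile(r"\[CATERR\].*Machine Check Exception \(MCERR\)"),
    "mcerr": re.compile(r"\[MCERR\].*Uncorrectable Error - Machine Check Error"),
    "memory": re.compile(r"\[Memory Error\].*Uncorrectable ECC\((CPU\d+_[A-Z]\d+)\)"),
}


def match_signatures(sel_lines):
    """Return (signature names found, DIMM slots named in memory errors)."""
    found, slots = set(), set()
    for line in sel_lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                found.add(name)
                if name == "memory":
                    slots.add(match.group(1))
    return found, slots


# Sample entries copied from the BMC Web GUI SEL shown in this article.
SAMPLE = [
    "1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted",
    "1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted",
    "1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted",
]

found, slots = match_signatures(SAMPLE)
all_present = found == set(SIGNATURES)
```

With the sample entries, all three signatures are present and the only slot reported is `CPU0_F1`. Seeing all three together is what this article describes; a single match on its own should not be treated as conclusive.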

Active IQ error alerts when multiple nodes are affected:

The following volumes are offline. [X, X, X, X, X, X]

The SolidFire Application cannot communicate with Storage node having node ID 11.

Cluster Block Data is in a degraded state, and the auto-heal process to restore full block data redundancy cannot proceed. Either too many nodes or block services are offline, or the cluster block services are too full.

SEL from the BMC Web GUI:

 1160 Sep/8/2022 20:16:41 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Deasserted
 1159 Sep/8/2022 20:16:36 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
 1158 Sep/8/2022 20:16:36 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Asserted
 1157 Sep/8/2022 20:16:35 [Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
 1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
 1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
 1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted

NVDIMMs are installed in specific slots, depending on the H610S model:

  • H610S1 / S2: CPU0_C0 and CPU0_F0
  • H610S4: CPU0_C1 and CPU0_F1
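
To tell whether an "Uncorrectable ECC" entry points at an NVDIMM rather than a regular DIMM, the slot named in the entry can be compared against the slot list above. The sketch below assumes the model strings are spelled as in this article; the names `NVDIMM_SLOTS`, `slot_from_sel`, and `is_nvdimm_slot` are hypothetical helpers, not part of any NetApp utility.

```python
import re

# Slot assignments as stated in this article; keys follow its model spelling.
NVDIMM_SLOTS = {
    "H610S1": {"CPU0_C0", "CPU0_F0"},
    "H610S2": {"CPU0_C0", "CPU0_F0"},
    "H610S4": {"CPU0_C1", "CPU0_F1"},
}

SLOT_PATTERN = re.compile(r"Uncorrectable ECC\((CPU\d+_[A-Z]\d+)\)")


def slot_from_sel(entry):
    """Extract the DIMM slot from an 'Uncorrectable ECC(<slot>)' SEL entry, or None."""
    match = SLOT_PATTERN.search(entry)
    return match.group(1) if match else None


def is_nvdimm_slot(model, slot):
    """True if the slot holds an NVDIMM on the given model (unknown models give False)."""
    return slot in NVDIMM_SLOTS.get(model, set())


entry = "1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted"
slot = slot_from_sel(entry)
on_h610s4 = is_nvdimm_slot("H610S4", slot)
on_h610s1 = is_nvdimm_slot("H610S1", slot)
```

For the sample entry, the slot is `CPU0_F1`, which is an NVDIMM slot on an H610S4 but not on an H610S1 / S2.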

SEL from ipmitool output:

SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 87
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : a1ff29
Description : Uncorrectable ECC

SEL Record ID : 0483
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Processor
Sensor Number : a8
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : ab0102
Description : Uncorrectable machine check exception

SEL Record ID : 0484
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0485
Record Type : c2 (OEM timestamped)
Timestamp : 09/08/2022 20:16:35
Manufactacturer ID : 001c4c
OEM Defined : 000010003401
[......]

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 00ffff
Description : Power off/down

SEL Record ID : 0487
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0488
Record Type : 02
Timestamp : 09/08/2022 20:16:41
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Deassertion Event
Event Data : 00ffff
Description : Power off/down
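
Record dumps like the ipmitool output above are easier to filter once they are split into per-record fields. The sketch below assumes the output has been saved as text with one `Field : value` pair per line and that each record starts with `SEL Record ID`; the function names `parse_sel_records` and `uncorrectable` are illustrative only.

```python
def parse_sel_records(text):
    """Split verbose SEL text into a list of {field: value} dictionaries."""
    records = []
    for raw in text.splitlines():
        # Split at the first colon only, so timestamps keep their own colons.
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank lines and markers such as "[......]"
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            records.append({})  # a new record begins here
        if records:
            records[-1][key] = value
    return records


def uncorrectable(records):
    """Keep only records whose description reports an uncorrectable error."""
    return [r for r in records if r.get("Description", "").startswith("Uncorrectable")]


# Three records abridged from the output shown in this article.
SAMPLE = """\
SEL Record ID : 0482
Timestamp : 09/08/2022 20:16:35
Sensor Type : Memory
Description : Uncorrectable ECC

SEL Record ID : 0483
Timestamp : 09/08/2022 20:16:35
Sensor Type : Processor
Description : Uncorrectable machine check exception

SEL Record ID : 0486
Timestamp : 09/08/2022 20:16:36
Sensor Type : Power Unit
Description : Power off/down
"""

records = parse_sel_records(SAMPLE)
errors = uncorrectable(records)
error_ids = [r["SEL Record ID"] for r in errors]
```

On the abridged sample, the filter keeps records 0482 and 0483 and drops the power event 0486, which makes the memory error and the machine check that accompanies it easy to spot by timestamp.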

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.