HA pair panics and reboots because of a disk shelf power failure
Applies to
- ONTAP 9
- NS224
Issue
- Both nodes in the HA pair panic and reboot after losing access to the disks:
Sat Nov 25 10:17:05 +0000 [netapp01-01: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
Sat Nov 25 10:17:06 +0000 [netapp01-02: fmmbx_instanceWorker: sk.panic:alert]: Panic String: Permanent errors on all HA mailbox disks (while marshalling header) in SK process fmmbx_instanceWorker on release 9.11.1P8 (C)
- Link-down alerts are logged for the storage ports connected to the shelf:
Sat Nov 25 10:15:39 +0000 [netapp01-01: kernel: netif.linkDown:info]: Ethernet e10b: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: intr: netif.linkDown:info]: Ethernet e10b-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: kernel: netif.linkDown:info]: Ethernet e2a: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: intr: netif.linkDown:info]: Ethernet e2a-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: kernel: netif.linkDown:info]: Ethernet e2a: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: intr: netif.linkDown:info]: Ethernet e2a-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: kernel: netif.linkDown:info]: Ethernet e10b: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: intr: netif.linkDown:info]: Ethernet e10b-30: Link down, check cable.
- The shelf power failure might not be recorded in the EMS log in AutoSupport.
- The shelf log reports PSU alerts from the power manager:
Sat Nov 25 10:14:42 2023 ( 148+23:59:17.135); 030B0060; S1; ENC_MGT; power_manager; 04; PCM 2 local fan power restored
Sat Nov 25 10:14:42 2023 ( 148+23:59:17.135); 030B0084; S1; ENC_MGT; power_manager; 02; Clearing PSU AC Missing (non-redundant) alarm
Sat Nov 25 10:14:43 2023 ( 148+23:59:18.126); 030B005C; S1; ENC_MGT; power_manager; 04; PCM 2 fault cleared, assume power restored (1600W)
Sat Nov 25 10:14:43 2023 ( 148+23:59:18.126); 030B0078; S1; ENC_MGT; power_manager; 02; Clearing PSU Fail (non-redundant) alarm
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 1 DC FAILURE Fault Detected
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B0072; S1; ENC_MGT; power_manager; 02; Setting FAIL MIN REDUNDANT alarm for PCM 1
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B005B; S1; ENC_MGT; power_manager; 04; PCM 1 faults indicate loss of power (1600W)
Sat Nov 25 10:14:52 2023 ( 148+23:59:27.124); 030B005C; S1; ENC_MGT; power_manager; 04; PCM 1 fault cleared, assume power restored (1600W)
Sat Nov 25 10:14:52 2023 ( 148+23:59:27.124); 030B0076; S1; ENC_MGT; power_manager; 02; Clearing PSU Fail (min-redundant) alarm
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 PCM FAILURE Fault Detected
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B0072; S1; ENC_MGT; power_manager; 02; Setting FAIL MIN REDUNDANT alarm for PCM 2
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 TURNED OFF Fault Detected
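Because the power events appear only in the shelf log while the link-down and panic events appear in EMS, it can help to merge both into one timeline. The following is a minimal sketch, not a NetApp tool: the function names `parse_line` and `build_timeline` are hypothetical, the regular expressions are derived only from the log excerpts above, and it assumes that the shelf and controller clocks are synchronized and that the year for the EMS lines (which do not carry one) is supplied by the caller.

```python
import re
from datetime import datetime

# Patterns inferred from the excerpts in this article (assumption: other
# ONTAP releases or shelf firmware may format lines differently).
EMS_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}) [+-]\d{4} "
    r"\[(?P<node>[^:]+): (?P<proc>[^:]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)
SHELF_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) \([^)]*\); "
    r"(?P<code>\w+); (?P<slot>\w+); (?P<subsys>\w+); (?P<module>\w+); "
    r"(?P<level>\w+); (?P<msg>.*)$"
)
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_line(line, ems_year):
    """Return (timestamp, source, message), or None if the line is not recognized."""
    line = line.strip()
    m = SHELF_RE.match(line)
    if m:
        when = datetime.strptime(m["ts"], TS_FORMAT)
        return when, "shelf:" + m["module"], m["msg"]
    m = EMS_RE.match(line)
    if m:
        # EMS lines have no year, so the caller-supplied year is appended.
        when = datetime.strptime(f"{m['ts']} {ems_year}", TS_FORMAT)
        return when, f"{m['node']}:{m['event']}", m["msg"]
    return None


def build_timeline(lines, ems_year):
    """Parse mixed EMS and shelf log lines and sort them chronologically."""
    events = [e for e in (parse_line(l, ems_year) for l in lines) if e]
    return sorted(events, key=lambda e: e[0])


if __name__ == "__main__":
    sample = [
        "Sat Nov 25 10:15:39 +0000 [netapp01-01: kernel: netif.linkDown:info]: Ethernet e10b: Link down, check cable.",
        "Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 1 DC FAILURE Fault Detected",
    ]
    for when, source, msg in build_timeline(sample, ems_year=2023):
        print(when.isoformat(), source, msg)
```

Sorting the merged events places the PCM faults ahead of the link-down alerts and the mailbox-disk panic, which matches the sequence shown in the excerpts above.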