sp.heartbeat.stopped caused by Service Processor network overload
Applies to
- FAS models
- AFF models
Issue
The system may show one or more of the following symptoms.
- Example AutoSupport alerts
HA Group Notification (Health Monitor process cphm: CriticalFruMultiFaultAlert[XXXXXXXXXXXX]) ALERT
HA Group Notification (SP HBT STOPPED) ALERT
HA Group Notification (CONTROLLER TAKEOVER COMPLETE HALT) NOTICE
- Example console output
Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
IPMI:Read midplane FRU common header:timeout
Failed to recover SP
IPMI:Read midplane FRU common header:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
LOADER-A>
IPMI:Read midplane FRU 0 product info:timeout
IPMI:Read midplane FRU 0 product info:failed
Waiting for SP ...
IPMI:Get midplane FRU 1 inventory:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get midplane FRU 1 inventory:failed
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
Waiting for PIDS: /usr/sbin/ypbind 729.
Waiting for PIDS: 695.
Terminated
.
Uptime: 28d13h49m5s
System powering down...
System halting...
BIOS version: 9.3
Portions Copyright (c) 2011-2014 NetApp. All Rights Reserved
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI PCI Slot Control failed.
IPMI:Get controller FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from primary FW image
Waiting for SP ...
IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from backup FW image
Waiting for SP ...
IPMI:Enable PCI slots:timeout
Failed to recover SP
IPMI PCI Slot Control failed.
IPMI PCI Slot Configuration failed.
Configuring Devices ...
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
IPMI:Get NVRAM FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
- Example ONTAP command-line output
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   rebooting   false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   unknown     false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   degraded    true         3.0.2     0.0.0.0
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   offline     true         3.0.2     0.0.0.0
- Example ONTAP event log entries
Sat May 08 22:20:24 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
Sat May 08 22:20:27 +0100 [node-1: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
Sat May 08 22:20:34 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Status of fans is unknown for 90 seconds. Shutting down now.
Mon May 24 10:07:52 GMT [nvram.hw.initWarn:WARNING]: NVRAM hardware initialization: Failed to get Battery FRU info.
May 24 10:11:19 [node-1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 24 10:13:19 [node-1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
Feb 20 09:53:59 [cluster1-02:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (SP IPMI Dead)
sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
- Example SP and BMC logs
Record 718: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: Waiting for SP ...
Record 719: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: IPMI:Read midplane FRU common header:device busy. Retrying
Record 720: Sun Jan 01 00:00:33.660000 2017 [BMC.notice]: Running primary version 11.4
Record 807: Thu Jan 01 00:00:36.931067 1970 [Agent.notice]: 000.267: 152 : Midplane I2C Local Buffers Not Ready Internal MLER[6] de-asserted
Record 797: Mon Oct 17 08:52:11.001689 2016 [Agent.notice]: 919.800: 148 : Midplane Local Grant Timeout Internal MLER[2] asserted
Record 1287: Tue Apr 14 14:34:05.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout - retrying
Record 1288: Tue Apr 14 14:34:10.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout
Record 1289: Tue Apr 14 14:34:13.000000 2020 [SysFW.notice]: Failed to recover SP
Record 1290: Tue Apr 14 14:34:13.000000 2020 [SysFW.critical]: IPMI:Read midplane FRU common header:failed
Record 1291: Sun Jan 01 00:02:58.340000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1292: Tue Apr 14 14:34:14.000000 2020 [SysFW.critical]: IPMI PCI Slot Control failed.
Record 1293: Sun Jan 01 00:02:59.310000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1296: Tue Apr 14 14:34:20.000000 2020 [Boot Loader.critical]: Abort Autoboot due to BIOS POST failure.
Record 1297: Tue Apr 14 14:34:20.280000 2020 [Trap Event.critical]: hwassist post_error (26)
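
The symptoms above share a small set of recurring message strings, so a collected console capture or log export can be triaged with a simple text scan. The following is a minimal sketch, not a NetApp tool: the patterns are copied from the example output in this article, while the `SIGNATURES` table, its group labels, and the `scan` function are hypothetical names introduced here for illustration.

```python
import re
from collections import Counter

# Patterns are taken from the example messages in this article.
# The group labels (dictionary keys) are illustrative assumptions,
# not official NetApp terminology.
SIGNATURES = {
    "heartbeat_stopped": re.compile(r"SP heartbeat stopped|SP HBT STOPPED"),
    "sp_recovery_failed": re.compile(r"Failed to recover SP"),
    "ipmi_failure": re.compile(
        r"IPMI[: ].*(failed|timeout)|SP IPMI (failure|Dead)"
    ),
    "emergency_shutdown": re.compile(r"monitor\.shutdown\.emergency"),
}


def scan(lines):
    """Count how many lines match each symptom group."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python scan_sp_logs.py < captured_console_or_log.txt
    for label, count in sorted(scan(sys.stdin).items()):
        print(f"{label}: {count}")
```

A non-zero count only indicates that the text resembles the examples listed here; it does not by itself confirm the cause named in the title.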