sp.heartbeat.stopped is caused by a Service Processor network overload

Category:
fas-systems
Specialty:
hw

Environment

  • FAS models
  • AFF models

Issue

The system may exhibit one or more of the following symptoms.

Example AutoSupport alerts
HA Group Notification (Health Monitor process cphm: CriticalFruMultiFaultAlert[XXXXXXXXXXXX]) ALERT
HA Group Notification (SP HBT STOPPED) ALERT
HA Group Notification (CONTROLLER TAKEOVER COMPLETE HALT) NOTICE
Example console output
Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed

IPMI:Read midplane FRU common header:timeout
Failed to recover SP
IPMI:Read midplane FRU common header:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
LOADER-A>

IPMI:Read midplane FRU 0 product info:timeout
IPMI:Read midplane FRU 0 product info:failed
Waiting for SP ...

IPMI:Get midplane FRU 1 inventory:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get midplane FRU 1 inventory:failed

IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
Configuring Devices ...
IPMI PCI Slot Control failed.

Waiting for PIDS: /usr/sbin/ypbind 729.
Waiting for PIDS:  695.
Terminated
.
Uptime: 28d13h49m5s
System powering down...
System halting...
BIOS version: 9.3
Portions Copyright (c) 2011-2014 NetApp. All Rights Reserved
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI PCI Slot Control failed.
IPMI:Get controller FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT

Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from primary FW image
Waiting for SP ...

IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from backup FW image
Waiting for SP ...

IPMI:Enable PCI slots:timeout
Failed to recover SP
IPMI PCI Slot Control failed.

IPMI PCI Slot Configuration failed.

Configuring Devices ...
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
IPMI:Get NVRAM FRU inventory:failed

BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
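A captured console log could be checked for these symptoms with a short script. The following is a minimal sketch in Python, not a NetApp tool: the signature strings are copied from the console examples above, while the function name `find_sp_failure_lines` and the choice of signatures are assumptions made for illustration.

```python
# Signature strings copied verbatim from the console examples above.
# Which lines count as "signatures" is an assumption of this sketch.
SP_FAILURE_SIGNATURES = [
    "Failed to recover SP",
    "IPMI PCI Slot Control failed.",
    "BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT",
]


def find_sp_failure_lines(console_text):
    """Return (line_number, line) pairs for lines containing a signature.

    Hypothetical helper: line numbers are 1-based positions within the
    captured text, not anything reported by the system itself.
    """
    hits = []
    for number, line in enumerate(console_text.splitlines(), start=1):
        if any(signature in line for signature in SP_FAILURE_SIGNATURES):
            hits.append((number, line.strip()))
    return hits
```

Calling `find_sp_failure_lines(text)` on a saved console capture returns the matching lines in order, which makes it easier to see whether the SP failed to recover before the boot was aborted.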
Example ONTAP command-line output
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   rebooting   false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -

cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -

cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   unknown     false        3.0.2     -

cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   degraded    true         3.0.2     0.0.0.0

cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   offline     true         3.0.2     0.0.0.0
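When this output is collected from many clusters, the nodes whose SP is not online can be picked out programmatically. The sketch below is an illustration in Python under stated assumptions: it expects the six-column layout shown above, only recognizes rows whose Type column is `SP` (the only type that appears in these examples), and the function names `parse_sp_show` and `nodes_not_online` are hypothetical.

```python
def parse_sp_show(output):
    """Parse captured `system service-processor show` text into dicts.

    Assumes the six-column layout from the examples above. Header,
    separator, and prompt lines are skipped because they either have a
    different field count or do not have "SP" in the Type column.
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[1] == "SP":
            node, sp_type, status, configured, version, ip_address = fields
            rows.append({
                "node": node,
                "type": sp_type,
                "status": status,
                "ip_configured": configured == "true",
                "firmware": version,
                "ip_address": ip_address,
            })
    return rows


def nodes_not_online(output):
    """Return node names whose SP status is anything other than online."""
    return [row["node"] for row in parse_sp_show(output)
            if row["status"] != "online"]
```

For the last example above, `nodes_not_online` would report `cluster1-02`, whose status is `offline`.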
Example ONTAP event log entries
Sat May 08 22:20:24 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
Sat May 08 22:20:27 +0100 [node-1: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
Sat May 08 22:20:34 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Status of fans is unknown for 90 seconds. Shutting down now.

Mon May 24 10:07:52 GMT [nvram.hw.initWarn:WARNING]: NVRAM hardware initialization: Failed to get Battery FRU info.

May 24 10:11:19 [node-1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 24 10:13:19 [node-1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)

Feb 20 09:53:59 [cluster1-02:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (SP IPMI Dead)

sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
Example SP or BMC log entries
Record 718: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: Waiting for SP ...
Record 719: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: IPMI:Read midplane FRU common header:device busy. Retrying
Record 720: Sun Jan 01 00:00:33.660000 2017 [BMC.notice]: Running primary version 11.4

Record 807: Thu Jan 01 00:00:36.931067 1970 [Agent.notice]: 000.267: 152 : Midplane I2C Local Buffers Not Ready Internal MLER[6] de-asserted
Record 797: Mon Oct 17 08:52:11.001689 2016 [Agent.notice]: 919.800: 148 : Midplane Local Grant Timeout Internal MLER[2] asserted

Record 1287: Tue Apr 14 14:34:05.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout - retrying
Record 1288: Tue Apr 14 14:34:10.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout
Record 1289: Tue Apr 14 14:34:13.000000 2020 [SysFW.notice]: Failed to recover SP
Record 1290: Tue Apr 14 14:34:13.000000 2020 [SysFW.critical]: IPMI:Read midplane FRU common header:failed
Record 1291: Sun Jan 01 00:02:58.340000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1292: Tue Apr 14 14:34:14.000000 2020 [SysFW.critical]: IPMI PCI Slot Control failed.
Record 1293: Sun Jan 01 00:02:59.310000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1296: Tue Apr 14 14:34:20.000000 2020 [Boot Loader.critical]: Abort Autoboot due to BIOS POST failure.
Record 1297: Tue Apr 14 14:34:20.280000 2020 [Trap Event.critical]: hwassist post_error (26)
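The record lines above follow a regular shape (record number, timestamp, `[source.severity]`, message), so they can be split into fields for filtering. The following Python sketch assumes exactly that shape as seen in these examples; the `SpLogRecord` type and `parse_sp_log` function are hypothetical names, and lines that do not match are silently skipped.

```python
import re
from datetime import datetime
from typing import NamedTuple


class SpLogRecord(NamedTuple):
    number: int
    timestamp: datetime
    source: str
    severity: str
    message: str


# Pattern inferred from the example records above; it is an assumption,
# not a documented log format.
RECORD_RE = re.compile(
    r"^Record (?P<number>\d+): "
    r"(?P<timestamp>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{1,6} \d{4}) "
    r"\[(?P<source>[^\]]+)\.(?P<severity>\w+)\]: "
    r"(?P<message>.*)$"
)


def parse_sp_log(text):
    """Parse SP/BMC log text into SpLogRecord tuples, skipping other lines."""
    records = []
    for line in text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match is None:
            continue
        records.append(SpLogRecord(
            number=int(match["number"]),
            timestamp=datetime.strptime(match["timestamp"],
                                        "%a %b %d %H:%M:%S.%f %Y"),
            source=match["source"],
            severity=match["severity"],
            message=match["message"],
        ))
    return records
```

Filtering the result with `[r for r in parse_sp_log(text) if r.severity == "critical"]` keeps only the critical records. Sorting by `number` rather than `timestamp` is the safer choice here, because the examples show records from different sources carrying different years.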
