Environmental Reason Shutdown and SP becomes unresponsive
Applies to
- AFF A300
- Service Processor (SP) firmware 5.11P2
Issue
- PSU1 in the Node1 chassis reported a critical error, then recovered a short time later.
EMS log:
[?] Fri May 16 12:42:00 +0000 [Node1: monitor: monitor.globalStatus.critical:EMERGENCY]: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:42:50 +0000 [Node1: spsm_listener: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 20 seconds.
[?] Fri May 16 12:43:14 +0000 [Node1: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'adapterName': '1', 'debug_string': 'Adapter debug dump is being collected'}
[?] Fri May 16 12:43:14 +0000 [Node1: pmcsas_asyncd_1: sas.adapter.debug:info]: params: {'adapterName': '0a', 'debug_string': 'Adapter debug dump is being collected'}
[?] Fri May 16 12:45:02 +0000 [Node1: spsm_listener: sp.heartbeat.resumed:info]: Received IPMI heartbeat from the Service Processor (SP).
[?] Fri May 16 12:46:11 +0000 [Node1: power_low_monitor: monitor.chassisPowerSupplies.ok:info]: Chassis power supplies OK.
[?] Fri May 16 12:47:00 +0000 [Node1: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
[?] Fri May 16 12:55:11 +0000 [Node1: env_mgr: monitor.chassisPowerSupply.degraded:notice]: Chassis power supply 1 is degraded: PSU1 Fan2 Fault is Unreadable
[?] Fri May 16 12:55:21 +0000 [Node1: power_low_monitor: monitor.chassisPower.degraded:alert]: Chassis power is degraded: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:55:21 +0000 [Node1: power_low_monitor: callhome.chassis.power:error]: Call home for CHASSIS POWER DEGRADED: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:56:33 +0000 [Node1: env_mgr: monitor.chassisPowerSupply.ok:info]: Chassis power supply 1 is OK.
[?] Fri May 16 12:56:41 +0000 [Node1: power_low_monitor: monitor.chassisPowerSupplies.ok:info]: Chassis power supplies OK.
[?] Fri May 16 12:57:00 +0000 [Node1: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
[?] Fri May 16 12:57:33 +0000 [Node1: env_mgr: callhome.chassis.ps.ok:notice]: Call home for CHASSIS POWER SUPPLY OK: PS 1
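To measure how long the SP heartbeat was lost in an EMS excerpt like the one above, the sp.heartbeat.stopped and sp.heartbeat.resumed events can be paired up. The following is a minimal Python sketch, not a NetApp tool. The names parse_ems_line and heartbeat_outages are hypothetical, the line layout is inferred only from the excerpt above, and the year is an assumption because EMS lines in this form do not carry one.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt above:
# [?] <dow> <mon> <day> <HH:MM:SS> <tz> [<node>: <source>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"^\[\?\]\s+\w{3}\s+(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"[+-]\d{4}\s+\[(?P<node>[^:]+):\s*(?P<source>[^:]+):\s*"
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)$"
)


def parse_ems_line(line, year=2025):
    """Return a dict for one EMS line, or None if the line does not match.

    The year is an assumption supplied by the caller; %b expects English
    month abbreviations (C/English locale).
    """
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(
        f"{year} {m['mon']} {m['day']} {m['time']}", "%Y %b %d %H:%M:%S"
    )
    return {
        "time": stamp,
        "node": m["node"],
        "event": m["event"],
        "severity": m["severity"],
        "message": m["message"],
    }


def heartbeat_outages(lines, year=2025):
    """Pair each sp.heartbeat.stopped with the next sp.heartbeat.resumed.

    Returns a list of (stopped_at, resumed_at, seconds) tuples.
    """
    outages = []
    stopped_at = None
    for line in lines:
        rec = parse_ems_line(line, year)
        if rec is None:
            continue
        if rec["event"] == "sp.heartbeat.stopped" and stopped_at is None:
            stopped_at = rec["time"]
        elif rec["event"] == "sp.heartbeat.resumed" and stopped_at is not None:
            gap = (rec["time"] - stopped_at).total_seconds()
            outages.append((stopped_at, rec["time"], gap))
            stopped_at = None
    return outages
```

Fed the two heartbeat lines from the excerpt (12:42:50 stopped, 12:45:02 resumed), the sketch reports a single outage of 132 seconds. A stop with no matching resume is left unpaired and not reported.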
- Some time later, Node1 performed an emergency shutdown for an environmental reason.
SP system log:
May 16 13:23:00 [Node1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 16 13:25:00 [Node1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
- The SP is unresponsive and cannot read system sensors:
SP Node1> system sensors
Sensor Name | Current | Unit | Status | LCR | LNC | UNC | UCR
-----------------+------------+------------+------------+-----------+-----------+-----------+-----------
Error: Unable to establish LAN session
Get Device ID command failed
Unable to open SDR for reading
- Multiple instances of SP load is high are seen in the events all output:
Record 339: Thu Jan 1 00:01:01 1970 [SP.notice]: Running primary version 5.11P2
Record 340: Thu Jan 1 00:01:17 1970 [SP.normal]: Heartbeat started
Record 341: Thu Jan 1 00:01:17 1970 [Heartbeat.notice]: Heartbeat start: Set SP time. Old time: Thu Jan 1 00:01:17 1970. New time: Fri May 16 13:22:23 2025.
Record 342: Fri May 16 13:22:23 2025 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jan 1 00:01:17 1970. New time: Fri May 16 13:22:23 2025.
Record 343: Fri May 16 13:23:19 2025 [SP.notice]: IPMI not ready & run /usr/local/bin/notify 4
Record 344: Fri May 16 13:25:40 2025 [ONTAP.notice]: Appliance user command reboot.
Record 345: Fri May 16 13:25:50 2025 [SP.critical]: Filer Reboots
Record 346: Fri May 16 13:25:55 2025 [SysFW.notice]: Waiting for SP ...
Record 347: Fri May 16 13:28:17 2025 [SP.notice]: Switch is running on latest version 16
Record 348: Fri May 16 13:31:16 2025 [IPMI.warning]: FRUID 1 Access error
Record 349: Fri May 16 13:31:42 2025 [SP.notice]: Failure on battery wake up attempt
Record 350: Fri May 16 13:36:09 2025 [SP.notice]: SP load is high: 3.12 3.06 2.02
Record 351: Fri May 16 13:36:29 2025 [SP.critical]: Heartbeat stopped
Record 352: Fri May 16 13:41:57 2025 [IPMI.warning]: FRUID 2 Access error
Record 353: Fri May 16 13:54:10 2025 [SP.notice]: SP load is high: 3.03 3.11 2.79
Record 354: Fri May 16 13:55:18 2025 [IPMI.warning]: FRUID 3 Access error
Record 355: Fri May 16 14:04:30 2025 [IPMI.warning]: FRUID 4 Access error
Record 356: Fri May 16 14:11:11 2025 [IPMI.warning]: FRUID 5 Access error
Record 357: Fri May 16 14:13:11 2025 [IPMI.warning]: PSU FRUID 6 Access error, retry 5 times
Record 358: Fri May 16 14:15:12 2025 [IPMI.warning]: PSU FRUID 7 Access error, retry 5 times
Record 359: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8400 | 02 | EVT: 0300ffff | Sensor 61 | Assertion Event, "State Deasserted"
Record 360: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8500 | 02 | EVT: 6fc203ff | Sensor 109 | Assertion Event, "Memory Init Done"
Record 361: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8600 | 02 | EVT: 0901ffff | Sensor 183 | Assertion Event, "Device Enabled"
Record 362: Fri May 16 14:25:10 2025 [SP.notice]: SP load is high: 3.14 2.96 2.72
Record 363: Mon May 19 09:00:45 2025 [SP CLI.notice]: cs_admi "log in from 192.168.180.10"
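When checking whether a saved events all capture matches this pattern, it can help to count the relevant records instead of reading them by eye. The following is a rough Python sketch under stated assumptions: summarize_sp_events is a hypothetical helper, the record layout is inferred only from the output above, and the three numbers after SP load is high are treated simply as three floats (they look like Linux-style 1-, 5- and 15-minute load averages, but that reading is an assumption).

```python
import re
from datetime import datetime

# Layout inferred from the output above:
# Record <n>: <dow> <mon> <day> <HH:MM:SS> <year> [<tag>]: <message>
RECORD = re.compile(
    r"^Record\s+(?P<num>\d+):\s+"
    r"(?P<stamp>\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+"
    r"\[(?P<tag>[^\]]+)\]:\s*(?P<message>.*)$"
)
LOAD = re.compile(r"SP load is high:\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)")


def summarize_sp_events(lines):
    """Count the symptoms listed in this article in SP 'events all' output."""
    summary = {"load_samples": [], "fruid_access_errors": 0, "heartbeat_stopped": 0}
    for line in lines:
        m = RECORD.match(line.strip())
        if m is None:
            # Continuation lines (for example SEL event details) are skipped.
            continue
        when = datetime.strptime(m["stamp"], "%a %b %d %H:%M:%S %Y")
        msg = m["message"]
        load = LOAD.search(msg)
        if load:
            values = tuple(float(x) for x in load.groups())
            summary["load_samples"].append((when, values))
        elif "FRUID" in msg and "Access error" in msg:
            summary["fruid_access_errors"] += 1
        elif msg.startswith("Heartbeat stopped"):
            summary["heartbeat_stopped"] += 1
    return summary
```

Several load samples combined with FRUID access errors and a Heartbeat stopped record would be consistent with the symptoms described in this article. The sketch only summarizes the log and does not diagnose the cause.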