AFF A400, FAS8300, and FAS8700 reboot due to an L2 watchdog reset
Environment
- ONTAP 9
- AFF A400
- FAS8300
- FAS8700
Issue
- A node reboots unexpectedly because of an L2 watchdog reset.
- The surviving partner node logs ONTAP Event Management System (EMS) errors:
NOTICE cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(node-01), system_down because reset_via_sp.
NOTICE cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(node-01), system_down because l2_watchdog_reset.
or
[node-1: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name-2), system_down because power_off_via_sp.
- The affected node logs an ONTAP panic message:
[node-2: send_boot_msg_thread: mgr.stack.string:notice]: Panic string: watchdog nmi on cpu 8, hang cpu is 1 in process idle: cpu8 on release...
- The BMC log reports NMI errors:
BMC> system log sel
df | 11/06/2021 | 01:58:24 | System Event #0xff | Timestamp Clock Sync | Asserted
e0 | 11/06/2021 | 02:12:53 | Watchdog 2 #0xb1 | Timer interrupt (NMI/SMS/OS) | Asserted
e1 | 11/06/2021 | 02:12:53 | Critical Interrupt #0xb0 | NMI/Diag Interrupt | Asserted
e2 | 11/06/2021 | 02:12:56 | Watchdog 2 #0xb1 | Hard reset (NMI/SMS/OS) | Asserted
e3 | 11/06/2021 | 02:12:56 | Power Unit #0xb2 | Power reset | Asserted | from channel 15
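
When working through a saved copy of these logs, the signature above can be checked with a short script. The following is a minimal sketch, not a NetApp tool: it assumes the SEL output is pipe-delimited exactly as shown and that the EMS text contains the phrase "system_down because <reason>". The names SelEntry, parse_sel_line, find_watchdog_events, and takeover_reason are hypothetical helpers introduced only for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class SelEntry(NamedTuple):
    """One row of 'system log sel' output (field names are assumptions)."""
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Split a pipe-delimited SEL row; return None for prompts or blank lines."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6:
        return None
    # Extra trailing fields (for example "from channel 15") are ignored.
    return SelEntry(*fields[:6])


def find_watchdog_events(lines: List[str]) -> List[SelEntry]:
    """Return rows from the 'Watchdog 2' sensor or whose event mentions NMI."""
    entries = [e for e in map(parse_sel_line, lines) if e is not None]
    return [e for e in entries
            if e.sensor.startswith("Watchdog 2") or "NMI" in e.event]


def takeover_reason(ems_line: str) -> Optional[str]:
    """Extract the reason token that follows 'system_down because'."""
    match = re.search(r"system_down because (\w+)", ems_line)
    return match.group(1) if match else None


# Sample input copied from the BMC log excerpt in this article.
SEL_SAMPLE = """
BMC> system log sel
df | 11/06/2021 | 01:58:24 | System Event #0xff | Timestamp Clock Sync | Asserted
e0 | 11/06/2021 | 02:12:53 | Watchdog 2 #0xb1 | Timer interrupt (NMI/SMS/OS) | Asserted
e1 | 11/06/2021 | 02:12:53 | Critical Interrupt #0xb0 | NMI/Diag Interrupt | Asserted
e2 | 11/06/2021 | 02:12:56 | Watchdog 2 #0xb1 | Hard reset (NMI/SMS/OS) | Asserted
e3 | 11/06/2021 | 02:12:56 | Power Unit #0xb2 | Power reset | Asserted | from channel 15
"""

if __name__ == "__main__":
    for entry in find_watchdog_events(SEL_SAMPLE.splitlines()):
        print(entry.record_id, entry.time, entry.event)
```

Run against the sample, the script lists records e0, e1, and e2, which are the watchdog timer interrupt, the NMI, and the hard reset. Applying takeover_reason to the partner-node EMS lines returns values such as l2_watchdog_reset or reset_via_sp.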