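CHW-1873: AFF A900, ASA A900, or FAS9500 reboots abnormally without a reboot string or error message
Issue
- An AFF A900, ASA A900, or FAS9500 system reboots abnormally without logging an error message.
- The HA partner node initiates a takeover and reports the following events: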
[Cluster-01: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
[Cluster-01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
[Cluster-01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
[Cluster-01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
- The BMC system event log reports the following:
Record 2258: . [IPMI.notice]: 06b3 | 02 | EVT: ef03ffff | PCM_Status | Deassertion Event, "Power Good"
Record 2259: . [IPMI.notice]: 06b4 | 02 | EVT: 0150aeff | Bat_Curr | Assertion Event, "Lower Non-critical going low " | Reading: -4.100 | Threshold: -0.050
Record 2260: . [IPMI.notice]: 06b5 | 02 | EVT: 0152aefe | Bat_Curr | Assertion Event, "Lower Critical going low " | Reading: -4.100 | Threshold: -0.100
.
Record 2266: . [IPMI.notice]: 06bb | 02 | EVT: 6f02ffff | PCM_Status | Assertion Event, "Fault"
Record 2268: . [IPMI.notice]: 06bd | 02 | EVT: 815200fe | Bat_Curr | Deassertion Event, "Lower Critical going low " | Reading: 0.000 | Threshold: -0.100
Record 2269: . [IPMI.notice]: 06be | 02 | EVT: 815000ff | Bat_Curr | Deassertion Event, "Lower Non-critical going low " | Reading: 0.000 | Threshold: -0.050
.
Record 2275: . [BMC.critical]: Filer Reboots <<<<<< System reboot captured by BMC
- The output of bmc status -d shows the following:
Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(n ): 171(0xc0ab) : CPU Catastrophic Error asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(o): 171(0x90ab) : CPU Catastrophic Error de-asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(n ): 17(0xc011) : PCH Platform reset asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(s): 22(0xe016) : LPC Bus reset asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(s): 23(0xe017) : TPM Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 24(0xe018) : NIC0 Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 25(0xe019) : NIC1 Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 27(0xe01b) : NVME reset asserted
- This issue is caused by a Catastrophic Error condition observed on one of the CPUs.
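
When reviewing a saved copy of the bmc status -d output, the Catastrophic Error signature can be searched for mechanically. The following is a minimal Python sketch, not a NetApp tool. The function names parse_events and has_catastrophic_error are hypothetical, and the regular expression is inferred only from the sample eventfifod lines in this article, so it may need adjusting for other output.

```python
import re

# Assumed line shape, inferred from the sample output in this article:
#   ... eventfifod <timestamp>(<flag>): <id>(<hex code>) : <event text>
EVENT_RE = re.compile(
    r"eventfifod\s+[\d.]+\(\w\s*\):\s*"
    r"(?P<id>\d+)\((?P<code>0x[0-9a-fA-F]+)\)\s*:\s*(?P<text>.+)"
)


def parse_events(lines):
    """Return (event_id, hex_code, text) for every eventfifod line found."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append(
                (int(match.group("id")), match.group("code"), match.group("text").strip())
            )
    return events


def has_catastrophic_error(lines):
    """True if any line reports 'CPU Catastrophic Error asserted'."""
    # Exact comparison so the matching 'de-asserted' event is not counted.
    return any(
        text == "CPU Catastrophic Error asserted"
        for _, _, text in parse_events(lines)
    )


sample = [
    "Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(n ): 171(0xc0ab) : CPU Catastrophic Error asserted",
    "Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(o): 171(0x90ab) : CPU Catastrophic Error de-asserted",
    "Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(n ): 17(0xc011) : PCH Platform reset asserted",
]
found = has_catastrophic_error(sample)
```

A match only indicates that the log contains the assertion event shown above. It does not replace reviewing the BMC system event log and the partner node's takeover events together.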