BMC reboots frequently and multiple sensor errors occur
Environment
- FAS2750
- FAS2720
- AFF A220
- FAS2650
- FAS2620
- BMC firmware 11.6
- IOM12E firmware 2.20 or earlier
Issue
- EMS error alert:
Sun May 09 13:29:30 CEST [node_name: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed
- BMC event messages:
Record 1746: Sun May 09 11:42:16.460000 2021 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1747: Sun May 09 11:42:17.570000 2021 [ASUP.notice]: First notification email | (INVALID CHASSIS CONFIGURATION (Incompatible Partner PCM)) CRITICAL | Send failed
Record 1748: Sun Jan 01 00:00:22.270000 2017 [IPMI.notice]: 0019 | c0 | OEM: ffff70005100 | ManufId: 150300 | BMC Reset internally
- Multiple EMS errors are reported for various components; some of them are reported as "corrected" a few seconds later. Example:
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Sun May 09 12:27:10 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 11: not installed or failed. Current temperature: 41 C (105 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 24 C (75 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 11: normal status.
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 12: normal status.
Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.
Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.
Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 14:02:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.
- The node might panic because of the multiple fan failures.
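
When triaging a saved EMS log for this pattern, it can help to measure how quickly each fault clears. The following is a minimal Python sketch, not a NetApp tool. The line layout is inferred only from the sample messages above, the fault-to-clear mapping `CLEARED_BY` is a hypothetical table built from events seen in that sample, and the `year` argument is an assumption because EMS timestamps in this excerpt carry no year. Faults are matched to clearing events in first-in, first-out order per event type, which is a simplification.

```python
import re
from collections import deque
from datetime import datetime

# Layout inferred from the sample EMS lines in this article; not an official grammar.
EMS_LINE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \w+ "
    r"\[(?P<node>[^:]+): (?P<source>[^:]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: "
    r"(?P<message>.*)$"
)

# Hypothetical fault -> clearing-event table, limited to events seen in the sample.
CLEARED_BY = {
    "monitor.fru.info.unreadable": "monitor.fru.info.readable",
    "monitor.fan.failed": "monitor.fan.ok",
}


def parse_ems(line, year=2021):
    """Return (timestamp, event_name) for an EMS line, or None if it does not match."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    # The year is supplied by the caller; month names are assumed to be English.
    when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return when, m["event"]


def transient_faults(lines, year=2021):
    """List (fault_event, seconds_until_cleared) for faults that later clear."""
    open_faults = {}  # clearing event name -> queue of (fault name, start time)
    results = []
    for line in lines:
        parsed = parse_ems(line, year)
        if parsed is None:
            continue
        when, event = parsed
        if event in CLEARED_BY:
            open_faults.setdefault(CLEARED_BY[event], deque()).append((event, when))
        elif open_faults.get(event):
            # Pair the clearing event with the oldest still-open fault of that type.
            fault, started = open_faults[event].popleft()
            results.append((fault, (when - started).total_seconds()))
    return results
```

Passing the PSU and fan lines from the example above to `transient_faults` returns one entry per fault that was later cleared, with the elapsed time in seconds. Faults that never clear are left out of the result, so a persistent failure still needs to be checked separately.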