Chassis Fan FRU Failed: multiple fans reported as failed due to heavy traffic from intercluster LIFs

Category:: ontap-9

Specialty:: HW

Applies to

ONTAP 9
FAS26xx series
FAS27xx series

Issue

One node of an HA pair reports problems reading its temperature sensors, and the condition then clears on its own.

[node1: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 3 Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 2 Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Ambient Temp) is not readable.
[node1: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed
[node1: env_mgr: monitor.fan.ok:notice]: All fans are OK.
[node1: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
[node1: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
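
As a rough illustration of how an EMS excerpt like the one above could be triaged, the following Python sketch counts the event names and lists the temperature sensors reported as unreadable. The regular expressions and the function name `summarize_ems` are assumptions made for this example, derived only from the message layout shown above. This is not a NetApp tool.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: [node: source: event.name:severity]: message
EMS_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<source>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: "
)
# Message text of monitor.temp.unreadable as it appears in the excerpt
SENSOR_RE = re.compile(r"The controller temperature \((?P<sensor>[^)]+)\) is not readable")


def summarize_ems(text):
    """Return (event counts, unreadable sensor names) for an EMS log excerpt."""
    events = Counter(m.group("event") for m in EMS_RE.finditer(text))
    sensors = [m.group("sensor") for m in SENSOR_RE.finditer(text)]
    return events, sensors


# A few lines copied from the excerpt above
sample = (
    "[node1: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. "
    "Replace it to avoid overheating\n"
    "[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature "
    "(Module B Expander Temp) is not readable.\n"
    "[node1: env_mgr: monitor.temp.unreadable:error]: The controller temperature "
    "(Ambient Temp) is not readable.\n"
    "[node1: env_mgr: monitor.fan.ok:notice]: All fans are OK.\n"
)

events, sensors = summarize_ems(sample)
print(dict(events))
print(sensors)
```

Because the patterns are applied with `finditer` over the whole text, the sketch works whether the messages arrive one per line or run together on a single line.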

Certain temperature sensors become unavailable on the affected node, while the partner node reports them as normal.

cluster1::> system node run -node * -command environment status
2 entries were acted on.

Node: node1

Shelf temperatures by element:
  [10] 28 C (82 F) Normal temperature range
  [11] Unavailable
  [12] Unavailable
  [13] 29 C (84 F) Normal temperature range

Node: node2

Shelf temperatures by element:
  [10] 33 C (91 F) Normal temperature range
  [11] 48 C (118 F) Normal temperature range
  [12] 26 C (78 F) Normal temperature range
  [13] 34 C (93 F) Normal temperature range
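
To compare the two nodes quickly, the output above can be scanned for elements reported as Unavailable. The following is a minimal Python sketch, assuming the output layout shown above. The function name `unavailable_elements` and the patterns are hypothetical and written only for this illustration.

```python
import re

# Patterns assumed from the "environment status" output shown above
NODE_RE = re.compile(r"^Node:\s*(\S+)")
ELEMENT_RE = re.compile(r"\[(\d+)\]\s+(Unavailable|\d+ C)")


def unavailable_elements(output):
    """Map each node name to the shelf temperature elements reported as Unavailable."""
    result = {}
    node = None
    for line in output.splitlines():
        line = line.strip()
        m = NODE_RE.match(line)
        if m:
            node = m.group(1)
            result[node] = []
            continue
        if node is None:
            continue  # ignore anything before the first "Node:" header
        for index, value in ELEMENT_RE.findall(line):
            if value == "Unavailable":
                result[node].append(int(index))
    return result


# Output copied from the example above
sample = """Node: node1

Shelf temperatures by element: [10] 28 C (82 F) Normal temperature range [11] Unavailable [12] Unavailable [13] 29 C (84 F) Normal temperature range

Node: node2

Shelf temperatures by element: [10] 33 C (91 F) Normal temperature range [11] 48 C (118 F) Normal temperature range [12] 26 C (78 F) Normal temperature range [13] 34 C (93 F) Normal temperature range
"""

report = unavailable_elements(sample)
print(report)
```

A node with a non-empty list while its HA partner shows an empty one matches the pattern described in this article.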

While the issue is active, the FRU status is shown as degraded, but only from one node.

cluster1::> system controller fru show
Node               FRU Name                     Subsystem          Status
------------------ ---------------------------- ------------------ -----------
node1              PSU1 FRU                     Environment        degraded
node1              PSU2 FRU                     Environment        degraded
node2              PSU1 FRU                     Environment        ok
node2              PSU2 FRU                     Environment        ok

The BMC of the affected node reboots frequently, and the SP event log reports the following messages:

Record 1250: Sat Jan 28 10:04:27.990616 2023 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1251: Sun Jan 01 00:00:23.962645 2017 [IPMI.notice]: 00fd | c0 | OEM: ffff70005100 | ManufId: 150300 | BMC Reset Internally

Analysis of a packet trace taken on port e0M at the time of the issue shows heavy incoming traffic from intercluster logical interfaces (IC LIFs).