"DISK REDUNDANCY FAILED" alert caused by a transceiver issue

Visibility: Internal

Category: metrocluster

Specialty: metrocluster

Environment

MetroCluster IP
Cisco backend switches

Issue

1. The following ASUP alert is triggered:

HA Group Notification (DISK REDUNDANCY FAILED) ERROR

2. Error messages:

Tue Sep 03 04:51:31 +0200 [ClusterA-02: wafl_exempt09: mirror.stream.qp.error:debug]: params: {'mirror': 'DR PARTNER', 'qp_name': 'WAFL', 'error': 'NVMM_ERR_MIRROR_POLL_TIMEOUT'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: wafl_exempt09: nvmm.mirror.aborting:debug]: mirror of sysid 2, partner_type DR PARTNER and mirror state NVMM_MIRROR_ONLINE is aborted because of reason NVMM_ERR_MIRROR_POLL_TIMEOUT.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: mirror.stream.qp.error:debug]: params: {'mirror': 'DR PARTNER', 'qp_name': 'WAFL', 'error': 'NVMM_ERR_MIRROR_COMPLETION'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: ems.engine.suppressed:debug]: Event 'rdma.rlib.event.error' suppressed 11 times in last 263 seconds.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: rdma.rlib.event.error:debug]: QP wafl event error: client disconnect.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'DR_PARTNER'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: DR_heartbeat_thread: cf.ic.xferTimedOut:error]: HA interconnect: MCC_DRSOM transfer timed out.

A subsequent retry then succeeds, for example:

Tue Sep 03 04:51:32 +0200 [ClusterA-02: iw_cm_wq: rdma.rlib.connected:debug]: wafl:DR:A QP is now connected.
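When reviewing a longer EMS log, it can help to measure how long the DR mirror connection was down. The following is a minimal Python sketch, not a NetApp tool. The line layout in `EMS_RE` is inferred only from the sample messages in this article, and the function names (`parse_ems`, `reconnect_gap_seconds`) are hypothetical. It assumes both events fall on the same day, and it treats `rdma.rlib.connected` as the end of the gap, which does not by itself prove that the mirror is back online.

```python
import re

# Assumed layout, inferred only from the sample messages in this article:
# "<Day Mon DD HH:MM:SS +ZZZZ> [<node>: <thread>: <event>:<severity>]: <text>"
EMS_RE = re.compile(
    r"^\w{3} \w{3} \d{2} (?P<time>\d{2}:\d{2}:\d{2}) [+-]\d{4} "
    r"\[(?P<node>[^:\]]+): (?P<thread>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<text>.*)$"
)


def parse_ems(line):
    """Return the fields of one EMS line as a dict, or None if it does not match."""
    m = EMS_RE.match(line.strip())
    return m.groupdict() if m else None


def seconds_of_day(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def reconnect_gap_seconds(lines):
    """Seconds between the first nvmm.mirror.offlined and the next rdma.rlib.connected.

    Assumes both events occur on the same day. Returns None if the pair is not found.
    """
    offline_at = None
    for line in lines:
        ev = parse_ems(line)
        if ev is None:
            continue
        if ev["event"] == "nvmm.mirror.offlined" and offline_at is None:
            offline_at = seconds_of_day(ev["time"])
        elif ev["event"] == "rdma.rlib.connected" and offline_at is not None:
            return seconds_of_day(ev["time"]) - offline_at
    return None


# Sample lines copied from the messages shown in this article.
SAMPLE = [
    "Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'DR_PARTNER'}",
    "Tue Sep 03 04:51:31 +0200 [ClusterA-02: DR_heartbeat_thread: cf.ic.xferTimedOut:error]: HA interconnect: MCC_DRSOM transfer timed out.",
    "Tue Sep 03 04:51:32 +0200 [ClusterA-02: iw_cm_wq: rdma.rlib.connected:debug]: wafl:DR:A QP is now connected.",
]

if __name__ == "__main__":
    print(reconnect_gap_seconds(SAMPLE))
```

Short, frequently repeated gaps of this kind fit an intermittent link problem better than a sustained outage.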

3. Error messages for many different disks (all of them remote disks), interleaved with successful retries:

Tue Sep 03 04:51:34 +0200 [ClusterA-02: doneq0: scsi.mcc.adt.ioTransportError:error]: mcc_adt[2] - Transport error during execution of command: HA status 0x13: CAM transport status 0x1b: cdb 0x28:356b73b3:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: doneq0: scsi.mcc.adt.ioTransportError:error]: mcc_adt[2] - Transport error during execution of command: HA status 0x13: CAM transport status 0x1b : cdb 0x28:356b6555:000d.
...
Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b73b3:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b6555:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0v.i1.0L17: request successful after retry #1/#0: cdb 0x28:356b73b3:000d (1967)
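To check whether the aborted commands were later retried successfully, the messages can be paired up. The sketch below is one possible approach in Python. It matches `scsi.cmd.abortedByHost` and `scsi.cmd.retrySuccess` messages by their `cdb` string, which is an assumption made for illustration: the same cdb could recur in a longer log, so treat the result as a hint rather than proof. The function name `retry_status_by_cdb` is hypothetical.

```python
import re

# Matches the two SCSI message shapes shown in this article and captures the
# event name, the disk device and the cdb string. Pattern inferred from the samples.
SCSI_RE = re.compile(
    r"(?P<event>scsi\.cmd\.\w+):\w+\]: Disk device (?P<disk>\S+): "
    r".*?cdb (?P<cdb>0x[0-9a-f]+(?::[0-9a-f]+)*)"
)


def retry_status_by_cdb(log_text):
    """Map each aborted cdb to True if a later retrySuccess with the same cdb is seen."""
    status = {}
    for m in SCSI_RE.finditer(log_text):
        cdb = m["cdb"]
        if m["event"] == "scsi.cmd.abortedByHost":
            # Record the abort; keep an earlier True if this cdb was already retried.
            status.setdefault(cdb, False)
        elif m["event"] == "scsi.cmd.retrySuccess" and cdb in status:
            status[cdb] = True
    return status


# Sample messages copied from this article.
SAMPLE = "\n".join([
    "Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b73b3:000d.",
    "Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b6555:000d.",
    "Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0v.i1.0L17: request successful after retry #1/#0: cdb 0x28:356b73b3:000d (1967)",
])

if __name__ == "__main__":
    for cdb, retried in retry_status_by_cdb(SAMPLE).items():
        print(cdb, "retried OK" if retried else "no retry success seen")
```

In the sample, the second cdb shows no retry success only because the excerpt is truncated. Run the check against the complete log before drawing conclusions.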

4. The switch logs show the ports flapping with "Transceiver Absent":

2025 Jan 30 11:54:46 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_ADMIN_DOWN: Interface Ethernet1/21/4 is down (Administratively down)
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/1, hardware type changed to No-Transceiver
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/2, hardware type changed to No-Transceiver
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/3, hardware type changed to No-Transceiver
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/4, hardware type changed to No-Transceiver
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/21/2 is down (Transceiver Absent)
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/21/3 is down (Transceiver Absent)
2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/21/4 is down (Transceiver Absent)
2025 Jan 30 11:58:37 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/1, hardware type changed to auto
2025 Jan 30 11:58:37 mbh-metro-sw-01 %ETHPORT-3-IF_UNSUPPORTED_TRANSCEIVER: Transceiver on interface Ethernet1/21/1 is not supported
2025 Jan 30 11:58:37 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/2, hardware type changed to auto
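In a large switch log, the transceiver-related events can be grouped per interface to see which ports are affected. The following Python sketch is an illustration only. The regular expression is derived solely from the ETHPORT messages shown above, and the tag names (`no-transceiver`, `absent`, `auto`, `unsupported`) and function names are hypothetical labels, not Cisco terminology.

```python
import re
from collections import defaultdict

# Captures the mnemonic, the interface name and the remainder of each ETHPORT
# message. The remainder runs up to the next '%' so that the pattern also works
# on logs where several messages were pasted onto one line.
ETHPORT_RE = re.compile(
    r"%ETHPORT-\d-(?P<mnemonic>[A-Z_]+): "
    r"(?:Interface|Transceiver on interface) (?P<ifname>Ethernet[\d/]+)"
    r"(?P<rest>[^%]*)"
)


def classify(mnemonic, rest):
    """Return a short tag for transceiver-related messages, or None for others."""
    if mnemonic == "IF_UNSUPPORTED_TRANSCEIVER":
        return "unsupported"
    if "Transceiver Absent" in rest:
        return "absent"
    if "No-Transceiver" in rest:
        return "no-transceiver"
    if mnemonic == "IF_HARDWARE" and "changed to auto" in rest:
        return "auto"
    return None


def transceiver_events(log_text):
    """Map each interface to the ordered list of transceiver-related tags seen for it."""
    events = defaultdict(list)
    for m in ETHPORT_RE.finditer(log_text):
        tag = classify(m["mnemonic"], m["rest"])
        if tag is not None:
            events[m["ifname"]].append(tag)
    return dict(events)


# Sample messages copied from this article.
SAMPLE = "\n".join([
    "2025 Jan 30 11:54:46 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_ADMIN_DOWN: Interface Ethernet1/21/4 is down (Administratively down)",
    "2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/1, hardware type changed to No-Transceiver",
    "2025 Jan 30 11:58:33 mbh-metro-sw-01 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/21/2 is down (Transceiver Absent)",
    "2025 Jan 30 11:58:37 mbh-metro-sw-01 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/21/1, hardware type changed to auto",
    "2025 Jan 30 11:58:37 mbh-metro-sw-01 %ETHPORT-3-IF_UNSUPPORTED_TRANSCEIVER: Transceiver on interface Ethernet1/21/1 is not supported",
])

if __name__ == "__main__":
    for ifname, tags in transceiver_events(SAMPLE).items():
        print(ifname, "->", ", ".join(tags))
```

An interface whose tags alternate between `no-transceiver` or `absent` and `auto` is a candidate for the flapping described above. An `unsupported` tag points at the transceiver itself.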