AFF-A400:予期せぬテイクオーバーが繰り返される
環境
- AFF-A400
- ONTAP 9
- スイッチレスクラスタ
問題
- AFF-A400 HAペアで、原因不明のコントローラ テイクオーバーが繰り返し発生
- クラスタ インターコネクトポート(e3a/e3b)とHAインターコネクトポート(e0a/e0b)で同時に発生したリンクダウン イベントが常にトリガーされていたため、ノード間のハートビートが失われて自動フェイルオーバーが発生していました
例
[?] Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
[?] Sat May 24 09:55:24 [node-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0a', 'eventType': 'DOWN'}
[?] Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'RAID', 'error': 'NVMM_ERR_POLL'}
[?] Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_ONLINE is aborted because of reason NVMM_ERR_POLL.
[?] Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'MISC', 'error': 'NVMM_ERR_POLL'}
[?] Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0b: Link down, check cable.
[?] Sat May 24 09:55:24 [node-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0b', 'eventType': 'DOWN'}
[?] Sat May 24 09:55:24 [node-01: nvmm_error: ems.engine.suppressed:debug]: Event 'rdma.rlib.event.error' suppressed 5 times in last 89403 seconds.
[?] Sat May 24 09:55:24 [node-01: nvmm_error: rdma.rlib.event.error:debug]: QP wafl event error: client disconnect.
[?] Sat May 24 09:55:24 [node-01: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA_PARTNER'}
[?] Sat May 24 09:55:24 [node-01: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20250524_09:55:24:830194.dmp.
[?] Sat May 24 09:55:25 [node-01: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node-01 by node-02 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
[?] Sat May 24 09:55:25 [node-01: cf_firmware: cf.fm.partnerFwTransition:info]: params: {'prevstate': 'SF_UP', 'newstate': 'SF_UNKNOWN', 'progresscounter': '0'}
[?] Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3a: Link down, check cable.
[?] Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.portdown:notice]: A link down event was received on node node-01, port e3a.
[?] Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.clus.linkdown:EMERGENCY]: The cluster port e3a on node node-01 has gone down unexpectedly.
[?] Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus1 (on virtual server xxxxxxxxxx), IP address 169.254.125.245, is being moved to node node-01, port e3b.
[?] Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
[?] Sat May 24 09:55:27 [node-01: nvmm_mirror_sync: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_LAYOUT_SYNCING is aborted because of reason NVPM_ERR_MSG_SEND_FAILED.
[?] Sat May 24 09:55:27 [node-01: nvmm_error: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_OFFLINE is aborted because of reason NVMM_ABORT_SYNCING_MIRROR.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0b063568234d2002) with Tag CTRAN: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0b063568234d2002) with Tag CTRAN: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=10063568233579a4) with Tag CTRAN_SAN: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=10063568233579a4) with Tag CTRAN_SAN: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0f063568233579a4) with Tag CTRAN_BULK: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
[?] Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0f063568233579a4) with Tag CTRAN_BULK: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
[?] Sat May 24 09:55:29 [node-01: kltp: clam.heartbeat.state.change:info]: Heartbeats to node (name=node-02, ID=1001) are Failing.
[?] Sat May 24 09:55:29 [node-01: cf_main: cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding
Sat May 24 09:55:29 [node-01: cf_main: fm_lastHeartbeatInfo_1:debug]: params: {'current_time': '5692845290', 'time_since_htbt_read_upd_seq_ic': '5000', 'time_since_htbt_read_upd_seq_mb': '4974', 'time_since_htbt_read_success_ic': '5000', 'time_since_htbt_read_success_mb': '4974', 'time_since_htbt_read_attempt_ic': '1000', 'time_since_htbt_read_attempt_mb': '4974', 'time_since_htbt_write_ic': '0', 'time_since_htbt_write_mb': '16', 'time_since_ic_upd_major_seq': '484080000', 'time_since_mb_upd_major_seq': '484077516', 'time_since_ic_upd_minor_seq': '0', 'time_since_mb_upd_minor_seq': '16', 'time_since_firmware_written': '0', 'time_since_firmware_read': '4980', 'time_since_firmware_rcvd': '5000', 'time_since_first_htbt_write_mb_drop': '5692767954', 'time_since_recent_htbt_write_mb_drop': '5692767954', 'mb_htbt_drop_count': '1', 'partner_minor_seq_num_ic': '7899824', 'partner_major_seq_num_ic': '1747558659', 'partner_minor_seq_num_mb': '7899823', 'partner_major_seq_num_mb': '1747558659'}
[?] Sat May 24 09:55:29 [node-01: cf_main: cf.fsm.takeoverCountdown:info]: Failover monitor: takeover scheduled in 10 seconds
[?] Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus1 (on virtual server xxxxxxxxxx), IP address 169.254.125.245, is being moved to node node-01, port e3a.
[?] Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus2 (on virtual server xxxxxxxxxx), IP address 169.254.211.94, is being moved to node node-01, port e3b.
[?] Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.cluscheck.notassigned:alert]: Cluster LIF node-01_clus2 (node node-01) is not assigned to any port.
[?] Sat May 24 09:55:29 [node-01: vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: Cluster LIF Not Assigned to Any Port - Cluster LIF node-01_clus2 (node node-01) is not assigned to any port
[?] Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.cluscheck.notassigned:alert]: Cluster LIF node-01_clus1 (node node-01) is not assigned to any port.
[?] Sat May 24 09:55:39 [node-01: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
[?] Sat May 24 09:55:39 [node-01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
[?] Sat May 24 09:55:39 [node-01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
[?] Sat May 24 09:55:39 [node-01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started