Cluster network degraded: single cluster port link flapping on FAS2750
Applies to
- ONTAP 9
- FAS2750
- Switchless cluster
Issue
CLUSTER NETWORK DEGRADED
Errors are reported because the link on a single cluster port is flapping.
Tue May 13 02:04:26 +0000 [Node1B: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
Tue May 13 02:04:26 +0000 [Node1B: vifmgr: vifmgr.portdown:notice]: A link down event was received on node Node1B, port e0a.
Tue May 13 02:04:26 +0000 [Node1B: vifmgr: vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0a on node Node1B has gone down unexpectedly.
Tue May 13 02:04:26 +0000 [Node1B: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF nrtxsz04-02_clus1 (on virtual server 4294967293), IP address 169.254.208.163, is being moved to node Node1B, port e0b.
Tue May 13 02:04:28 +0000 [Node1B: kernel: netif.linkUp:info]: Ethernet e0a: Link up.
Tue May 13 02:04:28 +0000 [Node1B: vifmgr: vifmgr.portup:notice]: A link up event was received on node Node1B, port e0a.
Tue May 13 02:06:53 +0000 [Node1B: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
Tue May 13 02:06:53 +0000 [Node1B: vifmgr: vifmgr.portdown:notice]: A link down event was received on node Node1B, port e0a.
Tue May 13 02:06:53 +0000 [Node1B: vifmgr: vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0a on node Node1B has gone down unexpectedly.
Tue May 13 02:06:53 +0000 [Node1B: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF nrtxsz04-02_clus1 (on virtual server 4294967293), IP address 169.254.208.163, is being moved to node Node1B, port e0b.
Tue May 13 02:06:56 +0000 [Node1B: vifmgr: vifmgr.lifsuccessfullymoved:notice]: LIF nrtxsz04-02_clus1 (on virtual server 4294967293), IP address 169.254.208.163, is now hosted on node Node1B, port e0b.
Tue May 13 02:06:56 +0000 [Node1B: kernel: netif.linkUp:info]: Ethernet e0a: Link up.
Tue May 13 02:06:56 +0000 [Node1B: vifmgr: vifmgr.portup:notice]: A link up event was received on node Node1B, port e0a.
Tue May 13 02:06:56 +0000 [Node1B: vifmgr: vifmgr.port.monitor.failed:error]: The "link_flapping" health check for port e0a (node Node1B) has failed. The port is operating in a degraded state.
Tue May 13 02:06:56 +0000 [Node1B: vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: Frequent Link Flapping - Cluster port e0a on node Node1B has experienced multiple link down notifications.
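The flapping pattern in the EMS excerpt above can be confirmed by counting `netif.linkDown` events per node and port. The following is a minimal sketch, not a NetApp tool: the regular expression assumes the line layout shown in the excerpt, and `count_link_down` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed EMS line layout, taken from the excerpt above:
# "<timestamp> [<node>: kernel: netif.linkDown:info]: Ethernet <port>: Link down, ..."
LINK_DOWN = re.compile(
    r"\[(?P<node>[^:\]]+): kernel: netif\.linkDown:info\]: "
    r"Ethernet (?P<port>\w+): Link down"
)


def count_link_down(lines):
    """Count netif.linkDown events per (node, port) pair."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[(match.group("node"), match.group("port"))] += 1
    return counts


sample = [
    "Tue May 13 02:04:26 +0000 [Node1B: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.",
    "Tue May 13 02:04:28 +0000 [Node1B: kernel: netif.linkUp:info]: Ethernet e0a: Link up.",
    "Tue May 13 02:06:53 +0000 [Node1B: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.",
]

for (node, port), n in count_link_down(sample).items():
    print(f"{node} {port}: {n} link-down events")
```

Several link-down events on the same port within a few minutes, as seen here, are consistent with the "link_flapping" health check failure reported in the log.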
- CRC errors are found:
-- interface e0a (17 days, 23 hours, 36 minutes, 28 seconds) --
RECEIVE
Total frames: 1450m | Frames/second: 0 | Total bytes: 412g
Bytes/second: 0 | Total errors: 260k | Errors/minute: 0
Total discards: 0 | Discards/minute: 0 | Multi/broadcast: 130k
Non-primary u/c: 0 | CRC errors: 259k | Runt frames: 0
Fragment: 0 | Long frames: 0 | Jabber: 420
Length errors: 914 | No buffer: 0 | Xon: 0
Xoff: 0 | Jumbo: 5550k | Noproto: 0
Error symbol: 26258 | Illegal symbol: 22600 | Bus overruns: 0
Queue drops: 0 | LRO segments: 1446m | LRO bytes: 405g
LRO6 segments: 0 | LRO6 bytes: 0 | Bad UDP cksum: 0
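To put the receive counters above in proportion, the CRC error count can be compared against the total received frames. This is a rough sketch only: it assumes the `k`, `m`, and `g` suffixes in the output are decimal multipliers (this is not confirmed by the source), and `parse_count` and `crc_ratio` are hypothetical helper names.

```python
# Assumption: suffixes are decimal (k = 10^3, m = 10^6, g = 10^9).
SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(text):
    """Convert a counter such as '259k' or '1450m' to an integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIX:
        return int(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def crc_ratio(total_frames, crc_errors):
    """Return CRC errors as a fraction of received frames."""
    frames = parse_count(total_frames)
    return parse_count(crc_errors) / frames if frames else 0.0


# Values taken from the RECEIVE counters for e0a shown above.
ratio = crc_ratio("1450m", "259k")
print(f"CRC errors per received frame: {ratio:.6%}")
```

Any nonzero CRC count on a directly cabled cluster port is worth investigating; the ratio only helps compare ports or track change over time.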
- From the VIFMGR logs: port e0a went down and the RDB unit went offline.
- The node reported OOQ (Out of Quorum) and VIFMGR went offline.
Tue May 13 2025 00:40:38 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [EventMgr::executeLegacyEvent] Periodic Cluster Network Verification: ping-cluster and MTU verifications successful
Tue May 13 2025 01:42:35 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [EventMgr::executeLegacyEvent] Periodic Cluster Network Verification: ping-cluster and MTU verifications successful
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::setPortHealthUnknown] Port 9 health status is currently Unknown
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::linkDown] Cluster Link Down: Port 9 has gone down unexpectedly
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::linkDown] cluster port (e0a) is now link down, dispatching switchless update
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [Net::RdbLifHandle::avoidDownPorts] LIF lif:rdb:4294967293:nrtxsz04-02_clus1 (1011) is assigned to a down port (nrtxsz04p01b:e0a). Attempting to reassign.
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] W [src/rdb/TM.cc 5329 (0x80c3ff100)]: handleCoordBeginTranStatus: beginTran status UNIT_OFFLINE, txn request epoch 64
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [May 13 02:04:26]: 0x80c3ff100: 0: ERR: rdb_tran_glue: create: tid=0x80c3ff100 failed to create transaction for label='Net::RdbLifHandle::commitConfig': UNIT_OFFLINE
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [Net::TransactionHelper::create] Failed to create transaction Net::RdbLifHandle::commitConfig: Node "nrtxsz04p01b" on ring "VifMgr" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact technical support.
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [LinkFlappingHealthMonitor::linkDown] linkdown event received on port 9 at time 1461140s, ignore flap status: false
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] W [src/rdb/TM.cc 5329 (0x80c3ff100)]: handleCoordBeginTranStatus: beginTran status UNIT_OFFLINE, txn request epoch 64
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [May 13 02:04:26]: 0x80c3ff100: 0: ERR: rdb_tran_glue: create: tid=0x80c3ff100 failed to create transaction for label='Net::RdbLifHandle::commitConfig': UNIT_OFFLINE
Tue May 13 2025 02:04:26 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [Net::TransactionHelper::create] Failed to create transaction Net::RdbLifHandle::commitConfig: Node "nrtxsz04p01b" on ring "VifMgr" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact technical support.
Tue May 13 2025 02:04:28 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::setPortHealthUnknown] Port 9 health status is currently Unknown
Tue May 13 2025 02:04:28 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::linkUp] cluster port (e0a) is now link up, dispatching switchless update
Tue May 13 2025 02:04:28 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [FailoverMgr::linkUp] clearing arp cache for cluster ports
Tue May 13 2025 02:04:28 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [LinkFlappingHealthMonitor::linkUp] linkup event received on port 9 at time 1461142s, ignore flap status: false
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] ******* OOQ QM mtrace dump END *********
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/TM.cc 1883 (0x80ec32f00)]: _changeRole: TM 1001: change role at epoch 0 to 0x40 recovery transaction number 0
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/TM.cc 1620 (0x80ec32f00)]: _triggerOnlineStatusCallback: TM 1001: Report UNIT_IS_OFFLINE (epoch 0, master 0).
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/TM.cc 1624 (0x80ec32f00)]: _triggerOnlineStatusCallback: FAILOVER rdb: Local unit VifMgr offline
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/HAM.cc 1183 (0x80ec32f00)]: reportLocalOffline: HAM: new goal HAM_GOAL_ACTIVATE
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/cluster_events.cc 88 (0x80ec32f00)]: Report: Cluster event: node-event, epoch 0, site 1001 [local node offline].
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] A [src/rdb/HAM.cc 1624 (0x80ec32800)]: _hamThreadFunc: HAM: daemon goal change from HAM_GOAL_NONE to HAM_GOAL_ACTIVATE or shutdown 0
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] Notice: online_status_callback: RDB unit is offline
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] [May 13 02:04:29]: 0x80ec33600: 0: INFO: RDB::callback::registrar: callback:src/rdb/rdb_online_registrar.cc:80 rdb_callbacks::ONLINE::BEGIN:: 1461143820
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] [May 13 02:04:29]: 0x80ec33600: 0: INFO: RDB::callback::registrar: callback:src/rdb/rdb_online_registrar.cc:105 rdb_callbacks::ONLINE::END:: 1461143820
Tue May 13 2025 02:04:29 +00:00 [kern_vifmgr:info:7661] [0x80c3fce00] [EventMgr::onlineCallback] received UNIT_IS_OFFLINE from RDB
Tue May 13 2025 02:04:30 +00:00 [kern_vifmgr:info:7661] [0x80c3fd500] [FailoverMgr::cluster_check] Starting cluster ping test...
Tue May 13 2025 02:04:31 +00:00 [kern_vifmgr:info:7661] [May 13 02:04:31]: 0x80c3ff100: 0: ERR: rdb_tran_glue: create: tid=0x80c3ff100 failed to create transaction for label='Net::RdbLifHandle::commitConfig': UNIT_OFFLINE
Tue May 13 2025 02:04:31 +00:00 [kern_vifmgr:info:7661] [0x80c3ff100] [Net::TransactionHelper::create] Failed to create transaction Net::RdbLifHandle::commitConfig: Node "nrtxsz04p01b" on ring "VifMgr" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact technical support.
Tue May 13 2025 02:04:32 +00:00 [kern_vifmgr:info:7661] [0x80c3fd500] [FailoverMgr::cluster_check] large pkt : 0% packet loss when pinging from nrtxsz04-02_clus1 ( 169.254.208.163 ) on nrtxsz04p01b -> nrtxsz04-02_clus2 ( 169.254.91.200 ) on nrtxsz04p01b
Tue May 13 2025 02:04:32 +00:00 [kern_vifmgr:info:7661] A [src/rdb/quorum/qm_states/ooq/FailedState.cc 51 (0x80ec32100)]: state2: WS_Failed -> WS_WaitingForVotes
- Reseating the cable does not resolve the issue.