Repeated Link Resetting messages on CX6 NIC X91153A
Environment
- AFF-A900
- ONTAP 9
- CX6 PSID card
Issue
- Since June 30, 2024, Link Resetting messages have been logged repeatedly for the NIC in slot 2 of node node-01
sysconfig -a
slot 2: Dual 40G/100G/200G Ethernet Controller CX6
sysconfig -ac
sysconfig: slot 2 OK: X91153A: 2p 40G/100G RoCE QSFP28
EMS
(June 2024)
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) failed to generate a register dump with error = 17 : Link Resetting.
(2025...)
Thu Sep 25 20:00:55 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:08:50 +0900 [node-01: CCMA-Worker: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:11:05 +0900 [node-01: CCMA-Worker: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Thu Sep 25 20:15:27 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:17:42 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
- During an NDU upgrade from ONTAP 9.12.1P7 to 9.15.1P14, node node-01, which contains this unstable CX6 NIC, panicked
cluster::*> storage failover takeover -ofnode node-01
cluster::*> Files /cfcard/x86_64/freebsd/image1/VERSION and /var/VERSION differ
ERROR: /var cannot be downgraded.
Waiting for PIDS: 1392.
Terminated.
Setting default boot image to image1...done.
Uptime: 722d2h54m27s
PANIC : peg_nvmeof_qpair_flush_request: Failed to move RDMA qp (0xfffff804eac60c00) to error state: -60
version: 9.12.1P7: Fri Sep 15 02:00:51 EDT 2023
conf : x86_64.optimize
cpuid = 3
KDB: stack backtrace:
vpanic() at vpanic+0x429/frame 0xfffffe121d094210
panic() at panic+0x42/frame 0xfffffe121d094270
peg_nvmeof_qpair_flush_request() at peg_nvmeof_qpair_flush_request+0x74a/frame 0xfffffe121d094360
peg_nvmeof_ctrlr_fail_task() at peg_nvmeof_ctrlr_fail_task+0xa8/frame 0xfffffe121d094390
stack_zero() at stack_zero+0x137/frame 0xfffffe121d0943f0
taskqueue_thread_loop() at taskqueue_thread_loop+0x9b/frame 0xfffffe121d094430
fork_exit() at fork_exit+0xb2/frame 0xfffffe121d094470
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe121d094470
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 722d2h56m51s
PANIC: peg_nvmeof_qpair_flush_request: Failed to move RDMA qp (0xfffff804eac60c00) to error state: -60 in process peg nvmeof taskq_31 on release 9.12.1P7 (C) on Thu Sep 25 20:19:51 KST 2025
version: 9.12.1P7: Fri Sep 15 02:00:51 EDT 2023
- After the panic and reboot, the CX6 NIC on node node-01 no longer appears in the sysconfig -a output
Before NDU:
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 2: Dual 40G/100G/200G Ethernet Controller CX6
e2a MAC Address: xx:xx:xx:xx:xx:90 (auto-100g_cr4-fd-up)
e2b MAC Address: xx:xx:xx:xx:xx:91 (auto-100g_cr4-fd-up)
slot 3: Quad 10G/25G Ethernet Controller CX5
After NDU:
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 3: Quad 10G/25G Ethernet Controller CX5
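
The before/after comparison above can also be done mechanically when the sysconfig -a output is long. The following is a minimal sketch, not a NetApp tool: it assumes the output was saved as plain text with one "slot N: description" entry per line, and the names parse_slots and missing_slots are hypothetical helpers introduced only for illustration.

```python
import re

# Matches lines such as "slot 2: Dual 40G/100G/200G Ethernet Controller CX6".
SLOT_RE = re.compile(r"^slot (\d+): (.+)$")


def parse_slots(sysconfig_text):
    """Return {slot number: controller description} from sysconfig -a style text."""
    slots = {}
    for line in sysconfig_text.splitlines():
        match = SLOT_RE.match(line.strip())
        if match:
            slots[int(match.group(1))] = match.group(2)
    return slots


def missing_slots(before_text, after_text):
    """Return the slots present before the upgrade but absent afterwards."""
    before = parse_slots(before_text)
    after = parse_slots(after_text)
    return {num: desc for num, desc in before.items() if num not in after}


# Sample input copied from the outputs shown in this article.
BEFORE = """\
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 2: Dual 40G/100G/200G Ethernet Controller CX6
e2a MAC Address: xx:xx:xx:xx:xx:90 (auto-100g_cr4-fd-up)
e2b MAC Address: xx:xx:xx:xx:xx:91 (auto-100g_cr4-fd-up)
slot 3: Quad 10G/25G Ethernet Controller CX5
"""

AFTER = """\
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 3: Quad 10G/25G Ethernet Controller CX5
"""

if __name__ == "__main__":
    for num, desc in sorted(missing_slots(BEFORE, AFTER).items()):
        print(f"slot {num} missing after NDU: {desc}")
```

With the sample text from this article, the only slot reported as missing is slot 2, which matches the observation that the CX6 NIC is no longer recognized after the panic and reboot.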