LIFs are unexpectedly down on RoCE ports
Applies to
- ONTAP 9.13.1 and later
- NFS over RDMA / RoCE
- Mellanox / NVIDIA CX5 / CX6 / CX6-LX 10/25GbE or 40/100GbE NICs
Issue
- When more than 127 NFS data LIFs are already configured on a single RoCE-capable port, the following may be observed:
- A LIF failover or migration can leave the LIF operationally down without reporting an error
- LIF creation succeeds, but the LIF is operationally down and an error is logged in vifmgr.log
clustershell::> network interface create -vserver vs0 -lif vs0_test -service-policy default-data-files -address 10.75.140.127 -netmask 255.255.255.0 -home-node node-02 -home-port e4a

Info: LIF "vs0_test" on Vserver "vs0" was created successfully but could not be successfully configured on either its home port or any of its failover targets. The LIF's operational status will be reported as "down" until one or more failover targets becomes available. Use the "network interface show -vserver vs0 -lif vs0_test -failover" command to review the LIF's current failover configuration.
- Errors in vifmgr.log
Example:
(03/26/2024 16:41:03): > [Net::LifStackAdapter::installLif] vserverId=3, lifId=1278, address=10.95.86.122, portName=e3a, lifProtocols=0x1
(03/26/2024 16:41:03): > [SkStackMgr::addLif] PARAM lifId 1278, portName e3a, address 10.95.86.122, ipspaceId 4294967295, vserverId 3, lifUuid 98cd9a48-ea28-11ee-ad09-d039eaa9ecf3, isMccRequest false, lifProtocols 0x001, serviceMask 0x000000013D000804, homeNode perfqa-vino-03
(03/26/2024 16:41:03): > [NbladeWriter::addLif] PARAM: lifId: 1278, address 10.95.86.122, netmask 255.255.255.0, ipspaceId: 4294967295, vserverId: 3, portName: e3a, isMccRequest: false, protocolMask: 00000001, serviceMask: 0x000000013D000804, homeNode^I: perfqa-vino-03(ccdcca33-ea25-11ee-ad09-d039eaa9ecf3)
(03/26/2024 16:41:03): > [NbladeWriter::nitroPcpRpcCall] procNum=3, isIdemp=false
(03/26/2024 16:41:03): > [DelayTracker::add_sample] ENTRY: object=nblade, delay_ms=53
(03/26/2024 16:41:03): < [DelayTracker::add_sample] EXIT: object=nblade, state=NORMAL
(03/26/2024 16:41:03): < [NbladeWriter::nitroPcpRpcCall] elapsed time: 0s)
(03/26/2024 16:41:03): [NbladeWriter::ScopedNitroRequest::sendRequest] RPC for procedure 3 completed, but returned error: NbladeWriter Error type unknown: 12046
(03/26/2024 16:41:03): < [NbladeWriter::ScopedNitroRequest::sendRequest]
(03/26/2024 16:41:03): < [NbladeWriter::addLif] retval: NbladeWriter Error type unknown: 12046
(03/26/2024 16:41:03): [SkStackMgr::addLif] Unexpected error adding the LIF to the stack: NbladeWriter Error type unknown: 12046
(03/26/2024 16:41:03): < [SkStackMgr::addLif] complete, returning Unexpected error "NbladeWriter Error type unknown: 12046" encountered as a result of adding the LIF.
(03/26/2024 16:41:03): [Net::LifStackAdapter::installLif] Failed to add the requested LIF: Unexpected error "NbladeWriter Error type unknown: 12046" encountered as a result of adding the LIF.
(03/26/2024 16:41:03): [Net::AbortableHandle::commit] Caught an unexpected exception: Unexpected error "NbladeWriter Error type unknown: 12046" encountered as a result of adding the LIF.
(03/26/2024 16:41:03): ERR{ commit() at src/framework/objects/base/AbortableHandle.cc:65 }
- The port is on a NIC with RoCE offload capability (for example, Mellanox / NVIDIA CX5/CX6/CX6-LX)
Example:
::> network port show -node node-02 -fields rdma-protocols
node     port rdma-protocols
-------- ---- --------------
node-02  e0M  -
node-02  e1a  roce
node-02  e1b  roce
node-02  e3a  roce
node-02  e3b  roce
node-02  e3c  roce
node-02  e3d  roce
7 entries were displayed.
- RDMA is enabled on the NFS server (the default in ONTAP 9.10.1 and later)
Note: To verify, run vserver nfs show -fields rdma
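When many LIFs are affected, it may help to pull the failed LIFs out of a collected vifmgr.log instead of reading it by eye. The following is a minimal Python sketch, not a NetApp tool. It assumes the log entries look like the excerpt above: an installLif entry carrying vserverId, lifId, address, and portName, followed by a "Failed to add the requested LIF" entry carrying the NbladeWriter error code. The function names find_failed_lifs and failures_per_port are hypothetical, and the pattern may need adjusting for other ONTAP releases.

```python
import re
from collections import Counter

# Matches the two installLif entries seen in the sample log: the request
# (with LIF details) and the later failure (with the NbladeWriter error code).
# The format is assumed from the excerpt in this article only.
EVENT_RE = re.compile(
    r"\[Net::LifStackAdapter::installLif\] (?:"
    r"vserverId=(?P<vserver>\d+), lifId=(?P<lif>\d+), "
    r"address=(?P<addr>[0-9A-Fa-f.:]+), portName=(?P<port>\w+)"
    r"|Failed to add the requested LIF: .*?"
    r"NbladeWriter Error type unknown: (?P<code>\d+))"
)


def find_failed_lifs(log_text, error_code="12046"):
    """Return details of LIFs whose installLif request failed with error_code."""
    failed = []
    pending = None  # most recent installLif request not yet resolved
    for match in EVENT_RE.finditer(log_text):
        if match.group("lif") is not None:
            # Request entry: remember the LIF details.
            pending = {
                "vserver_id": int(match.group("vserver")),
                "lif_id": int(match.group("lif")),
                "address": match.group("addr"),
                "port": match.group("port"),
            }
        elif pending is not None:
            # Failure entry: attribute it to the preceding request.
            if match.group("code") == error_code:
                failed.append(pending)
            pending = None
    return failed


def failures_per_port(failed):
    """Count failed LIF installs per port name."""
    return Counter(entry["port"] for entry in failed)


if __name__ == "__main__":
    import sys

    # Usage: python script.py <path to a saved copy of vifmgr.log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        failures = find_failed_lifs(handle.read())
    for port, count in sorted(failures_per_port(failures).items()):
        print(f"{port}: {count} failed LIF install(s)")
```

The sketch pairs each failure with the most recent installLif request, which holds for the sequential entries shown above but is an assumption if requests for several LIFs interleave in the log.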