NFS アクセスと iSCSI アクセスがアラート nodewatchdog.svc.rpc.noresp で失敗しました
環境
- ONTAP 9.7
- 2 ノードスイッチレスクラスタ
問題
- NFS または iSCSI 経由で接続されているクライアントまたはホストは、ネットアップストレージとの接続を切断しました
- Unified Manager のアラート:
Node: Node-01
Time: Wed, Sep 23 18:24:46 2020 +0200
Severity: ALERT
Message: nodewatchdog.svc.rpc.noresp: The vifmgr service internal to Data ONTAP that is required for continuing data service was unavailable. The service failed, but was unsuccessfully restarted.
Description: This message occurs when a service critical to data access fails to respond to service monitoring and is restarted. Data ONTAP(R) might have experienced a serious error and might operate in a degraded mode.
Corrective Action: If the message reports that the service has "restarted" then no action is required. If the status is "not restarted" or "unsuccessfully restarted," then reboot the node using the "system node reboot" command with the "-dump true" option, and then contact NetApp technical support.
Source: nodewatchdog
Sequence#: 602147
- RDB ログには、次のような「明らかな枯渇」メッセージが頻繁に記録されています。
Vifmgr の例:
Wed Sep 23 2020 18:24:00 +02:00 [kern_vifmgr:info:6763] rdb::qm:Wed Sep 23 18:18:18 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b8d7100) QmState::doWorkerWork RDB QM main voter thread slept for 14303 ms (apparent starvation)
MGWD の例:
Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x82ddb7800)]: doWorkerWork: RDB QM main voter thread slept for 3711 ms (apparent starvation).
Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/cluster_events.cc 88 (0x82ddb7800)]: Report: Cluster event: node-event, epoch 36, site 1001 [apparent starvation in QM main voter thread: slept 3711 ms].
例 VLDB :
Wed Sep 23 2020 18:23:38 +02:00 [kern_vldb:info:6843] rdb::qm:Wed Sep 23 18:21:01 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80ab31900) QmState::doWorkerWork RDB QM main voter thread slept for 4175 ms (apparent starvation)
Wed Sep 23 2020 18:25:15 +02:00 [kern_vldb:info:6843] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80ab31900)]: doWorkerWork: RDB QM main voter thread slept for 2498 ms (apparent starvation).
Bcomd の例:
Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] rdb::qm:Wed Sep 23 18:22:25 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b757700) QmState::doWorkerWork RDB QM main voter thread slept for 7637 ms (apparent starvation)
Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80b757700)]: doWorkerWork: RDB QM main voter thread slept for 116272 ms (apparent starvation).
Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/cluster_events.cc 88 (0x80b757700)]: Report: Cluster event: node-event, epoch 18, site 1001 [apparent starvation in QM main voter thread: slept 116272 ms].
- MGWD ログには ' テーブル client_DEVICE_VIEW の SMF トレースダンプが頻繁に含まれています
Fri Sep 23 2020 18:15:47 +02:00 [kern_mgwd:info:60301] client_device_view.next_imp() returns: success (latency: 60.256860s) [-,switch-01(DCOV13007),10.10.10.10,SNMPv2c,true,-,cshm1!,OTHER,management-network,Cisco Nexus Operating System (NX-OS) Software, Version 14.1(2u),Invalid SNMP Settings,CDP/ISDP,-,false,Unknown,yes,no,OTHER] (memory: 434290 net, 445502 max, 5447768 allocated, 5013478 freed, 28040 allocations)
注 : これは SMF トレースダンプの最後の行ですトレースダンプ全体が大きすぎて、ここに含めることができません。
- ホスト OS ドメインから CPU 利用率の上昇が見られる。ただし、これは上位の貢献者ではない可能性があります。
netapp::*> node run -node netapp-01 sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+ ANY5+ ANY6+ ANY7+ ANY8+ ANY9+ ANY10+ ANY11+ ANY12+ ANY13+ ANY14+ ANY15+ ANY16+ AVG
100% 100% 100% 99% 98% 96% 94% 91% 86% 81% 76% 70% 64% 57% 48% 37% 81%
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
78% 76% 77% 83% 82% 83% 82% 82% 82% 82% 83% 84% 83% 82% 83% 82%
Nwk_Excl Nwk_Lg Nwk_Exmpt Protocol Cluster Storage Raid Raid_Ex Target Kahuna WAFL_Ex(Kahu)
3% 2% 450% 0% 0% 49% 2% 136% 0% 4% 511%( 94%)
WAFL_XClean SM_Exempt Cifs Exempt SSAN_Ex Intr Host Ops/s CP
0% 0% 0% 112% 0% 28% 201% 47111 0%
- ホスト OS の利用率が高いことから、 mgwd と vifmgr のプロセスが抜け出ています。
::*> systemshell -node <NodeName> top
::*> systemshell -node <NodeName> ps auxww | head
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 6379 root 107 40 0 194M 40272K uwait 1 117.2H 59.47% vifmgr 2169 root 311 67 0 658M 148M uwait 1 174.3H 8.64% mgwd