ONTAP Select node reboots unexpectedly when CP and vNVRAM flush latency is high
Applies to
- NetApp ONTAP Select
- ONTAP 9
Issue
- The ONTAP Select node reboots unexpectedly with the panic string:
received completion for unknown cmd in process irqXXX: nvme0
- The referenced device is nvmeX, typically nvme0, which is configured without an NVMe backend
- ONTAP-side log sequence leading up to the panic:
Sat Jul 02 03:10:28 +0200 [node-01: ctlg_flxlg_mirror: vnvram.dma.long.wait:alert]: vNVRAM flush taking over 10 seconds.
Sat Jul 02 03:10:29 +0200 [node-01: wafl_exempt03: wafl.cp.toolong:error]: Aggregate aggr0 experienced a long CP.
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: cf.fm.localFwTransition:debug]: params: {'progresscounter': '1031', 'newstate': 'SF_DUMPCORE', 'prevstate': 'SF_UP'}
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: ha.panicInfoSent:notice]: Node successfully sent a panic information message to its HA partner. Partner name: . Partner system ID: 1234567890.
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: sk.panic:alert]: Panic String: received completion for unknown cmd in process irq282: nvme0 on release 9.9.1P8 (C)
- ESXi-side vmware.log sequence:
2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-VMM: Controller level reset via CC.EN bit transition on nvme0
2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-CORE: Doing a partial reset of controller regs and queues.
2022-07-02T01:10:33.353Z| vcpu-0| | I005: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/.../ontapselect-n02/ontapselect-n02.vmdk'
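
The ONTAP EMS lines above carry a local time with a +0200 offset, while vmware.log is written in UTC, so the two excerpts only line up once both are converted to the same time zone. The following is a minimal Python sketch of that conversion, not a NetApp or VMware tool. The function names `ems_to_utc` and `vmware_to_utc` are hypothetical, and because EMS lines do not include a year, the year is assumed to be supplied by the caller (here taken from the vmware.log date).

```python
import re
from datetime import datetime, timedelta, timezone

# Month abbreviations are mapped by hand so parsing does not depend on the locale.
MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# EMS prefix, e.g. "Sat Jul 02 03:10:30 +0200 [node-01: ..."
EMS_RE = re.compile(
    r"^\w{3} (\w{3}) (\d{2}) (\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2}) ")

# vmware.log prefix, e.g. "2022-07-02T01:10:30.121Z| vcpu-0| ..."
VMW_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|")


def ems_to_utc(line: str, year: int) -> datetime:
    """Convert the timestamp of an EMS line to UTC.

    Assumption: EMS lines carry no year, so the caller passes it in.
    """
    m = EMS_RE.match(line)
    if m is None:
        raise ValueError("not an EMS log line")
    mon, day, hh, mm, ss, sign, off_h, off_m = m.groups()
    offset = timedelta(hours=int(off_h), minutes=int(off_m))
    if sign == "-":
        offset = -offset
    local = datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss),
                     tzinfo=timezone(offset))
    return local.astimezone(timezone.utc)


def vmware_to_utc(line: str) -> datetime:
    """Parse the UTC timestamp at the start of a vmware.log line."""
    m = VMW_RE.match(line)
    if m is None:
        raise ValueError("not a vmware.log line")
    naive = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return naive.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    panic = ("Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: "
             "sk.panic:alert]: Panic String: received completion for "
             "unknown cmd in process irq282: nvme0 on release 9.9.1P8 (C)")
    reset = ("2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-VMM: Controller "
             "level reset via CC.EN bit transition on nvme0")
    t_panic = ems_to_utc(panic, year=2022)
    t_reset = vmware_to_utc(reset)
    print("panic (UTC):", t_panic.isoformat())
    print("reset (UTC):", t_reset.isoformat())
    print("difference in seconds:", (t_reset - t_panic).total_seconds())
```

Applied to the excerpts above, the conversion places the ONTAP panic message and the NVMe controller reset in vmware.log within the same second. EMS timestamps have one-second resolution, so sub-second ordering between the two logs cannot be inferred from this comparison.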