Thread (ontap: cpu0) on CPU 0 hung for 300001 milliseconds
Environment
- ONTAP 9
- FAS / AFF platforms
Issue
The following logs appear in the SP console log.
- The node panics with the following message:
PANIC : thread (ontap: cpu0) on cpu 0 hung for 300001 milliseconds
version: 9.14.1P11: Wed Jan 22 06:55:28 EST 2025
conf : x86_64.optimize
cpuid = 0
KDB: stack backtrace:
vpanic() at vpanic+0x602/frame 0xfffffe00935fcdb0
panic() at panic+0x42/frame 0xfffffe00935fce10
check_starvation_internal() at check_starvation_internal+0xb5/frame 0xfffffe00935fce40
hardclock() at hardclock+0x45/frame 0xfffffe00935fce90
resumectx() at resumectx+0x427/frame 0xfffffe00935fcef0
lapic_handle_timer() at lapic_handle_timer+0xa2/frame 0xfffffe00935fcf20
Xtimerint() at Xtimerint+0x128/frame 0xfffffe00935fcf20
--- interrupt, rip = 0xffffffff80d3cd91, rsp = 0xfffffe01a6945a70, rbp = 0xfffffe01a6945a70 ---
bzero_sse2_nt() at bzero_sse2_nt+0x51/frame 0xfffffe01a6945a70
vm_hw_module_init() at vm_hw_module_init+0x6c1/frame 0xfffffe01a6945b40
sk_init_mem() at sk_init_mem+0x39/frame 0xfffffe01a6945b80
startup_boot_processor() at startup_boot_processor+0x56/frame 0xfffffe01a6945b90
psm_processor_start() at psm_processor_start+0x2e/frame 0xfffffe01a6945bb0
fork_exit() at fork_exit+0xb6/frame 0xfffffe01a6945bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01a6945bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 7m24s
ahcich0: AHCI reset done: devices=00000001
Dumper is not yet registered. A coredump will not be available at this time.
System halting...
- A Pelog may also be dumped:
Found platform error in ONTAP region.
region 2 header
Sig(0x544f414e)
Size(28672)
Ver(1)
Tail(0)
DataLen(100)
DataCRC(0x6edc)
HdrCRC(0xe223)
Rec(1) @ 0x14, flag(0x0) len(24) tstamp(0x6832f1ea)
log(7) msg(UECC Addr 0x18da78340)
Rec(2) @ 0x38, flag(0x0) len(8) tstamp(0x6832f1ea)
node(0) chan(3) dimm(1)
rank(1) bank(0x2) row(0x864) col(0x308)
Rec(1) @ 0x4c, flag(0x0) len(32) tstamp(0x6832f1ea)
log(7) msg(devtag(0x2), correrr(0x50ba))
- An ECC error is reported:
ECC error at DIMM-12: CE-04-2002-42C12E13,ADDR 0x18da78340,(Node(0), Memory controller(1), CH(3), DIMM(1), Rank(1), Bank Group(3), Bank(0x2), Row(0x864), Col(0x308)), devtag(0x2), correrr(0x50ba)
Uncorrectable Machine Check Error at CPU3. BDWL_HA1 Error: STATUS<0xfe0003c000010091>(Val,OverF,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0xf),ExtErr(0x1),ErrCode(Channel 1, Read),ErrCode(0x91)),MISC<0x0000000140560e86>(HaDbBank(0),PE(0),ReqOpcode(0xa),RNID(0),RTID(0x2b),HTID(0x7))
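The STATUS value in the machine check message can be cross-checked by hand. The following is a minimal Python sketch, assuming the register follows the architectural IA32_MCi_STATUS layout described in the Intel Software Developer's Manual (flag bits 63 to 57, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name `decode_mci_status` and the dictionary keys are illustrative only and are not part of ONTAP or any NetApp tool.

```python
# Flag bit positions per the architectural IA32_MCi_STATUS layout (assumption:
# the STATUS value in the log uses this standard layout).
FLAG_BITS = {
    "Val": 63, "OverF": 62, "UnCor": 61, "Enable": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_REQUESTS = {
    0b000: "generic", 0b001: "read", 0b010: "write",
    0b011: "address/command", 0b100: "scrub",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status value into its main fields."""
    mca_code = status & 0xFFFF
    decoded = {
        "flags": [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": mca_code,                      # bits 15:0
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear,
    # ignoring the filtering bit (bit 12).
    if (mca_code & 0xEF80) == 0x0080:
        decoded["memory_request"] = MEMORY_REQUESTS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        decoded["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return decoded


if __name__ == "__main__":
    for key, value in decode_mci_status(0xFE0003C000010091).items():
        print(f"{key}: {value}")
```

For the value in the log, this yields the same flags (Val, OverF, UnCor, Enable, MiscV, AddrV, PCC), a corrected error count of 0xf, and a read error on channel 1, which agrees with the decoded fields printed after STATUS in the message.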