メインコンテンツまでスキップ

非常に特定のワークロード条件下でST16000NM002Gドライブの障害発生率が高い

Views:
13
Visibility:
Public
Votes:
0
Category:
e-series-santricity-os-controller-software<a>2008759385</a>
Specialty:
esg
Last Updated:

環境

  • E5760
  • GPFS
  • SANtricity OS 11.60.2R1-11.70.2
  • Seagate ST16000NM002GドライブファームウェアNE00および/またはNE01

問題

これまでのところ、問題が観察されたのは、複数のE5760 EシリーズストレージアレイにまたがるIBM GPFSファイルシステムで、特定の負荷条件下でのみです。
ドライブベンダーによるドライブ分析では、書き込みの99.99%がドライブの0.01%に格納され、1.6GBの範囲内に収まっています。
LBA範囲の低い一部のホットスポットに最大106MB/秒の書き込みを行います。
 
症状は次のとおりです。
  • ドライブ側タイムアウトが原因でドライブチャネルがデグレード状態になりました
  • 複数のドライブの書き込みタイムアウトがタイムアウトする()IOP_FAST_TIMEOUT_ERROR
  • PIエラー
  • 読み取り不能セクター(URS /データ損失)の報告

Eシリーズのデグレード状態のドライブチャネルおよび複数のドライブがデグレード状態のパス KBに記載されている定期的なトラブルシューティング手順では、問題が解決されません。

 
問題は複数のシェルフ/ドロワー/ドライブベイで発生しており、チェーン内に障害が発生している共通のコンポーネントがありません。
すべてのドライブとヘビケーブル(または上記KBのその他のトラブルシューティング手順)を抜き差ししても、改善はありません。
ダイビングは1年未満(5歳未満)で、同じスロットの交換されたドライブも同じ症状/失敗を示しています。
 
メジャーイベントログには、次のようなイベントが表示されます。
 
A:11/30/21, 3:31:03 AM (03:31:03) 2206 1209 Drive channel set to Degraded - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:31:03 AM (03:31:03) 2205 1513 Individual drive - Degraded path - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:46 AM (03:30:46) 2203 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x189fae400, Blocks: 0x400 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2202 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0xb49c7358, Blocks: 0x8 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:06 AM (03:30:06) 2200 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:49 AM (03:29:49) 2199 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:41 AM (03:29:41) 2198 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x21639800, Blocks: 0x400 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:39 AM (03:29:39) 2197 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:38 AM (03:29:38) 2196 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1538dcec0, Blocks: 0x10 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:35 AM (03:29:35) 2195 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1266587a0, Blocks: 0x8 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
 
A:12/31/21, 9:31:45 AM (09:31:45) 52721 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814b <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314b
A:12/31/21, 9:31:44 AM (09:31:44) 52720 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814a <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314a
A:12/31/21, 9:31:42 AM (09:31:42) 52719 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398149 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473149
A:12/31/21, 9:31:41 AM (09:31:41) 52718 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398148 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473148
A:12/31/21, 9:31:41 AM (09:31:41) 52717 201e VDD repair started - Shelf 30, Bay A - SSID: 33, Devnum: 0xffffff
A:12/31/21, 9:31:41 AM (09:31:41) 52716 201f VDD repair completed - Shelf 30, Bay A - SSID: 33, Devnum: 0x010217 LBA: 0x12c2399800
----> Flags: 0x202005 = READ: Read Operation, ERROR: IO Compl. w. Err, NOLOCK: Prevent lock during read err., PI: Error coding in effect - Error: 0x844 = UA_MISCORRECTED_DATA_ERROR
A:12/31/21, 9:31:40 AM (09:31:40) 52715 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239994f <--CRITICAL
----> Physical Drive in Tray 32 Slot 24, LBA: 0x3eed7314f
A:12/31/21, 9:31:40 AM (09:31:40) 52714 1012 Destination driver error - Shelf 32, Drawer 2, Bay 11
A:12/31/21, 9:31:40 AM (09:31:40) 52713 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/31/21, 9:31:37 AM (09:31:37) 52712 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
 
A:12/25/21, 7:58:16 AM (07:58:16) 47154 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47153 2215 Drive marked failed - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47152 226c Drive failure - Shelf 33, Drawer 4, Bay 5 - Cause: 3 = Write failure; Drive WWN: 5000c500cadc69b7; SN: ZL29F9KB0000C107BKS5 <--CRITICAL
B:12/25/21, 7:58:40 AM (07:58:40) 47151 2226 Drive spun down - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47150 7e05 Drive recovery criteria not met - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5

Sign in to view the entire content of this KB article.

New to NetApp?

Learn more about our award-winning Support

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.