
S3/Swift requests return ServiceUnavailable failure along with node decommissioning

Category:
storagegrid-webscale
Specialty:
sgrid

Environment

StorageGRID OS 11.6

Issue

  • S3/Swift requests return a ServiceUnavailable error during node decommissioning.
  • The following alarms are also raised during this time:
    • SLSA (CPU Load Average)
    • RORQ (Outbound Replications - Queued)
    • RIRQ (Inbound Replications - Queued)
  • Bycast logs show that the requests failed because of a Cassandra TimeoutException:
    • HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
    • OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
  • Prometheus data shows the following:
  1. CPU usage is abnormal on the specific node that is being decommissioned.
    sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
    Note: st is the prefix shared by the names of all Storage Nodes.
  2. iowait on this node increases fivefold (from 10% to 50%) during the decommission, which indicates that the disk system is the bottleneck.
    sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
  3. Utilization of every disk on this node is nearly 100%.
    irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
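The per-node queries in steps 2 and 3 can be parameterized by node name and submitted to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The Python sketch below only builds the query strings and the request URL. The base URL `http://prometheus.example:9090`, the node name `st-node-01`, and the helper names are hypothetical placeholders; how Prometheus is reached on a given grid is not covered here.

```python
from urllib.parse import urlencode

# Query templates copied from steps 2 and 3; NODE is a placeholder token.
CPU_BY_MODE = (
    "sum by (mode)(irate(node_cpu_seconds_total"
    "{instance=\"NODE\",mode!~'idle|guest|nice'}[5m])) * 100"
    " / count by (mode)(node_cpu_seconds_total"
    "{instance=\"NODE\",mode!~'idle|guest|nice'})"
)
DISK_UTIL = (
    "irate(node_disk_io_time_seconds_total"
    "{instance=\"NODE\",device=~'^sd.*'}[5m])*100"
)


def build_query(template: str, node: str) -> str:
    """Substitute the storage node name into a query template."""
    return template.replace("NODE", node)


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url + "/api/v1/query?" + urlencode({"query": promql})


if __name__ == "__main__":
    base = "http://prometheus.example:9090"  # hypothetical address
    for template in (CPU_BY_MODE, DISK_UTIL):
        print(query_url(base, build_query(template, "st-node-01")))
```

Keeping the templates as plain strings makes it easy to compare them with the queries listed in this article before running them.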
  • Comparing the growth in free filesystem bytes on two decommissioned nodes shows a sharp change on the bad node early in the decommission, which indicates that the problem node had more read and truncate activity at that stage.
    • sum(node_filesystem_free_bytes{instance="node name",mountpoint=~"/var/local/rangedb/.*"})
      • 2023/7/5/13:16 GMT ~ 2023/7/5/14:36 GMT
        • Bad node:  724.45TB - 724.18TB = 0.27TB = 270GB
        • Good node: 528.47TB - 528.45TB = 0.02TB = 20GB
      • 2023/7/5/13:16 GMT ~ 2023/7/6/02:04 GMT
        • Bad node:  725.00TB - 724.18TB = 0.82TB = 820GB
        • Good node: 528.57TB - 528.45TB = 0.12TB = 120GB
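The deltas listed for the two time windows are plain subtractions of the free-space readings. The short Python sketch below reproduces them from the listed values and also prints the bad-to-good ratio, which this article does not state. Decimal units (1 TB = 1000 GB) are assumed, matching the conversions shown, and the helper name `freed_gb` is hypothetical.

```python
from decimal import Decimal

# (start, end) free space in TB for each window, copied from the figures above.
WINDOWS = {
    "2023/7/5 13:16 to 2023/7/5 14:36 GMT": {
        "bad": ("724.18", "724.45"),
        "good": ("528.45", "528.47"),
    },
    "2023/7/5 13:16 to 2023/7/6 02:04 GMT": {
        "bad": ("724.18", "725.00"),
        "good": ("528.45", "528.57"),
    },
}


def freed_gb(start_tb: str, end_tb: str) -> Decimal:
    """Free-space increase in GB, assuming 1 TB = 1000 GB."""
    return (Decimal(end_tb) - Decimal(start_tb)) * 1000


for window, nodes in WINDOWS.items():
    bad = freed_gb(*nodes["bad"])
    good = freed_gb(*nodes["good"])
    print(f"{window}: bad={bad} GB, good={good} GB, ratio={bad / good:.1f}x")
```

`Decimal` is used instead of floats so that the subtraction of two-decimal readings does not pick up binary rounding noise.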
  • Comparing performance data in the daily ASUP for the affected node and another node shows that the problem node has higher IOPS and throughput, and consequently higher read/write latency:
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:

Bad node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  51070465 3050503068160   23246  1869666 :  24067972 379745803264   45470  13645260 :   0
  3 Hst :  50889777 3049366095360   23310  1760814 :  24248943 380225977344   45183  13645220 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 256171408 35181547092992   17239   852896 :  82234342 1336298067456    2512   286906 :   0
  4 Drv :    288   294912    4258    4241 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

Good node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  27647780 2876604737536    5274   829929 :  11826653 237424963584    131   511517 :   0
  3 Hst :  27509975 2877446842368    5303   826519 :  12073420 238340426240    131   620620 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 136207478 28042508481024    3965   325577 :  7641267 528941565952    4254   45393 :   0
  4 Drv :    288   294912    4301    4219 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411
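
To make the latency comparison concrete, the host-channel rows can be parsed and their average response times (ART) compared directly. The Python sketch below uses the channel 2 `Hst` rows copied from the two outputs above. `parse_row` is a hypothetical helper written for this fixed column layout, not a NetApp tool.

```python
# Channel 2 host rows copied from the bad-node and good-node outputs.
BAD_ROW = "  2 Hst :  51070465 3050503068160   23246  1869666 :  24067972 379745803264   45470  13645260 :   0"
GOOD_ROW = "  2 Hst :  27647780 2876604737536    5274   829929 :  11826653 237424963584    131   511517 :   0"

FIELDS = ("success", "bytes", "art_usec", "mrt_usec")


def parse_row(row: str) -> dict:
    """Split one completion row into read and write counters."""
    # The row has four colon-separated groups: channel, reads, writes, errors.
    _channel, reads, writes, _errs = [part.split() for part in row.split(":")]
    return {
        "read": dict(zip(FIELDS, map(int, reads))),
        "write": dict(zip(FIELDS, map(int, writes))),
    }


bad, good = parse_row(BAD_ROW), parse_row(GOOD_ROW)
for op in ("read", "write"):
    ratio = bad[op]["art_usec"] / good[op]["art_usec"]
    print(f"{op} ART: bad={bad[op]['art_usec']} us, "
          f"good={good[op]['art_usec']} us, ratio={ratio:.1f}x")
```

For these rows, read ART on the bad node is a little over four times that of the good node, and write ART is several hundred times higher.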

Note

  • R E A D S = S3/Swift GET requests
  • W R I T E S = S3/Swift PUT requests
  • ByteXfered = throughput
  • Success = IOPS
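
Because `#Success` and `ByteXfered` are cumulative counters, dividing them by the `Seconds since statistics cleared` value gives average rates over the capture window. The Python sketch below does this for the channel 2 host read counters of both nodes. It assumes that `ByteXfered` is expressed in bytes and that both captures cover the same window (both outputs report 86411 seconds); the helper name `average_rates` is hypothetical.

```python
ELAPSED_SECONDS = 86411  # "Seconds since statistics cleared" in both outputs

# Channel 2 host read counters (#Success, ByteXfered) copied from the outputs.
READ_COUNTERS = {
    "bad": (51070465, 3050503068160),
    "good": (27647780, 2876604737536),
}


def average_rates(successes: int, bytes_transferred: int, seconds: int) -> tuple:
    """Return (operations per second, MiB per second) averaged over the window."""
    return successes / seconds, bytes_transferred / seconds / 2**20


for node, (successes, nbytes) in READ_COUNTERS.items():
    iops, mib_s = average_rates(successes, nbytes, ELAPSED_SECONDS)
    print(f"{node}: {iops:.0f} read IOPS, {mib_s:.1f} MiB/s")
```

These are averages over roughly 24 hours, so short bursts during the decommission can be much higher than the values printed.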
