S3 / Swift requests fail with a ServiceUnavailable error during node decommission

Environment

StorageGRID OS 11.6

Issue

  • S3 / Swift requests fail with ServiceUnavailable while a node is being decommissioned.
  • The following alarms are raised at the same time:
    • SLSA (CPU Load Average)
    • RORQ (Outbound Replications - Queued)
    • RIRQ (Inbound Replications - Queued)
  • The bycast log shows a Cassandra TimeoutException, and requests fail for the following reason:
    • HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
    • OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
  • Prometheus data shows the following:
  1. During the decommission, CPU usage on one specific node stands out from the other nodes.
    sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
    Note: st is the common prefix of all Storage Node names.
  2. iowait on this node increases fivefold (from 10% to 50%) as the decommission proceeds, which indicates that the disk subsystem has become the bottleneck.
    sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
  3. Utilization of all disks on this node is close to 100%.
    irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
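  • Comparing the performance data of the affected node with that of another node, both taken from the daily ASUP, shows that the affected node has higher IOPS and throughput as well as higher read/write latency.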
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:

Affected node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  51070465 3050503068160   23246  1869666 :  24067972 379745803264   45470  13645260 :   0
  3 Hst :  50889777 3049366095360   23310  1760814 :  24248943 380225977344   45183  13645220 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 256171408 35181547092992   17239   852896 :  82234342 1336298067456    2512   286906 :   0
  4 Drv :    288   294912    4258    4241 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

Healthy node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  27647780 2876604737536    5274   829929 :  11826653 237424963584    131   511517 :   0
  3 Hst :  27509975 2877446842368    5303   826519 :  12073420 238340426240    131   620620 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 136207478 28042508481024    3965   325577 :  7641267 528941565952    4254   45393 :   0
  4 Drv :    288   294912    4301    4219 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

  • R E A D S = S3 / Swift GET requests
  • W R I T E S = S3 / Swift PUT requests
  • ByteXfered = throughput
  • Success = IOPS

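As a rough cross-check of the comparison above, the counters can be converted into average rates by dividing them by the "Seconds since statistics cleared" value. The Python sketch below does this for the channel 2 rows of "Target Read/Write Completions". It assumes that #Success and ByteXfered are cumulative since the statistics were cleared and that ART(uSec) is an average response time in microseconds; the output itself does not state either. The class and method names are illustrative only.

```python
from dataclasses import dataclass

# "Seconds since statistics cleared" (identical for both nodes in the output above)
SECONDS_SINCE_CLEARED = 86411


@dataclass
class ChannelStats:
    """One 'Hst' row of Target Read/Write Completions (hypothetical container)."""
    reads: int          # #Success, READS column
    read_bytes: int     # ByteXfered, READS column
    read_art_us: int    # ART(uSec), READS column (assumed: average response time)
    writes: int         # #Success, WRITES column
    write_bytes: int    # ByteXfered, WRITES column
    write_art_us: int   # ART(uSec), WRITES column (assumed: average response time)

    def avg_iops(self, seconds: int) -> float:
        # Assumes #Success is a cumulative count over the interval.
        return (self.reads + self.writes) / seconds

    def avg_throughput_mib_s(self, seconds: int) -> float:
        # Assumes ByteXfered is a cumulative byte count over the interval.
        return (self.read_bytes + self.write_bytes) / seconds / 2**20


# Values copied from the channel 2 rows above.
affected = ChannelStats(51070465, 3050503068160, 23246,
                        24067972, 379745803264, 45470)
healthy = ChannelStats(27647780, 2876604737536, 5274,
                       11826653, 237424963584, 131)

for name, node in (("affected", affected), ("healthy", healthy)):
    print(f"{name}: "
          f"{node.avg_iops(SECONDS_SINCE_CLEARED):.1f} IOPS, "
          f"{node.avg_throughput_mib_s(SECONDS_SINCE_CLEARED):.1f} MiB/s, "
          f"read ART {node.read_art_us} us, write ART {node.write_art_us} us")

iops_ratio = affected.avg_iops(SECONDS_SINCE_CLEARED) / healthy.avg_iops(SECONDS_SINCE_CLEARED)
read_art_ratio = affected.read_art_us / healthy.read_art_us
print(f"IOPS ratio (affected/healthy): {iops_ratio:.2f}")
print(f"Read ART ratio (affected/healthy): {read_art_ratio:.2f}")
```

Under these assumptions the affected node handles roughly twice the average IOPS of the healthy node on this channel, while its response times are several times higher. This is consistent with the disk bottleneck seen in the Prometheus data.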