Cassandra repair progress slow alert and frequent cassandra-reaper service restarts in StorageGRID 11.4

Category:: storagegrid-webscale

Specialty:: sgrid


Environment

NetApp StorageGRID 11.4 (earlier than 11.4.0.3)
New StorageGRID deployment
NetApp StorageGRID environment upgraded from 11.3 (earlier than the 11.3.0.11 hotfix)

Issue

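After a new StorageGRID 11.4 deployment, or after upgrading to 11.4 from a release earlier than 11.3.0.11 (for example, 11.3.0.10 or any other 11.3 build), the following alert may appear in the StorageGRID GUI:

Cassandra repair progress slow

This alert can have many causes, such as unavailable services or communication problems.
To confirm that your issue matches this article, check for the following additional signatures: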

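The Cassandra repair progress slow alert has persisted for two days or longer, and the effective repair rate is 0%.
The cassandra-reaper service (which is responsible for Cassandra repair operations) is restarting frequently on various Storage Nodes.

You can confirm this in the /var/local/log/servermanager.log file on the Storage Nodes.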

The Cassandra Reaper log, found at /var/local/log/cassandra-reaper.log or as reaper.log in the lumberjack collection, contains exceptions reporting that consistency level QUORUM or EACH_QUORUM could not be achieved, for example:

WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
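To check a saved copy of cassandra-reaper.log (or reaper.log from a lumberjack collection) for this signature, a small script can search for the UnavailableException message quoted above. This is a minimal sketch, assuming the exception text appears exactly as in the example line; the function and type names are hypothetical and are not part of any StorageGRID tool.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches the driver message shown in the log excerpt, e.g.
# "UnavailableException: Not enough replicas available for query
#  at consistency EACH_QUORUM (2 required but only 0 alive)".
# EACH_QUORUM is listed first so it is not shadowed by QUORUM.
UNAVAILABLE_RE = re.compile(
    r"UnavailableException: Not enough replicas available for query "
    r"at consistency (?P<level>EACH_QUORUM|QUORUM) "
    r"\((?P<required>\d+) required but only (?P<alive>\d+) alive\)"
)


class QuorumFailure(NamedTuple):
    """One failed QUORUM/EACH_QUORUM request found in the log (hypothetical type)."""
    level: str
    required: int
    alive: int


def find_quorum_failures(lines: Iterable[str]) -> List[QuorumFailure]:
    """Return every QUORUM or EACH_QUORUM UnavailableException in the given lines."""
    failures = []
    for line in lines:
        match = UNAVAILABLE_RE.search(line)
        if match:
            failures.append(
                QuorumFailure(
                    match["level"], int(match["required"]), int(match["alive"])
                )
            )
    return failures


def scan_log(path: str) -> List[QuorumFailure]:
    """Read a log file (path supplied by the caller) and scan it line by line."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return find_quorum_failures(log_file)
```

For example, `scan_log("/var/local/log/cassandra-reaper.log")` on a Storage Node returns the matching entries. Many hits reporting `0 alive` are consistent with the pattern described in this article.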

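The Cassandra Reaper repair list shows that the repairs of some or all keyspaces have the following message as their last event. You can view this list in reaper_commands.txt for the Storage Node in the lumberjack collection, or by running this command in an SSH session to a Storage Node:

spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid

In the excerpts below, last_event reads "Postponed a segment because no coordinator was reachable" and segments_repaired is 0: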

"creation_time": "2020-11-24T23:05:08Z", "current_time": "2020-12-08T18:59:39Z", "datacenters": [], "duration": "7 days 0 hours 2 minutes 13 seconds", "end_time": "2020-12-01T23:07:22Z", "estimated_time_of_arrival": null, "id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393", "incremental_repair": false, "intensity": 1.000, "keyspace_name": "storagegrid", "last_event": "Postponed a segment because no coordinator was reachable", "nodes": [], "owner": "auto-scheduling", "pause_time": null, "repair_parallelism": "PARALLEL", "repair_thread_count": 4, "repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a", "segments_repaired": 0, "start_time": "2020-11-24T23:05:08Z", "state": "ABORTED",

"creation_time": "2020-11-17T20:50:58Z", "current_time": "2020-12-08T18:59:40Z", "datacenters": [], "duration": "7 days 0 hours 0 minutes 32 seconds", "end_time": "2020-11-24T20:51:31Z", "estimated_time_of_arrival": null, "id": "9882a450-2916-11eb-8180-07cae1e33f50", "incremental_repair": false, "intensity": 1.000, "keyspace_name": "reaper_db", "last_event": "Postponed a segment because no coordinator was reachable", "nodes": [], "owner": "auto-scheduling", "pause_time": null, "repair_parallelism": "PARALLEL", "repair_thread_count": 4, "repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a", "segments_repaired": 0, "start_time": "2020-11-17T20:50:59Z", "state": "ABORTED",
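If you have the repair runs as JSON, you can filter them for this article's signature: no segments repaired, and a last event reporting an unreachable coordinator. This is a minimal sketch, assuming the runs have been saved as a JSON array of objects with the keyspace_name, segments_repaired, and last_event fields seen in the excerpts above. The article shows only fragments, so the exact shape of the full spreaper output is an assumption, and the function names are hypothetical.

```python
import json
from typing import Any, Dict, List

# last_event text copied from the status-cluster excerpts in this article.
NO_COORDINATOR = "Postponed a segment because no coordinator was reachable"


def stalled_runs(runs: List[Dict[str, Any]]) -> List[str]:
    """Return keyspace names of runs that repaired nothing and lost their coordinator."""
    return [
        run["keyspace_name"]
        for run in runs
        if run.get("segments_repaired") == 0 and run.get("last_event") == NO_COORDINATOR
    ]


def stalled_runs_from_json(text: str) -> List[str]:
    """Parse a JSON array of repair runs (assumed format) and apply stalled_runs."""
    return stalled_runs(json.loads(text))
```

Matching on the exact last_event string keeps the check narrow. A run that repaired no segments for some other reason is not reported.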