[Enhancement] Defragmentation unable to complete on large etcd cluster. #782

Open
ishan16696 opened this issue Sep 30, 2024 · 0 comments
Labels
kind/enhancement Enhancement, improvement, extension

@ishan16696
Member

Enhancement (What you would like to be added):
Currently, when the etcd leader loses leadership, the backup-restore leader moves to an unknown state and closes both the snapshotter and the defragmentor. That is correct in general, but the backup-restore leader can also close the defragmentor while a defragmentation is still running on etcd. As a result, the defragmentation may never complete on the etcd cluster, which is a problem because the etcd db size is already large and needs to be brought down.
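
As a minimal Go sketch of this interaction (not the project's actual code; it assumes the defragment RPC is issued with a context tied to the defragmentor's lifecycle), cancelling that context aborts an in-flight `clientv3` `Defragment` call with `context canceled`, matching the logs below:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"}, // endpoint from the logs
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Assumption: defragCtx is derived from the defragmentor's lifecycle context.
	defragCtx, stopDefragmentor := context.WithCancel(context.Background())

	// Simulate the leader-elector moving to UnknownState and closing the
	// defragmentor while the defragment RPC is still in flight.
	go func() {
		time.Sleep(2 * time.Second)
		stopDefragmentor() // cancels defragCtx
	}()

	// On a large db this call can run for a long time; cancelling defragCtx
	// aborts it with "context canceled", leaving the member un-defragmented.
	if _, err := cli.Defragment(defragCtx, "http://localhost:2379"); err != nil {
		log.Printf("failed to defragment etcd member: %v", err)
	}
}
```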

Motivation (Why is this needed?):
Take this scenario:

  1. Start a large etcd cluster (with db size >7Gi) with backup-restore.
  2. Backup-restore triggers defragmentation on etcd; because the db is large, defragmentation takes a long time and makes etcd unavailable, so the etcd leader loses leadership.
  3. As the etcd leader loses leadership, the backup-restore leader moves to an unknown state and tries to close the snapshotter as well as the defragmentor.
  4. By closing the defragmentor, it ends up cancelling the defragmentation that is currently running on etcd, and the cycle may continue (see the sketch after this list for how a close path could instead wait for an in-flight run).
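
Purely as an illustration (not something this issue prescribes, and all names are hypothetical), one direction would be to schedule new rounds off the leadership context while letting an active round run on a detached context, so that closing the defragmentor waits for the in-flight round instead of cancelling it:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// defragmentor is a hypothetical guard: losing leadership stops new rounds
// from being scheduled, but an in-flight round is allowed to finish.
type defragmentor struct {
	wg sync.WaitGroup
}

func (d *defragmentor) run(leaderCtx context.Context, interval time.Duration, defragOnce func(context.Context) error) {
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		for {
			select {
			case <-leaderCtx.Done():
				// Leadership lost: stop scheduling further rounds.
				return
			case <-time.After(interval):
				// An in-flight round uses a context detached from leaderCtx,
				// so losing leadership no longer aborts it mid-run.
				if err := defragOnce(context.Background()); err != nil {
					log.Printf("defragmentation round failed: %v", err)
				}
			}
		}
	}()
}

// closeAndWait blocks until any in-flight round has finished.
func (d *defragmentor) closeAndWait() { d.wg.Wait() }

func main() {
	leaderCtx, loseLeadership := context.WithCancel(context.Background())

	d := &defragmentor{}
	d.run(leaderCtx, time.Second, func(ctx context.Context) error {
		log.Println("defragmenting member...")
		time.Sleep(3 * time.Second) // stands in for a long-running defragment RPC
		log.Println("defragmentation finished")
		return nil
	})

	time.Sleep(2 * time.Second)
	loseLeadership() // leader-elector moves to UnknownState
	d.closeAndWait() // waits for the in-flight round instead of killing it
}
```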

Logs of backup-restore:

INFO[0596] All etcd members endPoints: [http://localhost:2379/]  actor=backup-restore-server job=defragmentor
INFO[0596] Starting the defragmentation as all members of etcd cluster are in healthy state  actor=backup-restore-server job=defragmentor
INFO[0596] Starting the defragmentation on etcd leader   actor=backup-restore-server job=defragmentor
INFO[0596] Defragmenting etcd member[http://localhost:2379/]  actor=backup-restore-server job=defragmentor
{"level":"warn","ts":"2024-09-23T11:01:08.843+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0605] failed to get status of etcd endPoint: http://localhost:2379/ with error: context deadline exceeded  actor=leader-elector
ERRO[0605] failed to elect the backup-restore leader: context deadline exceeded  actor=leader-elector
INFO[0605] backup-restore stops leading...               actor=backup-restore-server
INFO[0605] Setting status to : 503                       actor=backup-restore-server
INFO[0605] Closing defragmentor.                         actor=backup-restore-server
INFO[0605] backup-restore is in: UnknownState            actor=leader-elector
INFO[0605] waiting for Re-election...                    actor=leader-elector
INFO[0605] handleSsrStopRequest: context canceled        actor=backup-restore-server
INFO[0605] Stopping handleSsrStopRequest handler...      actor=backup-restore-server
INFO[0605] Closing the Snapshot EventHandler.            actor=snapshotter
INFO[0605] Closing the Snapshotter...                    actor=snapshotter
{"level":"warn","ts":"2024-09-23T11:01:08.843+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
ERRO[0605] failed to defragment etcd member[http://localhost:2379/] with error: context canceled  actor=backup-restore-server job=defragmentor
WARN[0605] failed to defrag data with error: context canceled  actor=backup-restore-server job=defragmentor
INFO[0605] Snapshotter stopped.                          actor=backup-restore-server
INFO[0605] Probing etcd...                               actor=backup-restore-server
INFO[0605] Shutting down...                              actor=backup-restore-server
INFO[0605] GC: Stop signal received. Closing garbage collector.  actor=snapshotter
{"level":"warn","ts":"2024-09-23T11:01:18.844+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0615] failed to get status of etcd endPoint: http://localhost:2379/ with error: context deadline exceeded  actor=leader-elector
ERRO[0615] failed to elect the backup-restore leader: context deadline exceeded  actor=leader-elector
INFO[0615] backup-restore is in: UnknownState            actor=leader-elector
INFO[0615] waiting for Re-election...                    actor=leader-elector

Note: So far I have only seen this behaviour with a single-member etcd cluster running on my local machine. This scenario may be rare in production, since etcd there usually runs on powerful machines, which lets defragmentation finish faster.

Approach/Hint to implement the solution (optional):

ishan16696 added the kind/enhancement Enhancement, improvement, extension label Sep 30, 2024