Enhancement (What you would like to be added):
Currently, when the etcd leader loses leadership, the backup-restore leader moves to an unknown state and closes the snapshotter as well as the defragmentor. This is correct in principle, but the backup-restore leader can also close the defragmentor while a defragmentation is still running on etcd. As a result, the defragmentation may never complete on the etcd cluster, which is undesirable because the etcd DB size is already large and needs to come down.
Motivation (Why is this needed?):
Take this scenario:
1. Start a large etcd cluster (DB size >7Gi) with backup-restore.
2. Backup-restore triggers defragmentation on etcd. Because etcd is large, defragmentation takes a long time and makes etcd unavailable, so the etcd leader loses leadership.
3. As the etcd leader loses leadership, the backup-restore leader moves to an unknown state and tries to close the snapshotter as well as the defragmentor.
4. When the backup-restore leader closes the defragmentor, it ends up aborting the defragmentation that is currently running on etcd, and the cycle may continue (see the minimal sketch right after this list).
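A minimal sketch of the failure mode (my illustration, not the actual etcd-backup-restore code): the Defragment call runs under a context that is cancelled when leadership is lost, so the in-flight defragmentation surfaces a "context canceled" error like the one in the logs below. The import path assumes the etcd v3.5+ Go client; the endpoint is taken from the logs and the timing values are made up.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"}, // endpoint from the logs in this issue
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// ctx stands in for the context that backup-restore cancels when the
	// leader elector moves to UnknownState.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		// Simulate losing leadership while the defragmentation is in flight.
		time.Sleep(2 * time.Second)
		cancel()
	}()

	// On a large DB this call runs for a long time; the cancellation above
	// surfaces as "context canceled", matching the log lines below.
	if _, err := cli.Defragment(ctx, "http://localhost:2379"); err != nil {
		log.Printf("failed to defragment etcd member with error: %v", err)
	}
}
```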
Logs of backup-restore:
INFO[0596] All etcd members endPoints: [http://localhost:2379/] actor=backup-restore-server job=defragmentor
INFO[0596] Starting the defragmentation as all members of etcd cluster are in healthy state actor=backup-restore-server job=defragmentor
INFO[0596] Starting the defragmentation on etcd leader actor=backup-restore-server job=defragmentor
INFO[0596] Defragmenting etcd member[http://localhost:2379/] actor=backup-restore-server job=defragmentor
{"level":"warn","ts":"2024-09-23T11:01:08.843+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0605] failed to get status of etcd endPoint: http://localhost:2379/ with error: context deadline exceeded actor=leader-elector
ERRO[0605] failed to elect the backup-restore leader: context deadline exceeded actor=leader-elector
INFO[0605] backup-restore stops leading... actor=backup-restore-server
INFO[0605] Setting status to : 503 actor=backup-restore-server
INFO[0605] Closing defragmentor. actor=backup-restore-server
INFO[0605] backup-restore is in: UnknownState actor=leader-elector
INFO[0605] waiting for Re-election... actor=leader-elector
INFO[0605] handleSsrStopRequest: context canceled actor=backup-restore-server
INFO[0605] Stopping handleSsrStopRequest handler... actor=backup-restore-server
INFO[0605] Closing the Snapshot EventHandler. actor=snapshotter
INFO[0605] Closing the Snapshotter... actor=snapshotter
{"level":"warn","ts":"2024-09-23T11:01:08.843+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
ERRO[0605] failed to defragment etcd member[http://localhost:2379/] with error: context canceled actor=backup-restore-server job=defragmentor
WARN[0605] failed to defrag data with error: context canceled actor=backup-restore-server job=defragmentor
INFO[0605] Snapshotter stopped. actor=backup-restore-server
INFO[0605] Probing etcd... actor=backup-restore-server
INFO[0605] Shutting down... actor=backup-restore-server
INFO[0605] GC: Stop signal received. Closing garbage collector. actor=snapshotter
{"level":"warn","ts":"2024-09-23T11:01:18.844+0530","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://localhost:2379/","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0615] failed to get status of etcd endPoint: http://localhost:2379/ with error: context deadline exceeded actor=leader-elector
ERRO[0615] failed to elect the backup-restore leader: context deadline exceeded actor=leader-elector
INFO[0615] backup-restore is in: UnknownState actor=leader-elector
INFO[0615] waiting for Re-election... actor=leader-elector
Note: So far I have seen this behaviour only with a singleton etcd cluster running on my local machine. This scenario may be rare in production, since production etcd clusters usually run on powerful machines where defragmentation finishes quickly.
Approach/Hint to implement the solution (optional):
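One possible direction (my own sketch; the issue leaves this section open, and the function and channel names below are hypothetical): decouple the per-member Defragment call from the leader-election context and honour the stop signal only between members. Losing leadership then stops further defragmentation without aborting the one already in progress.

```go
package defragmentor

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragMembers is a hypothetical helper: it defragments each member with a
// per-call context that is not derived from the leader-election context and
// checks the stop signal only between members, so an in-flight
// defragmentation is never aborted when backup-restore loses leadership.
func defragMembers(stopCh <-chan struct{}, cli *clientv3.Client, endpoints []string) {
	for _, ep := range endpoints {
		select {
		case <-stopCh:
			// Leadership lost: stop before starting the next member,
			// never in the middle of one.
			return
		default:
		}

		// Per-call timeout independent of the leader-election context.
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Printf("failed to defragment etcd member %s: %v", ep, err)
		}
		cancel()
	}
}
```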