Restoration of a member in a multi-node cluster #645
Comments
I agree with the concern, but consider this scenario in a 5-member cluster:
Now, as you already mentioned, backup-restore will detect during the initialisation phase that it is a
Now, IMO, initialisation will get re-triggered and backup-restore will re-detect this as
No, that isn't required: in the scale-up case we want to avoid going down the wrong path if adding a learner fails. That's why we throw a
@unmarshall, can you please close this issue if you are satisfied with this comment: #645 (comment)
In a previous ticket we made a change where we attempt to add a learner a few times and then give up and exit the container, which results in a container restart. I discussed this with @shreyas-s-rao and we agreed that the earlier approach of retrying add-as-learner an unlimited number of times was sufficient. Restarting the container does not help in any way. So it probably makes sense to remove the limit in both situations and wait until the etcd member is added as a learner (whether it is a new member or an existing member restarting in a multi-node cluster).
Describe the bug:
If an existing etcd member crashes and comes back up with a data directory that is no longer valid, then in a multi-node setup the data directory is removed and only a limited number of attempts are made to add the member as a learner. Now consider a case where more than one member goes down and both are trying to recover (in a 5-member cluster). Quorum is still intact, so it can happen that both members attempt to add themselves as learners and one of them fails.
Expected behavior:
In the scale-up case, adding the current candidate as a learner is attempted repeatedly (up to 6 times). The same should be done when restoring a member in a multi-node cluster requires it to be added as a learner.
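To make the difference between a bounded and an unbounded add-as-learner retry concrete, here is a rough Python sketch. It is not the project's Go code. `FakeCluster`, `LearnerSlotBusy` and `add_as_learner_with_retry` are hypothetical names invented for illustration, and the toy cluster assumes that only one learner can exist at a time, which this issue does not state.

```python
import itertools
import time
from typing import Callable, Optional


class LearnerSlotBusy(Exception):
    """Raised by the toy cluster when it cannot accept another learner."""


class FakeCluster:
    """Toy stand-in for a cluster; assumes only one learner at a time."""

    def __init__(self) -> None:
        self.learner: Optional[str] = None
        self.voters = set()

    def add_learner(self, member: str) -> None:
        # Reject the request while a different member still holds the learner slot.
        if self.learner is not None and self.learner != member:
            raise LearnerSlotBusy(f"{self.learner} is still a learner")
        self.learner = member

    def promote_learner(self) -> None:
        # Promote the current learner (if any) to a voting member.
        if self.learner is not None:
            self.voters.add(self.learner)
            self.learner = None


def add_as_learner_with_retry(
    add: Callable[[], None],
    max_attempts: Optional[int] = 6,
    backoff_seconds: float = 0.0,
    on_failure: Callable[[int], None] = lambda attempt: None,
) -> int:
    """Call `add` until it succeeds and return the number of attempts used.

    max_attempts=None retries without limit; otherwise a RuntimeError is
    raised once the limit is exhausted.
    """
    if max_attempts is None:
        attempts = itertools.count(1)
    else:
        attempts = range(1, max_attempts + 1)
    last_error = None
    for attempt in attempts:
        try:
            add()
            return attempt
        except LearnerSlotBusy as err:
            last_error = err
            on_failure(attempt)  # hook used here only to drive the demo
            time.sleep(backoff_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    cluster = FakeCluster()
    # The first recovering member takes the learner slot immediately.
    print(add_as_learner_with_retry(lambda: cluster.add_learner("etcd-3")))

    def promote_after_two(attempt: int) -> None:
        # Simulate the first member being promoted after two failed attempts.
        if attempt == 2:
            cluster.promote_learner()

    # The second member keeps retrying without a limit until the slot frees up.
    print(add_as_learner_with_retry(
        lambda: cluster.add_learner("etcd-4"),
        max_attempts=None,
        on_failure=promote_after_two,
    ))
```

With `max_attempts=6` the second member would give up if the slot stayed busy for six attempts, which corresponds to the container exit discussed in the comments. With `max_attempts=None` it simply waits until it can be added.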