Recovery bug in the case of spare rank failures #58

Open
Matthew-Whitlock opened this issue Jul 28, 2021 · 0 comments

The rank offset system used to determine which rank a given spare replaces currently assumes that no spare ranks have failed. If spare-rank 0 and user-rank 0 both fail, spare-rank 1 should replace user-rank 0, but the current code assumes that spare-rank 0 will handle that replacement, so user-rank 0 is never replaced.

The line in question is here:
https://github.com/epizon-project/Fenix/blob/master/src/fenix_process_recovery.c#L515

This should be updated to check for the number of failed spare ranks with a lower rank than my own, then subtract that count from the calculated rank_offset. Other parts of the function probably need the same update.
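
A minimal sketch of that adjustment, not taken from the Fenix source. It assumes spares occupy the highest ranks of the world communicator (starting at a hypothetical `first_spare_rank`), that the unadjusted offset is `my_rank - first_spare_rank`, and that the list of failed ranks is available as a plain array. The names `count_failed_spares_below` and `adjusted_rank_offset` are made up for illustration.

```c
#include <assert.h>

/* Count failed ranks that are spares with a lower rank than my_rank.
 * Assumption: spares occupy ranks [first_spare_rank, world_size). */
int count_failed_spares_below(const int *failed_ranks, int num_failed,
                              int first_spare_rank, int my_rank)
{
    int count = 0;
    for (int i = 0; i < num_failed; i++) {
        if (failed_ranks[i] >= first_spare_rank && failed_ranks[i] < my_rank) {
            count++;
        }
    }
    return count;
}

/* Offset of this spare among the *surviving* spares.
 * Assumption: the unadjusted offset is my_rank - first_spare_rank. */
int adjusted_rank_offset(const int *failed_ranks, int num_failed,
                         int first_spare_rank, int my_rank)
{
    assert(my_rank >= first_spare_rank); /* only spares should call this */
    int naive_offset = my_rank - first_spare_rank;
    return naive_offset - count_failed_spares_below(failed_ranks, num_failed,
                                                    first_spare_rank, my_rank);
}
```

For example, with four user ranks (0 to 3), spares at world ranks 4 and 5, and failures at world ranks 0 and 4, the spare at world rank 5 has an unadjusted offset of 1 and an adjusted offset of 0, so it would take over the first failed user rank.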
