This repository has been archived by the owner on Sep 30, 2024. It is now read-only.
Assume that replica1 is to become the new master. According to the source code, Orchestrator first has replica1 take over its siblings, that is, the old master’s other replicas.
This presents a problem. The old master has rpl_semi_sync_master_wait_for_slave_count=2 but is now left with only one replica (replica1), so all DML operations on it block while waiting for an ACK that cannot arrive.
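The blocking condition can be illustrated with a toy model. This is only a sketch, not MySQL's implementation: `commit_blocks` and its parameters are hypothetical names, the total of three replicas before the takeover is an assumed number, and the model assumes every attached replica is a semi-sync replica and that the master keeps waiting when too few replicas are attached.

```python
def commit_blocks(wait_for_slave_count: int, semi_sync_replicas: int) -> bool:
    """Toy rule: a commit waits when fewer replicas can ACK than the master requires."""
    return semi_sync_replicas < wait_for_slave_count


# Before the takeover: the old master still has all its replicas (assumed: 3).
print(commit_blocks(wait_for_slave_count=2, semi_sync_replicas=3))  # False

# After replica1 takes over its siblings: the old master has replica1 only.
print(commit_blocks(wait_for_slave_count=2, semi_sync_replicas=1))  # True
```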
Next, Orchestrator attempts to set read_only=1 on the old master. Because DML is blocked (as mentioned), the set read_only=1 statement blocks as well. If rpl_semi_sync_master_timeout is effectively infinite, the switchover hangs indefinitely, because ExecInstance has no timeout.
Even if rpl_semi_sync_master_timeout is finite, this situation significantly lengthens the switchover and therefore increases the impact on the business.
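To illustrate why a client-side timeout matters here, the following is a rough sketch of bounding a blocking call. It is not Orchestrator's ExecInstance and not a proposed patch: `exec_with_timeout` and `blocked_set_read_only` are hypothetical names, and the blocked statement is simulated with a `threading.Event` rather than a real MySQL connection.

```python
import threading


def exec_with_timeout(statement_fn, timeout_seconds: float) -> str:
    """Run statement_fn in a worker thread and stop waiting after timeout_seconds."""
    worker = threading.Thread(target=statement_fn, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    # If the worker is still alive, the statement is still blocked.
    return "timed out" if worker.is_alive() else "ok"


release = threading.Event()


def blocked_set_read_only() -> None:
    # Stand-in for `set read_only=1` stuck behind DML that waits for an ACK.
    release.wait()


print(exec_with_timeout(blocked_set_read_only, timeout_seconds=0.2))  # timed out
release.set()  # unblock the simulated statement
```

Note that the sketch only stops the caller from waiting. A real client would also have to cancel or kill the statement on the server side.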
In contrast, MHA’s switchover avoids this issue because it proceeds as follows:
1. Block writes on the old master.
2. Wait for the new master to catch up, then remove its read-only restriction; at this point, the business can resume operations.
3. Point all other replicas of the old master (everything except the new master) to the new master.
4. Finally, run change master on the old master so that it replicates from the new master.
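The difference between the two orderings can be sketched with a small toy model. Everything here is an illustrative assumption rather than code from either tool: the `Master` class, the function names, and the replica names `replica2` and `replica3` are hypothetical, and "stall" simply means the old master is still writable while having fewer replicas than rpl_semi_sync_master_wait_for_slave_count.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    wait_for_slave_count: int
    replicas: set = field(default_factory=set)
    read_only: bool = False

    def writes_stall(self) -> bool:
        # Toy rule: a writable master with too few replicas waits for ACKs.
        return not self.read_only and len(self.replicas) < self.wait_for_slave_count


def orchestrator_order(old: Master, new_master: str) -> bool:
    """Ordering described in this issue; returns True if writes stall at any point."""
    old.replicas = {new_master}      # new master takes over its siblings first
    stalled = old.writes_stall()     # old master is still writable here
    old.read_only = True             # read_only=1 comes only afterwards
    return stalled


def mha_order(old: Master, new_master: str) -> bool:
    """Ordering of the MHA steps listed above; returns True if writes stall."""
    old.read_only = True             # step 1: block writes on the old master
    stalled = old.writes_stall()
    old.replicas = {new_master}      # step 3: other replicas move to the new master
    stalled = stalled or old.writes_stall()
    old.replicas = set()             # step 4: old master becomes a replica itself
    return stalled or old.writes_stall()


def topology() -> Master:
    return Master(wait_for_slave_count=2,
                  replicas={"replica1", "replica2", "replica3"})


print(orchestrator_order(topology(), "replica1"))  # True
print(mha_order(topology(), "replica1"))           # False
```

In this model the only thing that differs is whether read_only is set before or after the replicas are moved, which is the point of the comparison.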
I don’t understand why Orchestrator first lets the new master (replica1) take over its siblings. This ordering introduces issues that MHA avoids.
Fanduzi changed the title from "GracefulMasterTakeover logic flaw in semi-synchronous replication scenario" to "GracefulMasterTakeover logic flaw in semi-sync replication scenario" on Sep 14, 2024.