GracefulMasterTakeover logic flaw in semi-sync replication scenario #1508

Fanduzi · 2024-09-14T08:09:21Z

Consider the following topology:

                                                         ,- replica1 (Rpl_semi_sync_slave_status=1)
                                                        ,
                                                       ,
  master (
          Rpl_semi_sync_master_status=1
          rpl_semi_sync_master_wait_no_slave=1         - - replica2 (Rpl_semi_sync_slave_status=1)
          rpl_semi_sync_master_wait_for_slave_count=2
         )  
                                                       `
                                                        `
                                                         `- replica3 (Rpl_semi_sync_slave_status=1)

Assume that replica1 is the new master. Based on the source code, Orchestrator first has the new master, replica1, take over its siblings (the old master’s other replicas, replica2 and replica3).

The topology then becomes:

                                                                                                    ,- replica2 (Rpl_semi_sync_slave_status=1)
  master (                                                                                        ,
          Rpl_semi_sync_master_status=1
          rpl_semi_sync_master_wait_no_slave=1         - replica1 (Rpl_semi_sync_slave_status=1)
          rpl_semi_sync_master_wait_for_slave_count=2
         )                                                                                        `
    
                                                                                                    `- replica3 (Rpl_semi_sync_slave_status=1)

This presents a problem. The old master has rpl_semi_sync_master_wait_for_slave_count=2 but now has only one replica (replica1), so every DML operation blocks while waiting for a second ACK that cannot arrive.
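To make the blocking condition concrete, here is a minimal Python sketch. It is a simplified model of the two settings as I understand them, not MySQL's actual implementation; the function name `commit_wait_seconds` and its parameters are made up for illustration, and network latency is ignored.

```python
def commit_wait_seconds(semi_sync_replicas, wait_for_slave_count,
                        wait_no_slave, timeout_s):
    """Simplified model: how long a commit on the master waits for ACKs.

    semi_sync_replicas   -- replicas currently attached with semi-sync enabled
    wait_for_slave_count -- rpl_semi_sync_master_wait_for_slave_count
    wait_no_slave        -- rpl_semi_sync_master_wait_no_slave (True means 1)
    timeout_s            -- rpl_semi_sync_master_timeout in seconds,
                            float("inf") for "effectively infinite"
    """
    if semi_sync_replicas >= wait_for_slave_count:
        # Enough replicas can acknowledge, so the commit proceeds.
        return 0.0
    if not wait_no_slave:
        # Assumption: with wait_no_slave off, the master falls back to
        # asynchronous replication instead of waiting.
        return 0.0
    # Too few replicas and wait_no_slave is on: wait for the full timeout.
    return timeout_s
```

With the values from this issue, `commit_wait_seconds(3, 2, True, float("inf"))` gives no wait for the original topology, while `commit_wait_seconds(1, 2, True, float("inf"))` never returns a finite wait once only replica1 is left under the old master.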

Next, Orchestrator will attempt to set read_only=1 on the old master, but since the DML operations are blocked (as mentioned), the set read_only=1 operation will also be blocked. If rpl_semi_sync_master_timeout is infinite, the switchover will hang indefinitely because ExecInstance does not have a timeout limit.
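For illustration only, this is roughly what bounding such a call with a deadline could look like. It is a generic Python sketch, not Orchestrator's Go code and not a proposed patch; `run_with_deadline` is a hypothetical helper, and `fn` stands in for whatever issues the blocked statement.

```python
import concurrent.futures


def run_with_deadline(fn, timeout_s):
    """Run fn in a worker thread and stop waiting after timeout_s seconds.

    Returns (True, result) if fn finished in time, (False, None) otherwise.
    The worker is not cancelled: a statement that is still blocked on the
    server stays blocked, the caller just regains control.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not wait for the (possibly still blocked) worker here.
        pool.shutdown(wait=False)
```

A deadline like this would only let the switchover fail fast instead of hanging; it does not remove the underlying problem that the old master can no longer collect enough ACKs.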

Even if rpl_semi_sync_master_timeout is not infinite, this situation will significantly increase switchover time, thus impacting the business even more.

In contrast, MHA’s switchover process avoids this issue because its process is as follows:

1. Block writes on the old master.
2. Wait for the new master to catch up, then remove its read-only restriction; at this point, the business can resume operations.
3. Repoint all replicas of the old master (except the new master) to the new master.
4. Finally, run change master on the old master so that it replicates from the new master.
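The ordering above can be sketched as a small plan builder. This is only my reading of the sequence described in this issue, not MHA's actual code; `mha_style_plan` and the step names are hypothetical.

```python
def mha_style_plan(old_master, new_master, replicas):
    """Return the switchover steps in the order described above.

    Each step is a tuple: (action, target) or (action, target, new_source).
    """
    others = [r for r in replicas if r != new_master]
    plan = [
        # Writes stop first, while the old master still has all its
        # replicas attached and can still collect semi-sync ACKs.
        ("block_writes", old_master),
        ("wait_for_catch_up", new_master),
        ("disable_read_only", new_master),
    ]
    # Only after the new master accepts writes are the siblings moved.
    plan += [("repoint", r, new_master) for r in others]
    # The old master is repointed last.
    plan.append(("repoint", old_master, new_master))
    return plan
```

For the topology in this issue, `mha_style_plan("master", "replica1", ["replica1", "replica2", "replica3"])` yields six steps, with `("block_writes", "master")` first and the repointing of replica2 and replica3 only after the new master has been opened for writes.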

I don’t understand why Orchestrator first lets the new master (replica1) take over its siblings before writes on the old master are blocked. This ordering introduces the issues that MHA avoids.

Fanduzi changed the title ~~GracefulMasterTakeover logic flaw in semi-synchronous replication scenario~~ GracefulMasterTakeover logic flaw in semi-sync replication scenario Sep 14, 2024
