https://github.com/ManageIQ/manageiq/pull/13065
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/b5c09d8b31e062d019afaf50b408ad0cda4db9b8

commit b5c09d8b31e062d019afaf50b408ad0cda4db9b8
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 12:03:57 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 12:11:53 2016 -0500

    Add logging around master server failover

    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 app/models/miq_server/server_monitor.rb | 8 ++++---
 1 file changed, 4 insertions(+), 4 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f4bb169c9ddeda867846f29aac8a84ac2df09556

commit f4bb169c9ddeda867846f29aac8a84ac2df09556
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 12:11:27 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 12:20:37 2016 -0500

    Abort takeover only if an active master exists

    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

    Previously, we would abort if a different master existed, even if it was
    shut down.

    * server 1 is master and shuts down
    * server 3 runs monitor_servers, becomes master and shuts down
    * server 2 runs monitor_servers AFTER 3 becomes master

    server 2 wouldn't take over as master because it sees the inactive server 3
    as master.

 app/models/miq_server/server_monitor.rb       | 10 ++++++---
 spec/models/miq_server/server_monitor_spec.rb | 32 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)
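
For readers following along, here is a rough sketch of the decision this commit changes. It is not the code from server_monitor.rb: the `Server` struct, the two `abort_takeover_*?` methods, and the "started"/"stopped" status strings are assumptions made up for illustration, using plain Ruby objects instead of the real models.

```ruby
# Hypothetical in-memory stand-in for a server record; not the real model.
Server = Struct.new(:name, :is_master, :status, keyword_init: true) do
  # Assumption: only a "started" server counts as active in this sketch.
  def active?
    status == "started"
  end
end

# Behaviour before the fix: abort whenever some other server is flagged
# as master, even if that server is shut down.
def abort_takeover_old?(me, servers)
  servers.any? { |s| !s.equal?(me) && s.is_master }
end

# Behaviour after the fix: abort only if the other master is also active.
def abort_takeover_new?(me, servers)
  servers.any? { |s| !s.equal?(me) && s.is_master && s.active? }
end

# State from the commit message: server 1 shut down, server 3 became
# master and then shut down, server 2 is the only one still running.
server1 = Server.new(name: "server 1", is_master: false, status: "stopped")
server3 = Server.new(name: "server 3", is_master: true,  status: "stopped")
server2 = Server.new(name: "server 2", is_master: false, status: "started")
servers = [server1, server2, server3]

puts "old check aborts takeover: #{abort_takeover_old?(server2, servers)}"
puts "new check aborts takeover: #{abort_takeover_new?(server2, servers)}"
```

With the old check, server 2 defers to the stopped server 3 and nobody is master; with the new check it goes ahead with the takeover.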
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/bbf28c21d5a406c48036377ca27a8722dc17994c

commit bbf28c21d5a406c48036377ca27a8722dc17994c
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 14:48:23 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 15:04:42 2016 -0500

    make_master_server uncached again!

    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

    We lock on the region row and base all of our server is_master queries and
    changes on it, therefore, it's really important we don't have a cached
    region.

 app/models/miq_server/server_monitor.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/c7d72a05479373803f3c79fa435cb022e7c7cc49

commit c7d72a05479373803f3c79fa435cb022e7c7cc49
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 15:37:37 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 15:39:59 2016 -0500

    Test a restarted server takeover from a stopped master

    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 spec/models/miq_server/server_monitor_spec.rb | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
To test this, you have two options:

1) start a single appliance
2) shut down that appliance (the only one), where is_master is 't' in the miq_servers table
3) start a new appliance while the "old" master from step 1) still has is_master 't' and status: stopped
4) see if the new appliance becomes is_master 't' and the role workers start... UI, webservice, websocket, etc.

Or, try to get perfect timing:

1) start 3 appliances, 1 will be master
2) shut down the is_master 't' appliance
3) as soon as another appliance becomes master, shut down the new master
4) see if the last remaining appliance takes over as master and starts role workers (UI, etc.)
Hi Joe,

I have tested this with 3 appliances, with the following results.

Tested scenario:
- 3 appliances (A, B, C)

1. A set up with internal DB
2. A evmserverd stopped
3. B set up with external DB, pointed at A (same region)
4. B's UI came up and B became active as master
5. C set up with external DB, pointed at A (same region)
6. C's UI comes up, but C does not become master at this point
7. B evmserverd stopped
8. C's UI is still active and C is promoted to master

Should C's UI be available at step 6, or should it only become active once B has been stopped?
Luke,

It looks like you're trying to test the harder scenario in comment 7. The results you have are correct and are what I'm expecting, but I don't think it confirms the issue. The test for this timing issue shows how hard it would be to test it:
https://github.com/ManageIQ/manageiq/pull/13065/files#diff-c27880fbafb49bc4827550741f32dd54R463

I suggest you try to confirm the second scenario, the most likely situation, especially when upgrading:

1) start appliance A, as master (is_master true in the miq_servers table)
2) stop appliance A, make sure the status is stopped in miq_servers
3) start appliance B, make sure it takes over as is_master true (appliance A becomes is_master false), and the UI role starts on B

If you do these 3 steps on an older version, step 3 would have appliance A as is_master, B would start as a non-master and never take over... no roles would activate.
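
The expected end state of those 3 steps can be written down as a small check. This is only an illustrative sketch: `failover_complete?` is a hypothetical helper, the rows are hand-written hashes standing in for miq_servers records, the `name` key and the "started" status value are assumptions, and nothing here queries a real database.

```ruby
# Hypothetical helper: returns true when the new server is the one and only
# master and is running, and the old master is stopped and demoted.
def failover_complete?(rows, old_master:, new_master:)
  old_row = rows.find { |r| r[:name] == old_master }
  new_row = rows.find { |r| r[:name] == new_master }
  return false unless old_row && new_row

  masters = rows.select { |r| r[:is_master] }
  masters == [new_row] &&
    new_row[:status] == "started" &&
    old_row[:status] == "stopped"
end

# Expected rows after step 3 with the fix applied.
fixed_rows = [
  { name: "A", is_master: false, status: "stopped" },
  { name: "B", is_master: true,  status: "started" }
]

# Rows as described for an older version: A keeps is_master while stopped.
older_rows = [
  { name: "A", is_master: true,  status: "stopped" },
  { name: "B", is_master: false, status: "started" }
]

puts failover_complete?(fixed_rows, old_master: "A", new_master: "B")
puts failover_complete?(older_rows, old_master: "A", new_master: "B")
```

The first call describes the state to look for when verifying; the second describes the stuck state from the bug, where the stopped appliance A is still flagged as master.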
Verified on 5.8.0.9 including upgrade from 5.7
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/a966d398f11277c999425c9c17a5bfc3c551c456

commit a966d398f11277c999425c9c17a5bfc3c551c456
Author:     Nick Carboni <ncarboni>
AuthorDate: Thu Dec 8 17:15:23 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Feb 17 15:48:04 2017 -0500

    Merge pull request #13065 from jrafanie/fix_master_server_failover_race_condition

    Fix master server failover race condition
    (cherry picked from commit 1eafadd79813e7472d12fca8842fa12ac60bd6ee)

    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 app/models/miq_server/server_monitor.rb       | 20 ++++++-----
 spec/models/miq_server/server_monitor_spec.rb | 49 +++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 8 deletions(-)