Bug 1402943 - After performing an upgrade, no role workers start on new appliances
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: GA
Target Release: 5.8.0
Assignee: Joe Rafaniello
QA Contact: luke couzens
URL:
Whiteboard:
Depends On:
Blocks: 1403983 1434964
 
Reported: 2016-12-08 17:04 UTC by Jared Deubel
Modified: 2020-01-17 16:19 UTC (History)

Fixed In Version: 5.8.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1403983 1434964
Environment:
Last Closed: 2017-06-12 17:10:40 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:



Comment 3 CFME Bot 2016-12-08 22:16:02 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/b5c09d8b31e062d019afaf50b408ad0cda4db9b8

commit b5c09d8b31e062d019afaf50b408ad0cda4db9b8
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 12:03:57 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 12:11:53 2016 -0500

    Add logging around master server failover
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 app/models/miq_server/server_monitor.rb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comment 4 CFME Bot 2016-12-08 22:16:07 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f4bb169c9ddeda867846f29aac8a84ac2df09556

commit f4bb169c9ddeda867846f29aac8a84ac2df09556
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 12:11:27 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 12:20:37 2016 -0500

    Abort takeover only if an active master exists
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1402943
    
    Previously, we would abort if a different master existed, even if it was
    shut down.
    
    * server 1 is master and shuts down
    * server 3 runs monitor_servers, becomes master and shuts down
    * server 2 runs monitor_servers AFTER 3 becomes master
    
    server 2 wouldn't take over as master because it sees the inactive
    server 3 as master.

 app/models/miq_server/server_monitor.rb       | 10 ++++++---
 spec/models/miq_server/server_monitor_spec.rb | 32 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)
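
The rule change described in this commit message can be sketched in plain Ruby. This is only an illustration, not the code from server_monitor.rb: the Server struct, the two abort_takeover_* helpers, and the "started" status value are hypothetical names chosen for the example. It replays the three-server sequence above from the point of view of server 2.

```ruby
# Hypothetical stand-in for a row in miq_servers; not the ManageIQ model.
Server = Struct.new(:name, :is_master, :status) do
  def active?
    status == "started" # assumed status value for a running server
  end
end

# Old rule: abort the takeover whenever a different server is flagged master.
def abort_takeover_old?(me, servers)
  servers.any? { |s| s.is_master && !s.equal?(me) }
end

# New rule: abort only if a different master is also active.
def abort_takeover_new?(me, servers)
  servers.any? { |s| s.is_master && !s.equal?(me) && s.active? }
end

# State after server 1 and then server 3 have shut down.
server1 = Server.new("server 1", false, "stopped")
server2 = Server.new("server 2", false, "started")
server3 = Server.new("server 3", true,  "stopped")
servers = [server1, server2, server3]

puts abort_takeover_old?(server2, servers) # true: server 2 never takes over
puts abort_takeover_new?(server2, servers) # false: server 2 may take over
```

Under the old rule the stopped server 3 blocks server 2 indefinitely; under the new rule only a running master can block a takeover.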

Comment 5 CFME Bot 2016-12-08 22:16:11 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/bbf28c21d5a406c48036377ca27a8722dc17994c

commit bbf28c21d5a406c48036377ca27a8722dc17994c
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 14:48:23 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 15:04:42 2016 -0500

    make_master_server uncached again!
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1402943
    
    We lock on the region row and base all of our server is_master
    queries and changes on it, therefore, it's really important we don't
    have a cached region.

 app/models/miq_server/server_monitor.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
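
A rough illustration of why a cached read undermines the lock, assuming a simplified in-memory model: the hash standing in for the database, the Mutex standing in for the region row lock, and the make_master helper with its cached: parameter are all hypothetical and do not reproduce the one-line change in this commit.

```ruby
# Try to promote `name` to master while holding the lock.
# `cached:` simulates deciding from rows that were read before the lock
# was taken; when it is nil, the decision uses the current rows.
def make_master(db, lock, name, cached: nil)
  lock.synchronize do
    rows = cached || db
    # Give up if some other server is already flagged master.
    next false if rows.any? { |other, row| row[:is_master] && other != name }
    db[name][:is_master] = true
    true
  end
end

db = {
  "server 2" => { is_master: false },
  "server 3" => { is_master: false }
}
lock = Mutex.new

stale = db.transform_values(&:dup)       # snapshot taken before anyone is master
make_master(db, lock, "server 3")        # server 3 becomes master

stale_result = make_master(db, lock, "server 2", cached: stale)
masters_with_stale_read = db.count { |_, row| row[:is_master] }

db["server 2"][:is_master] = false       # undo the bad promotion
fresh_result = make_master(db, lock, "server 2")

puts stale_result            # true: promoted from stale data
puts masters_with_stale_read # 2: two masters at once
puts fresh_result            # false: a fresh read sees server 3
```

Holding the lock serializes the writers, but it only helps if the decision inside the lock is made from data read after the lock was acquired.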

Comment 6 CFME Bot 2016-12-08 22:16:17 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/c7d72a05479373803f3c79fa435cb022e7c7cc49

commit c7d72a05479373803f3c79fa435cb022e7c7cc49
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Dec 8 15:37:37 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Dec 8 15:39:59 2016 -0500

    Test a restarted server takeover from a stopped master
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 spec/models/miq_server/server_monitor_spec.rb | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

Comment 7 Joe Rafaniello 2016-12-09 21:10:00 UTC
To test this, you have two options:

1) start a single appliance
2) shut down that appliance, which has is_master 't' in the miq_servers table
3) start a new appliance while the "old" master from step 1) still has is_master 't' and status: stopped
4) see if the new appliance becomes is_master 't' and the role workers start (UI, webservice, websocket, etc.)

Or, try to get perfect timing:

1) start 3 appliances, 1 will be master
2) shut down the is_master t appliance
3) As soon as an appliance becomes master, shut down the new master
4) See if the last remaining appliance takes over as master and starts role workers (UI, etc.)
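
For either option, the final check can be expressed as a small predicate over the miq_servers rows. This is a sketch only: the rows are hand-written hashes rather than real query results, failover_ok? is a hypothetical helper, and "started" is an assumed status value for a running appliance.

```ruby
# True when exactly one server is flagged master and that server is running.
def failover_ok?(rows)
  masters = rows.select { |row| row[:is_master] }
  masters.size == 1 && masters.first[:status] == "started"
end

# Failure case: the stopped appliance is still the only master.
stuck = [
  { name: "old appliance", is_master: true,  status: "stopped" },
  { name: "new appliance", is_master: false, status: "started" }
]

# Expected case: the new appliance has taken over.
taken_over = [
  { name: "old appliance", is_master: false, status: "stopped" },
  { name: "new appliance", is_master: true,  status: "started" }
]

puts failover_ok?(stuck)      # false
puts failover_ok?(taken_over) # true
```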

Comment 10 luke couzens 2017-03-28 14:55:32 UTC
Hi Joe, so I have tested this with 3 appliances with the following results.

Tested scenario:
 - 3 appliances (A, B, C)
 1. A set up with internal DB
 2. A evmserverd stopped
 3. B set up with external DB, pointed at A (same region)
 4. B's UI came up and B became active as master
 5. C set up with external DB, pointed at A (same region)
 6. C's UI comes up, but does not become master at this point
 7. B evmserverd stopped
 8. C's UI is still active and it is promoted to master

Should C's UI be available at this point, or should it only be active once B has been stopped?

Comment 11 Joe Rafaniello 2017-03-28 19:41:46 UTC
Luke,

It looks like you're trying to test the harder scenario in comment 7.  The results you have are correct and are what I'm expecting, but I don't think it confirms the issue.

The test for this timing issue shows how hard it would be to test it:
https://github.com/ManageIQ/manageiq/pull/13065/files#diff-c27880fbafb49bc4827550741f32dd54R463

I suggest you try to confirm the other, simpler scenario from comment 7, which is the most likely situation, especially when upgrading:

1) start appliance A, as master (is_master true in the miq_servers table)
2) stop appliance A, make sure the status is stopped in miq_servers
3) start appliance B, make sure it takes over as is_master true (appliance A becomes is_master false), and the UI role starts on B

If you do these 3 steps on an older version, appliance A would still be is_master in step 3, and B would start as a non-master and never take over, so no roles would activate.
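
The difference between the older and the fixed behavior in these 3 steps can be simulated roughly as follows. The Appliance struct, the monitor_pass helper, and its require_active_master flag are hypothetical; the flag simply switches between the two behaviors described in this bug and is not a real ManageIQ setting.

```ruby
# Hypothetical stand-in for a row in miq_servers.
Appliance = Struct.new(:name, :is_master, :status)

# One monitoring pass run by `me`. Returns true if `me` took over as master.
# require_active_master: false -> older behavior, any other master blocks
# require_active_master: true  -> fixed behavior, only a running master blocks
def monitor_pass(me, appliances, require_active_master:)
  blocker = appliances.find do |a|
    !a.equal?(me) && a.is_master &&
      (!require_active_master || a.status == "started")
  end
  return false if blocker

  # Take over: exactly one master remains afterwards.
  appliances.each { |a| a.is_master = a.equal?(me) }
  true
end

# Steps 1 and 2: A was master and is now stopped. Step 3: B starts.
build = -> { [Appliance.new("A", true, "stopped"), Appliance.new("B", false, "started")] }

older = build.call
older_result = monitor_pass(older[1], older, require_active_master: false)

fixed = build.call
fixed_result = monitor_pass(fixed[1], fixed, require_active_master: true)

puts older_result                   # false: B never takes over
puts older.map(&:is_master).inspect # [true, false]
puts fixed_result                   # true: B takes over
puts fixed.map(&:is_master).inspect # [false, true]
```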

Comment 12 luke couzens 2017-04-05 10:45:19 UTC
Verified on 5.8.0.9 including upgrade from 5.7

Comment 13 CFME Bot 2017-04-20 22:51:35 UTC
New commit detected on ManageIQ/manageiq/darga:
https://github.com/ManageIQ/manageiq/commit/a966d398f11277c999425c9c17a5bfc3c551c456

commit a966d398f11277c999425c9c17a5bfc3c551c456
Author:     Nick Carboni <ncarboni>
AuthorDate: Thu Dec 8 17:15:23 2016 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Feb 17 15:48:04 2017 -0500

    Merge pull request #13065 from jrafanie/fix_master_server_failover_race_condition
    
    Fix master server failover race condition
    (cherry picked from commit 1eafadd79813e7472d12fca8842fa12ac60bd6ee)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1402943

 app/models/miq_server/server_monitor.rb       | 20 ++++++-----
 spec/models/miq_server/server_monitor_spec.rb | 49 +++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 8 deletions(-)

