Bug 1564567

Summary: User Interface does not come up after reboot
Product: Red Hat CloudForms Management Engine Reporter: Ryan Spagnola <rspagnol>
Component: Appliance Assignee: Joe Rafaniello <jrafanie>
Status: CLOSED CURRENTRELEASE QA Contact: Tasos Papaioannou <tpapaioa>
Severity: urgent Docs Contact:
Priority: high    
Version: 5.7.0 CC: abellott, cpelland, jrafanie, obarenbo, tpapaioa
Target Milestone: GA Keywords: TestOnly, ZStream
Target Release: 5.10.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1568158 1568159 Environment:
Last Closed: 2019-02-11 14:01:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1568158, 1568159    

Description Ryan Spagnola 2018-04-06 15:50:30 UTC
Description of problem:
Even after adjusting the memory thresholds, the appliance still has disabled roles (see additional information below).

Version-Release number of selected component (if applicable):
5.7.3.2

Additional info:
With the Rails console:
[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]

But in role_management.rb, I added this log:
  def has_active_role?(role)
    _log.info("RoleNames:#{active_role_names.inspect}, checking role:#{role}.")
    active_role_names.include?(role.to_s.strip.downcase)
  end

In my evm.log, I saw this:
[----] I, [2018-04-06T00:12:42.566424 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.569954 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:reporting.
[----] I, [2018-04-06T00:12:42.572754 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:smartproxy.
[----] I, [2018-04-06T00:12:42.574435 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.575909 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_metrics_collector.
[----] I, [2018-04-06T00:12:42.577229 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:websocket.
[----] I, [2018-04-06T00:12:42.585827 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:vmdb_storage_bridge.


The problem is that the roles for this server are not loaded before miq_worker starts to run.

~~~~

Here is more info:

[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
irb(main):002:0> MiqServer.my_server.active_role_names
=> []
irb(main):003:0> MiqServer.my_server.inactive_role_names
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
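
The relationship between the three lists in the console output above can be sketched roughly as follows. This is a minimal illustration, assuming that an inactive role is simply an assigned role that has not been activated; ServerRoles is a hypothetical stand-in and not the MiqServer model.

```ruby
# Hypothetical stand-in for the role lists on a server record; not ManageIQ code.
ServerRoles = Struct.new(:assigned_role_names, :active_role_names) do
  # Assumption: a role is inactive when it is assigned but not yet activated.
  def inactive_role_names
    assigned_role_names - active_role_names
  end

  # Same check as the has_active_role? method quoted from role_management.rb.
  def has_active_role?(role)
    active_role_names.include?(role.to_s.strip.downcase)
  end
end

server = ServerRoles.new(
  %w[database_owner event notifier user_interface web_services websocket],
  [] # nothing has been activated, as in the console output above
)

puts server.inactive_role_names.inspect
puts server.has_active_role?(:user_interface)
```

With an empty active list, every assigned role shows up as inactive and every has_active_role? check is false, which matches the empty RoleNames in the evm.log lines.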

Comment 3 Joe Rafaniello 2018-04-06 20:41:14 UTC
We had a discussion with the customer with the hope of fixing the
problem but also trying to understand the root cause.

The customer reported they rebooted appliances after reconfiguring
memory thresholds.

When the appliances were rebooted, the server responsible for
distributing roles (master server) was changed. The new master server
was then encountering a timeout when it was activating roles.  This
prevented restarted appliances from being given roles.  Upon further
inspection, we found a higher latency to the database from the master
server encountering this timeout.  This latency could be responsible
for the inability to assign roles due to the timeout.

We forced the master server to move to a different appliance without
such a large latency. When a new master server took over, previously
restarted appliances started to be given roles as expected.

We believe the default 1 minute timeout for this very important work
is too short, so we will be increasing it.
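
To illustrate how database latency can push role activation past a fixed timeout, here is a minimal sketch. It is not the ManageIQ implementation: monitor_roles is a hypothetical method, Ruby's Timeout.timeout stands in for the lock timeout, sleep stands in for a database round trip, and the durations are scaled down to fractions of a second.

```ruby
require "timeout"

# Hypothetical sketch: activate roles one at a time, each costing one
# simulated database round trip, under a single overall timeout.
def monitor_roles(role_names, latency:, timeout:)
  activated = []
  Timeout.timeout(timeout) do
    role_names.each do |role|
      sleep(latency) # stands in for a slow database round trip
      activated << role
    end
  end
  { activated: activated, timed_out: false }
rescue Timeout::Error
  # Roles not reached before the deadline are never activated.
  { activated: activated, timed_out: true }
end

roles = %w[event notifier user_interface]

low_latency  = monitor_roles(roles, latency: 0.01, timeout: 1.0)
high_latency = monitor_roles(roles, latency: 0.2,  timeout: 0.1)

puts low_latency.inspect
puts high_latency.inspect
```

In the high-latency run the deadline passes before the work finishes, so some roles are never handed out. Raising the timeout, or moving the work to a server with lower latency, lets the loop complete.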

Comment 5 CFME Bot 2018-04-10 13:32:07 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/1f564cddadf625bfaf044fa6b1b6932f45c8d8dd
commit 1f564cddadf625bfaf044fa6b1b6932f45c8d8dd
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Apr  6 17:37:03 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Apr  6 17:37:03 2018 -0400

    Add timeout knob for monitoring server roles

    https://bugzilla.redhat.com/show_bug.cgi?id=1564567

    Monitoring server roles as the master server is so important, it should
    finish and not ever timeout. If it times out, servers will not be able
    to gain roles. Previously, the default lock timeout of 1 minute is too
    low in situations where the master server has higher than normal
    latency to the database.  We need to give it more time to finish before
    timing it out.

    Additionally, we can specify this value in advanced settings in the server
    section if 5.minutes is still not enough or just a wrong value.

 app/models/miq_server/role_management.rb | 6 +-
 config/settings.yml | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

Comment 6 Joe Rafaniello 2018-04-10 15:05:13 UTC
We added a monitor_server_roles_timeout setting to the "server" section of the advanced settings. The default is now 5 minutes (previously 1 minute), and the value can be configured on a case-by-case basis.
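
As a rough sketch of how such a setting could be read with a fallback to the new default: the helper names below are hypothetical, the settings are modeled as a plain nested hash, and the "5.minutes" string form is an assumption based on the wording of the commit message rather than on the actual settings file.

```ruby
# Hypothetical helper: convert a duration such as "5.minutes" or 300 to seconds.
UNIT_SECONDS = { "second" => 1, "minute" => 60, "hour" => 3600 }.freeze

def duration_in_seconds(value)
  return value if value.is_a?(Integer)

  match = /\A(\d+)\.(second|minute|hour)s?\z/.match(value.to_s)
  raise ArgumentError, "unrecognized duration: #{value.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2])
end

# Default of 5 minutes, as stated in the comment above.
DEFAULT_MONITOR_SERVER_ROLES_TIMEOUT = "5.minutes"

# Assumption: settings is a nested hash with a :server section.
def monitor_server_roles_timeout(settings)
  raw = settings.dig(:server, :monitor_server_roles_timeout) ||
        DEFAULT_MONITOR_SERVER_ROLES_TIMEOUT
  duration_in_seconds(raw)
end

default_timeout = monitor_server_roles_timeout({})
custom_timeout  = monitor_server_roles_timeout(
  { server: { monitor_server_roles_timeout: "10.minutes" } }
)

puts default_timeout
puts custom_timeout
```

An appliance with unusually high database latency would override the value in its own settings, while every other appliance keeps the default.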

Comment 9 Tasos Papaioannou 2018-07-09 18:48:23 UTC
Verified on 5.10.0.3.