1564567 – User Interface does not come up after reboot

Summary: User Interface does not come up after reboot

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.7.0
Hardware:	All
OS:	All
Priority:	high
Severity:	urgent
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Joe Rafaniello
QA Contact:	Tasos Papaioannou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1568158 1568159

Reported:	2018-04-06 15:50 UTC by Ryan Spagnola
Modified:	2021-09-09 13:38 UTC
CC List:	5 users
Fixed In Version:	5.10.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1568158 1568159
Environment:
Last Closed:	2019-02-11 14:01:50 UTC
Category:	Bug
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Description Ryan Spagnola 2018-04-06 15:50:30 UTC

Description of problem:
Even after adjusting memory thresholds, the appliance has disabled roles (see additional information below).

Version-Release number of selected component (if applicable):
5.7.3.2

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
With the Rails console:
[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]

But in role_management.rb, I added this log:
  def has_active_role?(role)
    _log.info("RoleNames:#{active_role_names.inspect}, checking role:#{role}.")
    active_role_names.include?(role.to_s.strip.downcase)
  end

In my evm.log, I saw this:
[----] I, [2018-04-06T00:12:42.566424 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.569954 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:reporting.
[----] I, [2018-04-06T00:12:42.572754 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:smartproxy.
[----] I, [2018-04-06T00:12:42.574435 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.575909 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_metrics_collector.
[----] I, [2018-04-06T00:12:42.577229 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:websocket.
[----] I, [2018-04-06T00:12:42.585827 #23033:bd7130]  INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:vmdb_storage_bridge.


The problem is that the roles for this server are not loaded before the miq_worker starts to run.

~~~~

Here is more info:

[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
irb(main):002:0> MiqServer.my_server.active_role_names
=> []
irb(main):003:0> MiqServer.my_server.inactive_role_names
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
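The console output above is consistent with a server whose roles are all assigned but none of them activated. A minimal sketch of that relationship in plain Ruby follows. `AssignedRole` and `FakeServer` are hypothetical stand-ins made up for illustration, not the ManageIQ models, which read these values from the database.

```ruby
# Hypothetical stand-ins for illustration only; not the ManageIQ models.
AssignedRole = Struct.new(:name, :active)

class FakeServer
  def initialize(roles)
    @roles = roles
  end

  # Every role assigned to the server, active or not.
  def assigned_role_names
    @roles.map(&:name).sort
  end

  # Only the assigned roles that have been activated.
  def active_role_names
    @roles.select(&:active).map(&:name).sort
  end

  # Assigned roles that are still waiting to be activated.
  def inactive_role_names
    @roles.reject(&:active).map(&:name).sort
  end

  # Same check as the method instrumented in the report above.
  def has_active_role?(role)
    active_role_names.include?(role.to_s.strip.downcase)
  end
end

names = %w[database_owner event notifier user_interface web_services websocket]
server = FakeServer.new(names.map { |n| AssignedRole.new(n, false) })
```

With every role assigned but inactive, `server.has_active_role?("user_interface")` is false, which would match the empty `RoleNames:[]` entries in evm.log.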

Comment 3 Joe Rafaniello 2018-04-06 20:41:14 UTC

We had a discussion with the customer with the hope of fixing the
problem but also trying to understand the root cause.

The customer reported they rebooted appliances after reconfiguring
memory thresholds.

When the appliances were rebooted, the server responsible for
distributing roles (master server) was changed. The new master server
was then encountering a timeout when it was activating roles.  This
prevented restarted appliances from being given roles.  Upon further
inspection, we found a higher latency to the database from the master
server encountering this timeout.  This latency could be responsible
for the inability to assign roles due to the timeout.

We forced the master server to move to a different appliance without
such a large latency. When a new master server took over, previously
restarted appliances started to be given roles as expected.

We believe the default 1 minute timeout for this very important work
is too small so we will be increasing it.
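To illustrate why database latency combined with a short timeout can leave servers without roles, here is a rough sketch. It uses Ruby's standard `Timeout` module in place of the lock timeout used by the real code, and `activate_roles` with its `latency:` and `timeout:` parameters is a hypothetical helper, not part of ManageIQ. The durations are scaled down so the example runs quickly.

```ruby
require "timeout"

# Hypothetical helper: pretend each role activation costs one database
# round trip of `latency` seconds, and give the whole pass `timeout` seconds.
def activate_roles(role_names, latency:, timeout:)
  activated = []
  Timeout.timeout(timeout) do
    role_names.each do |name|
      sleep(latency) # stands in for a slow database call
      activated << name
    end
  end
  activated
rescue Timeout::Error
  # The pass did not finish, so report a timeout instead of a role list.
  :timed_out
end

roles = %w[event notifier user_interface web_services websocket]

# Total work is about 5 * 0.02 = 0.1 seconds.
low_timeout  = activate_roles(roles, latency: 0.02, timeout: 0.05)
high_timeout = activate_roles(roles, latency: 0.02, timeout: 1.0)
```

Under these assumptions the first call gives up before finishing, while the second completes and returns all five role names. Raising the timeout does not remove the latency; it only gives the pass enough time to finish.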

Comment 4 CFME Bot 2018-04-09 17:25:48 UTC

https://github.com/ManageIQ/manageiq/pull/17265

Comment 5 CFME Bot 2018-04-10 13:32:07 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/1f564cddadf625bfaf044fa6b1b6932f45c8d8dd
commit 1f564cddadf625bfaf044fa6b1b6932f45c8d8dd
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Apr  6 17:37:03 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Apr  6 17:37:03 2018 -0400

    Add timeout knob for monitoring server roles

    https://bugzilla.redhat.com/show_bug.cgi?id=1564567

    Monitoring server roles as the master server is so important, it should
    finish and not ever timeout. If it times out, servers will not be able
    to gain roles. Previously, the default lock timeout of 1 minute is too
    low in situations where the master server has higher than normal
    latency to the database.  We need to give it more time to finish before
    timing it out.

    Additionally, we can specify this value in advanced settings in the server
    section if 5.minutes is still not enough or just a wrong value.

 app/models/miq_server/role_management.rb | 6 +-
 config/settings.yml | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

Comment 6 Joe Rafaniello 2018-04-10 15:05:13 UTC

We added a monitor_server_roles_timeout setting to the "server" section of the advanced settings.  The default is now 5 minutes (previously 1 minute), and the value can be configured on a case-by-case basis.
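A minimal sketch of how such a setting could be read with a fallback to the 5 minute default. Only the setting name `monitor_server_roles_timeout` and its place in the server section come from this bug. The nested hash, the `duration_in_seconds` parser, and the assumption that values are written as strings like "5.minutes" are hypothetical and may not match how the appliance stores or parses the value.

```ruby
# New default from this bug, expressed in seconds.
DEFAULT_TIMEOUT = 5 * 60

# Hypothetical parser for values written like "5.minutes".
UNITS = { "second" => 1, "minute" => 60, "hour" => 3600 }.freeze

def duration_in_seconds(value)
  return value if value.is_a?(Integer)

  match = /\A(\d+)\.(second|minute|hour)s?\z/.match(value.to_s)
  raise ArgumentError, "unsupported duration: #{value.inspect}" unless match

  match[1].to_i * UNITS.fetch(match[2])
end

# Look up the setting in a plain nested hash (an assumption made for this
# sketch) and fall back to the default when it is absent.
def monitor_server_roles_timeout(settings)
  raw = settings.dig(:server, :monitor_server_roles_timeout)
  raw.nil? ? DEFAULT_TIMEOUT : duration_in_seconds(raw)
end

default_value = monitor_server_roles_timeout({})
custom_value  = monitor_server_roles_timeout(
  { server: { monitor_server_roles_timeout: "10.minutes" } }
)
```

Here `default_value` is 300 seconds and `custom_value` is 600 seconds. An appliance with unusually high database latency could be given a larger value without changing the default for everyone else.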

Comment 9 Tasos Papaioannou 2018-07-09 18:48:23 UTC

Verified on 5.10.0.3.