Description of problem:
Even after the memory thresholds were adjusted, the appliance still has disabled roles (see additional information below).

Version-Release number of selected component (if applicable):
5.7.3.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
With the Rails console:

[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]

But in role_management.rb, I added this log:

def has_active_role?(role)
  _log.info("RoleNames:#{active_role_names.inspect}, checking role:#{role}.")
  active_role_names.include?(role.to_s.strip.downcase)
end

In my evm.log, I saw this:

[----] I, [2018-04-06T00:12:42.566424 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.569954 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:reporting.
[----] I, [2018-04-06T00:12:42.572754 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:smartproxy.
[----] I, [2018-04-06T00:12:42.574435 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_inventory.
[----] I, [2018-04-06T00:12:42.575909 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:storage_metrics_collector.
[----] I, [2018-04-06T00:12:42.577229 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:websocket.
[----] I, [2018-04-06T00:12:42.585827 #23033:bd7130] INFO -- : MIQ(MiqServer#has_active_role?) RoleNames:[], checking role:vmdb_storage_bridge.

The problem is that the roles for this server are not loaded before miq_worker starts to run.
~~~~
Here is more info:

[root@dnvrco03-cfui10-01 vmdb]# bin/rails c
Loading production environment (Rails 5.0.2)
irb(main):001:0> MiqServer.my_server.assigned_role_names
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
irb(main):002:0> MiqServer.my_server.active_role_names
=> []
irb(main):003:0> MiqServer.my_server.inactive_role_names
=> ["database_owner", "event", "notifier", "user_interface", "web_services", "websocket"]
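The three console results above are consistent with every assigned role being inactive. A minimal sketch of that relationship, using plain arrays rather than the real MiqServer model. The inactive_roles helper is hypothetical, and treating "inactive" as "assigned minus active" is an assumption drawn from the output above, not a statement about how ManageIQ computes it.

```ruby
# Hypothetical helper, not part of ManageIQ. Assumes the inactive roles are
# simply the assigned roles that are not currently active.
def inactive_roles(assigned, active)
  assigned - active
end

# Values copied from the console session on the affected appliance.
assigned = %w[database_owner event notifier user_interface web_services websocket]
active   = []

stuck = inactive_roles(assigned, active)
puts "assigned but not active: #{stuck.inspect}" unless stuck.empty?
```

A non-empty result on a server that has been up for a while suggests the roles were assigned but never activated, which is the symptom reported here.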
We discussed this with the customer, both to fix the problem and to understand the root cause.

The customer reported that they rebooted the appliances after reconfiguring the memory thresholds. During the reboots, the server responsible for distributing roles (the master server) changed. The new master server then hit a timeout while activating roles, which prevented the restarted appliances from being given roles.

On further inspection, we found that the master server hitting this timeout had a higher latency to the database. This latency could be what pushed role assignment past the timeout. We forced the master server role to move to a different appliance with lower latency. Once the new master server took over, the previously restarted appliances started to be given roles as expected.

We believe the default 1-minute timeout is too small for work this important, so we will be increasing it.
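The failure mode described above can be shown in miniature: when a pass over the roles takes longer than its timeout, the pass is aborted and nothing is activated. This sketch uses Ruby's standard Timeout module, not the lock timeout ManageIQ actually uses. The activate_roles method and the latency figures are hypothetical stand-ins, scaled down to fractions of a second.

```ruby
require "timeout"

# Hypothetical stand-in for the master server's role activation pass.
# `latency` simulates one database round trip per role, in seconds.
def activate_roles(roles, latency:, timeout:)
  activated = []
  Timeout.timeout(timeout) do
    roles.each do |role|
      sleep(latency) # simulated database round trip
      activated << role
    end
  end
  activated
rescue Timeout::Error
  # In this sketch a timed-out pass is abandoned, so no roles are reported.
  []
end

roles = %w[event notifier user_interface]

# Low latency: the pass finishes well inside the timeout.
p activate_roles(roles, latency: 0.01, timeout: 1)

# High latency: the same kind of pass is cut off before it completes.
p activate_roles(roles, latency: 0.2, timeout: 0.3)
```

Raising the timeout, or moving the work to a server with lower latency, both let the pass complete, which matches the two remedies discussed in this comment.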
https://github.com/ManageIQ/manageiq/pull/17265
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/1f564cddadf625bfaf044fa6b1b6932f45c8d8dd

commit 1f564cddadf625bfaf044fa6b1b6932f45c8d8dd
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Apr 6 17:37:03 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Apr 6 17:37:03 2018 -0400

    Add timeout knob for monitoring server roles

    https://bugzilla.redhat.com/show_bug.cgi?id=1564567

    Monitoring server roles as the master server is so important, it should
    finish and not ever timeout. If it times out, servers will not be able
    to gain roles.

    Previously, the default lock timeout of 1 minute is too low in
    situations where the master server has higher than normal latency to the
    database. We need to give it more time to finish before timing it out.

    Additionally, we can specify this value in advanced settings in the
    server section if 5.minutes is still not enough or just a wrong value.

 app/models/miq_server/role_management.rb | 6 +-
 config/settings.yml | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)
We added a monitor_server_roles_timeout setting to the "server" section of the advanced settings. The default is now 5 minutes (previously 1 minute), and the value can be configured on a case-by-case basis.
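As a rough illustration of a configurable timeout with a default, the sketch below resolves the value from a settings hash. It is not the ManageIQ implementation: the role_monitor_timeout helper, the hash layout, and the duration-string parsing are all assumptions made for this example. Only the setting name, the "server" section, and the 5-minute default come from the comment above.

```ruby
# Hypothetical helper, not ManageIQ code: resolve the role-monitoring timeout
# in seconds from a settings hash, falling back to a 5-minute default.
def role_monitor_timeout(settings, default: 300)
  units = { "second" => 1, "minute" => 60, "hour" => 3600 }

  raw = settings.dig(:server, :monitor_server_roles_timeout)
  return default if raw.nil?
  return raw if raw.is_a?(Integer)

  # Accept duration strings such as "5.minutes" or "90.seconds".
  match = /\A(\d+)\.(second|minute|hour)s?\z/.match(raw.to_s)
  raise ArgumentError, "unrecognized duration: #{raw.inspect}" unless match

  match[1].to_i * units.fetch(match[2])
end

# No override configured: the default applies.
puts role_monitor_timeout({})

# A site with a slow database link raises the value.
puts role_monitor_timeout({ server: { monitor_server_roles_timeout: "10.minutes" } })
```

Rejecting unrecognized values with an error, rather than silently using the default, is a design choice of this sketch so that a mistyped override does not go unnoticed.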
Verified on 5.10.0.3.