Bug 1704835 - HA DB fail-over after minor version upgrade using webui fails on missing ///tmp/worker_monitor...
Summary: HA DB fail-over after minor version upgrade using webui fails on missing ///...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.10.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.11.0
Assignee: Nick Carboni
QA Contact: Jaroslav Henner
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-30 15:45 UTC by Jaroslav Henner
Modified: 2019-12-12 13:36 UTC (History)

Fixed In Version: 5.11.0.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-12 13:36:21 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments
evm-failover-monitor log (11.27 KB, text/plain)
2019-04-30 16:18 UTC, Jaroslav Henner


Links
System: Red Hat Product Errata
ID: RHBA-2019:4199
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2019-12-12 13:36:35 UTC

Description Jaroslav Henner 2019-04-30 15:45:58 UTC
Description of problem:
After upgrading the evm appliance using the web UI method and then stopping the primary DB server, evmserverd enters a state in which it responds with an error and does not recover.

Version-Release number of selected component (if applicable):
5.10.3.3 -> 5.10.4.0

How reproducible:
 * Quite often with the update step in place (an estimated 80% of attempts or more).
 * Never without the update step.

Steps to Reproduce:
1. Prepare an environment with 3 appliances: 1 primary DB, 1 secondary DB, and 1 evm service appliance, with the HA monitor enabled on the evm appliance.
2. Update the evm appliance using the web UI (cfme.fixtures.cli.update_appliance() in the integration_tests).
3. Stop the primary DB (systemctl stop $APPLIANCE_PG_SERVICE).

Actual results:
evmserverd fails to recover and complains about the missing monitor Unix socket file.


Expected results:
evmserverd recovers, and the DB role is handed over to the former secondary.

Additional info:

Comment 2 Jaroslav Henner 2019-04-30 16:16:10 UTC
I tried running

systemctl restart evm-failover-monitor

on the evm appliance after the upgrade, before stopping the PG service. The error did not reproduce; the server recovered OK.
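
As a rough illustration of scripting this workaround as a post-update step, here is a minimal Ruby sketch. The unit name is taken from the command above; the helper name `restart_failover_monitor` and its `runner` argument are hypothetical, introduced only so the helper can be exercised without systemd. It is not code from the product or from the integration tests.

```ruby
# Minimal sketch of a post-update workaround step.
# Assumption: the helper name and the injectable runner are illustrative only;
# the systemd unit name is the one used in the comment above.
def restart_failover_monitor(runner: ->(*cmd) { system(*cmd) })
  cmd = ["systemctl", "restart", "evm-failover-monitor"]

  # Kernel#system returns true only when the command exits with status 0.
  ok = runner.call(*cmd)
  raise "restart failed: #{cmd.join(' ')}" unless ok

  cmd
end

# Usage (as root on the evm appliance, after the update finishes):
#   restart_failover_monitor
```

Passing a different `runner` lambda lets the helper be checked on a machine that has no failover monitor service installed.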

Comment 3 Jaroslav Henner 2019-04-30 16:18:10 UTC
Created attachment 1560359
evm-failover-monitor log

Comment 5 Nick Carboni 2019-05-08 20:13:01 UTC
I suspect there was some issue with the source being moved out from under the running process.

This was the specific error in the failover monitor:

May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require': cannot load such file -- MiqSockUtil (LoadError)
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/host.rb:2:in `<top (required)>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:380:in `block in require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `block in load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:12:in `block in loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/concurrency/share_lock.rb:150:in `exclusive'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:11:in `loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:358:in `require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:512:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:544:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/miq_event.rb:17:in `<class:MiqEvent>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/miq_event.rb:1:in `<top (required)>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:380:in `block in require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `block in load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:12:in `block in loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/concurrency/share_lock.rb:150:in `exclusive'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:11:in `loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:358:in `require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:512:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:544:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/lib/evm_database.rb:113:in `raise_server_event'

This file (MiqSockUtil) is provided by manageiq-gems-pending. Since that gem is git-based, it is more than likely that the location of the file changed during the upgrade.
This means the running process could have had an out-of-date location in memory for where to look for that file, causing the load error.

This would also mean that restarting the failover monitor service after the upgrade should solve the issue, which Jaroslav says is indeed the case in comment 2.
Restarting is also a good idea in general, since the failover monitor code itself could have changed during the upgrade and we would want the latest version running.
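
The suspected mechanism can be reproduced in isolation. The sketch below is a hedged illustration, not the appliance code: the directory names and the `demo_sock_util` file are made-up stand-ins for a git-based gem whose install path changes on upgrade. A process records a library directory in `$LOAD_PATH`, the directory is moved while the process keeps running, and a later lazy `require` fails with the same kind of LoadError seen in the log.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for a git-based gem whose install directory name
# embeds a revision, so an upgrade moves the source to a new path.
def stale_load_path_demo
  Dir.mktmpdir do |root|
    old_lib = File.join(root, "gems-pending-aaaaaaa", "lib")
    new_lib = File.join(root, "gems-pending-bbbbbbb", "lib")
    FileUtils.mkdir_p(old_lib)
    File.write(File.join(old_lib, "demo_sock_util.rb"), "module DemoSockUtil; end\n")

    # The long-running process records the old directory at boot.
    $LOAD_PATH.unshift(old_lib)

    # The "upgrade": the source moves while the process keeps running.
    FileUtils.mkdir_p(File.dirname(new_lib))
    FileUtils.mv(old_lib, new_lib)

    begin
      # The first (lazy) require happens only after the move.
      require "demo_sock_util"
      :loaded
    rescue LoadError => e
      e.message
    ensure
      $LOAD_PATH.delete(old_lib)
    end
  end
end

result = stale_load_path_demo
puts result
```

A freshly started process would compute its load path from the new location, which is consistent with a service restart clearing the error.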

Comment 6 Jaroslav Henner 2019-07-23 13:04:05 UTC
This seems to work. I removed the code that restarts the evm_failover_monitor after the update. When the failover then occurs, the web UI survives it on cfme 5.11.0.15, but the same code fails on 5.10.7.1. So VERIFIED.

Comment 8 errata-xmlrpc 2019-12-12 13:36:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4199

