Description of problem:
After upgrading the evm appliance using the web UI method and then stopping the primary DB server, evmserverd goes into a state where it responds with an error and does not recover.

Version-Release number of selected component (if applicable):
5.10.3.3 -> 5.10.4.0

How reproducible:
* Quite often with the update step in place (maybe 80% at least).
* Never without the update step.

Steps to Reproduce:
1. Prepare an environment with 3 appliances: 1 primary DB, 1 secondary DB and one evm service. HA monitor enabled on the evm appliance.
2. Update the evm appliance using the web UI (cfme.fixtures.cli.update_appliance() in the integration_tests).
3. Stop the primary DB (systemctl stop $APPLIANCE_PG_SERVICE).

Actual results:
The evmserverd fails to recover, complaining about the missing monitor unix socket file.

Expected results:
The evmserverd recovers, with the DB handed over to the former secondary.

Additional info:
I tried systemctl restart evm-failover-monitor on the evm appliance after upgrade, before stopping the PG service. The error didn't reproduce -- the server recovered OK
Created attachment 1560359: evm-failover-monitor log
I suspect there was some issue with the source being moved out from under the running process. This was the specific error in the failover monitor:

May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require': cannot load such file -- MiqSockUtil (LoadError)
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/host.rb:2:in `<top (required)>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:380:in `block in require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `block in load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:12:in `block in loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/concurrency/share_lock.rb:150:in `exclusive'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:11:in `loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:358:in `require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:512:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:544:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/miq_event.rb:17:in `<class:MiqEvent>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/app/models/miq_event.rb:1:in `<top (required)>'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `block in require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:259:in `load_dependency'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:293:in `require'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:380:in `block in require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `block in load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:12:in `block in loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/concurrency/share_lock.rb:150:in `exclusive'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies/interlock.rb:11:in `loading'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:37:in `load_interlock'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:358:in `require_or_load'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:512:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:544:in `load_missing_constant'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /opt/rh/cfme-gemset/gems/activesupport-5.0.7.2/lib/active_support/dependencies.rb:203:in `const_missing'
May 08 15:17:42 localhost.localdomain evm-failover-monitor[4358]: from /var/www/miq/vmdb/lib/evm_database.rb:113:in `raise_server_event'

This file (MiqSockUtil) is provided by manageiq-gems-pending. Since that gem is git-based, it's more than likely that the location of the file changed during the upgrade. This means that the running process could have had an out-of-date location in memory for where to go looking for that particular file, causing the load error.

This would also mean that restarting the failover monitor service after the upgrade should solve the issue, which Jaroslav says is indeed the case in comment 2. This is also a generally good idea, since the failover monitor code itself could have been changed during the upgrade process and we would want the latest running.
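To illustrate the suspected mechanism, here is a minimal standalone Ruby sketch. It is not the failover monitor code: the directory names (demo-gem-aaaaaaa, demo-gem-bbbbbbb) and the file demo_sock_util.rb are hypothetical stand-ins for a git-based gem whose install directory changes on upgrade. It only shows that a process holding an old directory in $LOAD_PATH gets a LoadError once the directory moves, and that a fresh lookup against the new location (what a service restart would effectively do) succeeds.

```ruby
require "tmpdir"
require "fileutils"

# Simulates a long-running process whose load path points at a gem directory
# that is moved by an "upgrade". All names here are made up for the demo.
def simulate_stale_load_path
  Dir.mktmpdir do |root|
    old_lib = File.join(root, "demo-gem-aaaaaaa", "lib")
    new_lib = File.join(root, "demo-gem-bbbbbbb", "lib")
    FileUtils.mkdir_p(old_lib)
    File.write(File.join(old_lib, "demo_sock_util.rb"), "module DemoSockUtil; end\n")

    # At boot, the process records the old gem directory in its load path.
    $LOAD_PATH.unshift(old_lib)

    # The "upgrade" moves the gem to a new directory while the process runs.
    FileUtils.mkdir_p(File.dirname(new_lib))
    FileUtils.mv(old_lib, new_lib)

    # A lazy require now searches the stale directory and fails.
    stale_error =
      begin
        require("demo_sock_util")
        nil
      rescue LoadError => e
        e.message
      end

    # A "restart" recomputes the load path, so the same require succeeds.
    $LOAD_PATH.delete(old_lib)
    $LOAD_PATH.unshift(new_lib)
    reloaded = require("demo_sock_util")
    $LOAD_PATH.delete(new_lib)

    [stale_error, reloaded]
  end
end

stale_error, reloaded = simulate_stale_load_path
puts stale_error
puts reloaded
```

The first line printed has the same shape as the "cannot load such file" message in the log above, which is consistent with the stale-path theory but does not by itself prove it for the appliance.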
This seems to work. I removed the code that restarts the evm_failover_monitor after the update; when the failover then occurs, the web UI survives it on CFME 5.11.0.15, while the same code fails on 5.10.7.1. So VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4199