Description of problem:

Running "yum update" on a RHEL 6 host running RHEV 3.2 packages (vdsm 4.10) to update directly to RHEV 3.4 packages (vdsm 4.14) causes vdsm to fail to start because the libvirt module is not configured.

Version-Release number of selected component (if applicable):
vdsm-4.14.7-3.el6ev

How reproducible:
Always

Steps to Reproduce:
1. Have a RHEL 6 host running RHEV 3.2 packages.
2. Run "yum update".

Actual results:

- The postinstall script fails:

[...]
  Installing : 1:libguestfs-tools-c-1.20.11-2.el6.x86_64                 42/58
  Updating   : vdsm-4.14.7-3.el6ev.x86_64                                43/58
warning: /etc/vdsm/vdsm.conf created as /etc/vdsm/vdsm.conf.rpmnew

Checking configuration status...

Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 230, in configure
    service.service_stop(s)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 370, in service_stop
    return _runAlts(_srvStopAlts, srvName)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 351, in _runAlts
    "%s failed" % alt.func_name, out, err)
vdsm.tool.service.ServiceOperationError: ServiceOperationError: _serviceStop failed
Sending stop signal sanlock (2081): [  OK  ]
Waiting for sanlock (2081) to stop:[FAILED]
  Cleanup    : vdsm-4.10.2-25.0.el6ev.x86_64                             44/58
  Cleanup    : vdsm-cli-4.10.2-25.0.el6ev.noarch                         45/58

- vdsm does not start because the libvirt module is not configured:

# /etc/init.d/vdsmd start
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service is already configured

Modules libvirt are not configured
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 273, in isconfigured
    raise RuntimeError(msg)
RuntimeError: One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [module_name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)

vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start                                                 [FAILED]

Expected results:
The postinstall script succeeds, the libvirt module is configured, and vdsmd starts successfully.

Additional info:
Hi,

I have reproduced the report. I'd say the failure is in the vdsm spec's upgrade scheme. Today, %postun (from vdsm-4.14.7-3) contains:

<snip>
if ! %{_bindir}/vdsm-tool is-configured --module libvirt >/dev/null 2>&1; then
    if ! %{_bindir}/vdsm-tool configure --module libvirt --force \
            >/dev/null 2>&1; then
        # fallback to vdsmd reconfigure api - This change may be removed
        # when vdsm won't support any downgrade\upgrade to versions that
        # don't include vdsm-tool configure api (vdsm <= 3.3)
        for f in '/lib/systemd/systemd-vdsmd' '/etc/init.d/vdsmd'; do
            if [ -f "${f}" ]; then
                "${f}" reconfigure >/dev/null 2>&1 || :
            fi
        done
    fi
fi
</snip>

During the upgrade from vdsm-4.10.2-25.0 to vdsm-4.14.7-3, the %postun section that gets executed is the one from the old package (vdsm-4.10.2-25.0), so this logic never runs: the host keeps the old header/footer in the conf files, and libvirt/vdsm-tool complain.

I have moved the code to the %post section, guarded by `if [ "$1" = "2" ]` (i.e. on upgrade), and the upgrade worked as expected (a sketch follows below). Dan/Yaniv, please let me know your thoughts before I send the patch.

@Julio, thanks a lot for providing the machine for the tests. If you don't mind keeping it until we close this bug, that would be nice.
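To make the scriptlet semantics concrete: on upgrade, RPM runs the *new* package's %post with $1 = 2 (1 on a fresh install), while it is the *old* package's %postun that runs with $1 = 1 - which is exactly why the 4.10.2 %postun, which predates this code, never reconfigures anything. A minimal sketch of the %post variant described above (illustrative only, not the submitted patch):

~~~
%post
# $1 is the number of package instances present after this operation:
# 1 on first install, 2 on upgrade.
if [ "$1" = "2" ]; then
    # Reconfigure with the *new* package's tooling during upgrade,
    # instead of relying on the old package's %postun.
    if ! %{_bindir}/vdsm-tool is-configured --module libvirt >/dev/null 2>&1; then
        %{_bindir}/vdsm-tool configure --module libvirt --force >/dev/null 2>&1 || :
    fi
fi
~~~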
(In reply to Douglas Schilling Landgraf from comment #4)
> @Julio, thanks a lot for providing the machine for the tests. If you
> don't mind keeping it until we close this bug, that would be nice.

Sure.
Dan, do we support direct update from 3.2 to 3.4? A lot has changed in the vdsm-tool/reconfigure area, so we should know whether we need to stay backward compatible with 3.2 as well (the reconfigure commands in the old spec's %postun).
We attempted to support that, and this breakage was not intentional:

    # fallback to vdsmd reconfigure api - This change may be removed
    # when vdsm won't support any downgrade\upgrade to versions that
    # don't include vdsm-tool configure api (vdsm <= 3.3)

Yaniv, I believe we agreed to break only a specific downgrade path - do you recall?
I'm only sure that 3.3<->3.4 and 3.4<->3.5 work properly, and almost certain that 3.3<->3.5 works fine as well. But I never tried to test or verify how 3.2->3.3 works; 3.2 uses a different approach to reconfiguration (vdsm does it during each restart). If we have to handle it, I assume we can hack 3.3 so that 3.2 to 3.3 works, but personally I'm against that, and it would require 3.3 backports. I also wouldn't hack 3.4 to allow upgrading from 3.2 to 3.4.
(In reply to Yaniv Bronhaim from comment #8)
> But I never tried to test or verify how 3.2->3.3 works.

This is pretty bad; we should, as rhev-3.3 is still supported. Might it accidentally work, since there was no conf change?
And indeed, bug 1080107 exists. We should solve it.
The bug in that case is the failure to stop the sanlock service ("Waiting for sanlock (2081) to stop: [FAILED]"). This happens because vdsm was upgraded while it was connected to a storage domain: sanlock still holds leases, so the service cannot be stopped that way. We fixed it as part of http://gerrit.ovirt.org/28007.

Do we want some kind of backport of that fix in 3.3, or can we suggest putting the host in maintenance before performing the upgrade?
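As an aside, an administrator can check whether sanlock is still holding leases before attempting a live upgrade. A hedged example - `sanlock client status` is the standard sanlock query command, though its exact output varies by version:

~~~
# Lockspace ("s ...") and resource ("r ...") lines indicate held leases;
# if any appear, stopping sanlock is likely to fail as in the log above.
sanlock client status
~~~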
Would an upgrade-while-sanlock-is-taken to a version with "Setting enum for isconfigured" succeed? If so, we'd better backport it, to allow a more robust upgrade path. Moving to maintenance before upgrade is always recommended, but not always possible. We strove very hard to allow live upgrade of vdsm, and should keep that working, as much as we can. Julio, can you confirm that the `yum upgrade` has taken place while the host was operational?
(In reply to Dan Kenigsberg from comment #12)
> Julio, can you confirm that the `yum upgrade` has taken place while the
> host was operational?

I'm afraid not: the host is in maintenance mode when the problem occurs.

I've re-provisioned the reproducer host from comment #1 so the problem can be reproduced again. After running "yum update", run "yum history undo <x>" to undo the upgrade and be able to reproduce it once more. The host is already in maintenance mode.

As a side note, this becomes more of an issue because the 'vdsm' packages for 3.3 and 3.4 are in the same channel, so customers still running 3.2 who run "yum update" (as the documentation suggests) will try to update directly to 3.4 and hit this.
You are right that upgrading 3.2 to 3.4 requires an additional run of "vdsm-tool configure", but the exception you mentioned in the description is not due to that.

Other than that, we did backport the enum to 3.4, and I think that introduced a bug in the following loop, where we queue the modules to configure:

    sys.stdout.write("\nChecking configuration status...\n\n")
    for c in __configurers:
        if c.getName() in args.modules:
            override = args.force and (c.isconfigured != CONFIGURED)   <----
            if not override and not c.validate():
                raise RuntimeError(
                    "Configuration of %s is invalid" % c.getName()
                )
            if override:
                configurer_to_trigger.append(c)

As it stands, we trigger configure only when override is true, which happens only when --force is passed to the call. It should also trigger whenever c.isconfigured != CONFIGURED. I suspect that this is the bug here, and that the version you use doesn't pass the --force flag.
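To make the broken condition concrete, here is a minimal self-contained illustration of the fix being described (names mirror the snippet above; the actual patch is the gerrit change referenced in the next comment):

~~~
# Python sketch: configure must be queued whenever the module is NOT
# configured, even without --force; --force merely widens the trigger.
CONFIGURED, NOT_CONFIGURED = range(2)

def should_trigger(force, state):
    # Backported 3.4 code (buggy): force and (state != CONFIGURED)
    # Intended behavior:           force or  (state != CONFIGURED)
    return force or state != CONFIGURED

# Without --force, an unconfigured module must still be queued:
assert should_trigger(force=False, state=NOT_CONFIGURED)
# The buggy form never fires without --force:
assert not (False and (NOT_CONFIGURED != CONFIGURED))
~~~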
Let's just be sure where it reproduces. I used latest 3.2 and latest 3.4, which is 4.10.2 to 4.14.10.

I posted http://gerrit.ovirt.org/#/c/29992/, as this is a bug anyway. But to be sure it's related to the case you raise, let's first check the versions in your reproduction.

I found another issue: /etc/init.d/vdsmd still expects a boolean value from is-configured, and "force" is not passed to the reconfigure call in the spec (a possible fix is sketched below):

    if ! %{_bindir}/vdsm-tool is-configured --module libvirt >/dev/null 2>&1; then
        if ! %{_bindir}/vdsm-tool configure --module libvirt --force \
                >/dev/null 2>&1; then
            # fallback to vdsmd reconfigure api - This change may be removed
            # when vdsm won't support any downgrade\upgrade to versions that
            # don't include vdsm-tool configure api (vdsm <= 3.3)
            for f in '/lib/systemd/systemd-vdsmd' '/etc/init.d/vdsmd'; do
                if [ -f "${f}" ]; then
                    "${f}" reconfigure >/dev/null 2>&1 || :
                fi
            done
        fi
    fi

    reconfigure() {
        if [ "${1}" = "force" ] || ! "$VDSM_TOOL" is-configured; then
            "$VDSM_TOOL" configure "--force"
        fi
    }
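If the fallback is kept, one possible shape of the fix (an assumption on my part, not necessarily the shipped patch) is to pass "force" through, so reconfigure() does not depend on the changed exit semantics of is-configured:

~~~
for f in '/lib/systemd/systemd-vdsmd' '/etc/init.d/vdsmd'; do
    if [ -f "${f}" ]; then
        # Pass "force" so reconfigure() skips its own is-configured
        # check, mirroring the --force used in the vdsm-tool call above.
        "${f}" reconfigure force >/dev/null 2>&1 || :
    fi
done
~~~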
(In reply to Yaniv Bronhaim from comment #15)
> Let's just be sure where it reproduces. I used latest 3.2 and latest 3.4,
> which is 4.10.2 to 4.14.10.

I set the reproducer back to its previous status (vdsm-4.10.2-25.0.el6ev). I understand that the upgrade should work even from a _non_-latest 3.2, since the documentation states that "yum update" is all customers need to update hosts.
Was the error the same as you specified in the description (sanlock failed to stop)? Or could you simply not start vdsm after the upgrade without running vdsm-tool configure?

Can you please re-verify the upgrade with the last fix (http://gerrit.ovirt.org/#/c/29992/)? It works for me.
Hi Yaniv,

- The postinstall script failed during package installation (while "yum update" was installing the new packages):

[...]
  Installing : 1:libguestfs-tools-c-1.20.11-2.el6.x86_64                 42/58
  Updating   : vdsm-4.14.7-3.el6ev.x86_64                                43/58
warning: /etc/vdsm/vdsm.conf created as /etc/vdsm/vdsm.conf.rpmnew

Checking configuration status...

Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 230, in configure
    service.service_stop(s)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 370, in service_stop
    return _runAlts(_srvStopAlts, srvName)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 351, in _runAlts
    "%s failed" % alt.func_name, out, err)
vdsm.tool.service.ServiceOperationError: ServiceOperationError: _serviceStop failed
Sending stop signal sanlock (2081): [  OK  ]
Waiting for sanlock (2081) to stop:[FAILED]
  Cleanup    : vdsm-4.10.2-25.0.el6ev.x86_64                             44/58

- After that, vdsm was stopped and a manual attempt to start it failed:

# /etc/init.d/vdsmd start
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service is already configured

Modules libvirt are not configured
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 273, in isconfigured
    raise RuntimeError(msg)
RuntimeError: One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [module_name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)

vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start                                                 [FAILED]

The upgrade was from vdsm-4.10.2-25.0.el6ev to vdsm-4.14.7-3.el6ev.

I've re-tested with vdsm/4.14.10/8.git0e027d8.el6ev and can't reproduce the problem any more - but now I can't reproduce it with vdsm-4.14.7-3.el6ev either (even after re-creating the reproducer).
OK, there was a bug in that area; it is solved by http://gerrit.ovirt.org/#/c/29992/. Thanks for the information.
yum update went OK, but then an error showed up:

# /etc/init.d/vdsmd start
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet

Modules libvirt are not configured
Error: One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [module_name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)

vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start

# rpm -qa vdsm\* libvirt\*
libvirt-client-0.10.2-29.el6_5.10.x86_64
libvirt-python-0.10.2-29.el6_5.10.x86_64
vdsm-python-zombiereaper-4.16.1-6.gita4a4614.el6.noarch
vdsm-xmlrpc-4.16.1-6.gita4a4614.el6.noarch
vdsm-4.16.1-6.gita4a4614.el6.x86_64
libvirt-0.10.2-29.el6_5.10.x86_64
vdsm-python-4.16.1-6.gita4a4614.el6.x86_64
vdsm-jsonrpc-4.16.1-6.gita4a4614.el6.noarch
vdsm-cli-4.16.1-6.gita4a4614.el6.noarch
libvirt-lock-sanlock-0.10.2-29.el6_5.10.x86_64
vdsm-yajsonrpc-4.16.1-6.gita4a4614.el6.noarch
# grep vdsm /var/log/yum.log
Aug 06 11:53:14 Installed: vdsm-python-4.10.2-30.1.el6ev.x86_64
Aug 06 11:53:14 Installed: vdsm-xmlrpc-4.10.2-30.1.el6ev.noarch
Aug 06 11:53:19 Installed: vdsm-cli-4.10.2-30.1.el6ev.noarch
Aug 06 11:54:36 Installed: vdsm-4.10.2-30.1.el6ev.x86_64
Aug 06 13:50:09 Installed: vdsm-python-zombiereaper-4.16.1-6.gita4a4614.el6.noarch
Aug 06 13:50:20 Updated: vdsm-python-4.16.1-6.gita4a4614.el6.x86_64
Aug 06 13:50:20 Updated: vdsm-xmlrpc-4.16.1-6.gita4a4614.el6.noarch
Aug 06 13:50:28 Installed: vdsm-yajsonrpc-4.16.1-6.gita4a4614.el6.noarch
Aug 06 13:50:28 Installed: vdsm-jsonrpc-4.16.1-6.gita4a4614.el6.noarch
Aug 06 13:50:39 Updated: vdsm-4.16.1-6.gita4a4614.el6.x86_64
Aug 06 13:50:40 Updated: vdsm-cli-4.16.1-6.gita4a4614.el6.noarch
OK, now I get it; after a bit of time I figured it out :/

We added the reconfigure logic to the %postun section of the spec. But %postun runs when the package gets uninstalled, so if we upgrade from any version that already carries this logic, we reconfigure as expected. In 4.10.2, however, this part doesn't exist at all, so uninstalling it doesn't run the reconfigure for us.

You're right that you see the prints "Checking configuration status... Running configure...", but those are not related - and a bit of a lie; I'll swallow those prints later on. They come from the sanlock configure we perform on each installation (from the new package's spec, as it is in the %post section).

To summarize: an upgrade from ovirt-3.2 to >= ovirt-3.3 requires a manual "vdsm-tool configure --force" call (which host-deploy does for you), but upgrades between >= ovirt-3.3 versions work normally; the manual recovery steps are sketched below. I don't see how we can fix this nicely; I suggest keeping it as is. Danken, agree?
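In practical terms, the manual recovery for an affected 3.2 -> >= 3.3 host boils down to the commands already quoted in the error output above (with the host in maintenance first):

~~~
# Reconfigure the modules that the old package's %postun never handled,
# then start vdsm again.
vdsm-tool configure --force
/etc/init.d/vdsmd start
~~~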
Thinking about it again, our approach to handling the upgrade phase during installation is wrong. VDSM should not do anything on the system without manual or third-party intervention, except starting itself. If the system is not configured, vdsm should alert the user during startup to rerun "vdsm-tool configure" or whatever is required. This is what happens on a manual rpm installation, or is done automatically when using host-deploy.

The same should apply when running an upgrade. Yum won't restart the service after the upgrade, and if VDSM needs to be reconfigured afterwards, it should alert about it during the start (or restart) of the service, and the user should run the configure call manually. Using the ovirt-engine web UI, the user can put the host in maintenance and reinstall it, which initiates host-deploy again and reruns the configure automatically after installation. Either way, the configure is not handled by VDSM directly.

The [1] fix is wrong as far as I can see now, unless we were trying to preserve some behavior that I forgot. Dan, did I forget something?

[1] http://gerrit.ovirt.org/#/c/20394/4/vdsm.spec.in,cm
For the record, I like the approach taken by http://gerrit.ovirt.org/#/c/31561/16/init/vdsmd_init_common.sh.in
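Roughly, that approach makes the init flow verify the configuration and refuse to start with actionable instructions, instead of reconfiguring behind the user's back. A hedged sketch (function name and message are illustrative; the real code is in the gerrit change above):

~~~
# Illustrative init-time task in the spirit of vdsmd_init_common.sh.
task_check_is_configured() {
    if ! "${VDSM_TOOL}" is-configured >/dev/null 2>&1; then
        echo "vdsm is not configured; run 'vdsm-tool configure [--force]'" >&2
        echo "and start the vdsmd service again." >&2
        return 1
    fi
}
~~~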
pushing that forward.
There's an issue: after the upgrade and activation in the Admin Portal, the host ends up in 'Non Responsive' state, although it should be Up.

- during the upgrade...
~~~
...
  Updating   : vdsm-4.16.7.2-1.el6ev.x86_64                              40/44
warning: /etc/vdsm/vdsm.conf created as /etc/vdsm/vdsm.conf.rpmnew
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 224, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 210, in main
    except vdsm.tool.UsageError as e:
AttributeError: 'module' object has no attribute 'UsageError'
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 224, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 210, in main
    except vdsm.tool.UsageError as e:
AttributeError: 'module' object has no attribute 'UsageError'
  Cleanup    : vdsm-4.10.2-30.1.el6ev.x86_64
...
~~~

- after the host's activation
~~~
...
2014-11-04 11:01:07,009 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (pool-3-thread-49) [4b5cd67a] START, ActivateVdsVDSCommand(HostName = dell-r210ii-03, HostId = b0ea2412-dd99-4482-9c61-b17222b4e51a), log id: 37a0414c
2014-11-04 11:01:07,018 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) [4b5cd67a] Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
...
~~~

- vdsmd on the host is not running after the update
~~~
# /etc/init.d/vdsmd status
VDS daemon is not running
~~~

- starting vdsmd manually makes the host become Up
~~~
# /etc/init.d/vdsmd start
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
Checking configuration status...

libvirt is not configured for vdsm yet

Running configure...
Reconfiguration of libvirt is done.

Done configuring modules to VDSM.
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running unified_network_persistence_upgrade
vdsm: Running restore_nets
libvirt: Network Driver error : Network not found: no network with matching name 'vdsm-rhevm'
vdsm: Running upgrade_300_nets
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]
~~~

IMO this is a fail; manual intervention has never been needed to switch a host from maintenance back to the Up state.
Created attachment 953523 [details]
upgrade_output.txt, engine.log
Removing the call to sebool-setup from the spec fixes it: http://gerrit.ovirt.org/#/c/34445/13/vdsm.spec.in,cm - the tag you checked doesn't include this change.

The problem is that the UsageError exception is raised in %post while the new vdsm-tool code is executing, but "import vdsm.tool" picks up the old vdsm code. We need to remove all vdsm-tool calls from the %post stage to avoid such issues later on. The above patch probably fixes what you actually see now, as the setboolean call started throwing an exception when selinux is disabled in the latest coreutils version.

Can you confirm and close, or do you think I missed something?
I'll recheck with the version specified in comment #44.
Updated from vdsm-4.10.2-30.1.el6ev.x86_64 -> vdsm-4.16.7.3-1.el6ev.x86_64.

Generally OK, but there are some events which I'll try to reproduce...

2014-Nov-10, 16:11  Status of host dell-r210ii-03 was set to Up.  oVirt
2014-Nov-10, 16:11  Host dell-r210ii-03 is initializing. Message: Recovering from crash or Initializing  oVirt
2014-Nov-10, 16:11  Host dell-r210ii-03 is non responsive.  oVirt
2014-Nov-10, 16:11  Host dell-r210ii-03 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.  oVirt
2014-Nov-10, 16:11  Host dell-r210ii-03 was activated by admin.  28f023af  oVirt
2014-Nov-10, 16:00  Host dell-r210ii-03 was switched to Maintenance mode by admin.  228e4e71  oVirt
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html