Description:
After upgrading RHEV-H 6.5 (20150115) to RHEV-H 6.6 (20150325.0.el6ev), the RHEV-H 6.6 host stays in Non Responsive status in the RHEV-M portal.

Test version:
# rpm -q ovirt-node vdsm libvirt vdsm-reg kernel
ovirt-node-3.2.1-12.el6.noarch
vdsm-4.16.8.1-6.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
vdsm-reg-4.16.8.1-6.el6ev.noarch
kernel-2.6.32-504.8.1.el6.x86_64
# cat /etc/system-release
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150325.0.el6ev)
rhevm 3.5.1-0.2.el6ev

Test steps:
1. Install RHEV-H 6.5 20150115 with a [bond + VLAN] network, and add the host via the RHEV-M portal into a 3.4 compatibility cluster in RHEV-M 3.5.1-0.2.el6ev.
2. Create two VMs on the host; both run successfully.
3. Shut down the two VMs, then put the RHEV-H 6.5 host into maintenance.
4. Upgrade RHEV-H 6.5 via the RHEV-M portal to rhev-hypervisor6-6.6-20150325.0.iso (from bug 1194068#c36).
5. RHEV-H 6.6 reboots automatically after the upgrade and starts successfully.
6. The RHEV-H 6.6 network is up; the host can ping RHEV-M successfully.
7. BUT the RHEV-H 6.6 host shows Non Responsive status in the RHEV-M portal.
8. engine.log on the RHEV-M side shows the errors below.
9. Putting the RHEV-H 6.6 host into maintenance and activating it again does not help; the host cannot be activated.

Actual result:
The RHEV-H 6.6 host stays in Non Responsive status in the RHEV-M portal after the upgrade.
Expected result:
The rhevh 6.6 host is UP after the upgrade from rhevh 6.5.

<snip>
2015-03-26 12:37:03,288 ERROR [org.ovirt.engine.core.bll.SshSoftFencingCommand] (org.ovirt.thread.pool-7-thread-32) [41eeb795] SSH Soft Fencing command failed on host 192.168.22.162: Command returned failure code 1 during SSH session 'root.22.162'
Stdout:
Stderr: Error: ServiceOperationError: _serviceRestart failed
Shutting down vdsm daemon: [FAILED]
vdsm watchdog stop [ OK ]
vdsm: not running [FAILED]
vdsm: Running run_final_hooks
vdsm stop [ OK ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
Modules sebool are not configured
vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start [FAILED]
Error: One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [--module module-name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it afterwards automatically to load the new configuration.)
Stacktrace:
java.io.IOException: Command returned failure code 1 during SSH session 'root.22.162'
    at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:527) [uutils.jar:]
    at org.ovirt.engine.core.bll.SshSoftFencingCommand.executeSshSoftFencingCommand(SshSoftFencingCommand.java:91) [bll.jar:]
    at org.ovirt.engine.core.bll.SshSoftFencingCommand.executeCommand(SshSoftFencingCommand.java:53) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1193) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1332) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1957) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1356) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:353) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:435) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:416) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:622) [bll.jar:]
    at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) [:1.7.0_45]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
</snip>
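The quoted error text points at vdsm-tool for recovery. A minimal sketch of that manual recovery step on the host, wrapped in a function; VDSM_TOOL is a hypothetical override variable introduced here only so the logic can be exercised without vdsm installed, and the underlying command is exactly the one quoted in the log:

```shell
#!/bin/sh
# Sketch of the manual recovery the error message suggests. VDSM_TOOL is a
# hypothetical override hook (for testing); on a real host it is vdsm-tool.
VDSM_TOOL="${VDSM_TOOL:-vdsm-tool}"

reconfigure_vdsm() {
    # Per the quoted error text, --force stops the affected services,
    # rewrites the module configuration, and restarts them.
    "$VDSM_TOOL" configure --force
}
```

On the real host this would be run as root, followed by `service vdsmd start`.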
Created attachment 1006781 [details] sosreport from node side
Created attachment 1006782 [details] varlog.tar.gz
Created attachment 1006783 [details] nonresponse.png
Created attachment 1006784 [details] engine.log
Is this issue fixed when you use permissive mode, Ying?
(In reply to Fabian Deutsch from comment #7)
> Is this issue fixed when you use permissive mode, Ying?

No, it isn't. I booted RHEV-H 6.6 with enforcing=0; after RHEV-H 6.6 started, I put the host into maintenance in RHEV-M and activated it again, but it is still non-responsive.
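For reference, enforcing=0 is a kernel command-line parameter, so whether a given boot actually requested permissive mode can be confirmed from /proc/cmdline. A small helper to make that check explicit (the helper name is hypothetical, just an illustration):

```shell
#!/bin/sh
# Returns success if the given kernel command line requests a permissive
# SELinux boot (enforcing=0), as used in the test above.
permissive_requested() {
    # Pad with spaces so the parameter matches at either end of the line.
    case " $1 " in
        *" enforcing=0 "*) return 0 ;;
        *)                 return 1 ;;
    esac
}
```

Usage on a live host: `permissive_requested "$(cat /proc/cmdline)" && echo "permissive boot requested"`.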
Thanks Ying.

During the upgrade it looks like vdsmd is getting restarted, but the sebool module fails when checking whether the correct sebooleans are set:

2015-03-26 12:37:02,235 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] FINISH, SetVdsStatusVDSCommand, log id: 3f0cb07
2015-03-26 12:37:02,244 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Activate finished. Lock released. Monitoring can run now for host 192.168.22.162 from data-center vlan_1194068
2015-03-26 12:37:02,252 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Correlation ID: 1d149eb8, Job ID: 34d865ad-3c76-497c-a680-91f290d387df, Call Stack: null, Custom Event ID: -1, Message: Host 192.168.22.162 was activated by admin@internal.
2015-03-26 12:37:02,260 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Lock freed to object EngineLock [exclusiveLocks= key: e9be7942-0987-4364-80e0-8d811c8235c5 value: VDS , sharedLocks= ]
2015-03-26 12:37:03,288 ERROR [org.ovirt.engine.core.bll.SshSoftFencingCommand] (org.ovirt.thread.pool-7-thread-32) [41eeb795] SSH Soft Fencing command failed on host 192.168.22.162: Command returned failure code 1 during SSH session 'root.22.162'
Stdout:
Stderr: Error: ServiceOperationError: _serviceRestart failed
Shutting down vdsm daemon: [FAILED]
vdsm watchdog stop [ OK ]
vdsm: not running [FAILED]
vdsm: Running run_final_hooks
vdsm stop [ OK ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
Modules sebool are not configured
vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start [FAILED]
Error: One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [--module module-name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it afterwards automatically to load the new configuration.)

The module check is correct; the booleans really are not set. The problem is: even if we set the booleans in the upgraded image, they will only be applied on the next boot, and not when the check is run.

My question is: why do we only see this now, and not earlier?
Oved, do you know of any recent change in that area?

Dan/Oved, should Engine ignore this failure in the RHEV-H case after an upgrade, or should vdsm ignore this failure on RHEV-H in general?

The right approach IMO would be that Engine recognizes that this check will always fail on RHEV-H (if the sebools on the old image differ from the bools on the new image) and should thus not do this check.
Why would we ignore it? Yaniv - can you take a look?
My thinking was actually wrong. When an image is deployed, some part is restarting vdsmd, and that vdsmd service is from the "old" or current image; thus the problem I laid out in comment 10 is not valid.
Fabian, I understand that comment 10 is not valid, but I do not understand what you DO think about this bug. If the value of the sebooleans is not persisted across upgrade/boot, then the node should run a conditional `is-configured && vdsm-tool --module sebool configure` after upgrade/boot.
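The conditional step described above could look roughly like the following on-boot snippet. This is a sketch, not the actual hook; VDSM_TOOL is a hypothetical override variable introduced here only so the logic can be exercised without vdsm installed:

```shell
#!/bin/sh
# Sketch of the proposed on-boot step: reconfigure the sebool module only
# when vdsm-tool reports it as not configured. VDSM_TOOL is a hypothetical
# override hook (for testing); on a real node it is vdsm-tool.
VDSM_TOOL="${VDSM_TOOL:-vdsm-tool}"

ensure_sebool_configured() {
    if ! "$VDSM_TOOL" is-configured --module sebool >/dev/null 2>&1; then
        "$VDSM_TOOL" configure --module sebool
    fi
}
```

The key point is the guard: configure is only invoked when the is-configured check fails, so a correctly configured node is left untouched.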
We do have such a hook:
https://gerrit.ovirt.org/gitweb?p=ovirt-node-plugin-vdsm.git;a=blob;f=hooks/on-boot/02-vdsm-sebool-config;h=60a203bc47d07453f0a5a550fd29680d2d534a0d;hb=HEAD
which is run on every boot.

But in the sosreport from the node in comment 1 I see that many sebooleans are set incorrectly:
virt_use_fusefs off
virt_use_nfs off
virt_use_samba off
virt_use_sanlock off
sanlock_use_fusefs off
sanlock_use_nfs off
sanlock_use_samba off

These should - AFAIK - be set to "on"; this is at least what the vdsm sebool configurator does.
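For the record, the incorrectly-set booleans above can be flipped back persistently with setsebool. A small helper that prints the fix commands for that list (the boolean list is taken from this comment; that "on" is the correct target is my assumption based on what the vdsm sebool configurator sets, as noted above):

```shell
#!/bin/sh
# Print the persistent setsebool commands for the booleans found "off" in
# the sosreport. The -P flag makes the change survive a reboot; the commands
# would be run as root on the host.
VDSM_BOOLS="virt_use_fusefs virt_use_nfs virt_use_samba virt_use_sanlock \
sanlock_use_fusefs sanlock_use_nfs sanlock_use_samba"

print_sebool_fixes() {
    for b in $VDSM_BOOLS; do
        printf 'setsebool -P %s on\n' "$b"
    done
}
```

The current values can be inspected first with `getsebool $VDSM_BOOLS`.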
Created attachment 1007090 [details] 1206139.tar.gz
I haven't looked at the logs yet, but is this a regression of the ONBOOT issue from a few days ago? Fabian - It's quite late here and I won't get to look at this until the morning, but I'll ensure the hooks are running. The fact that all the sebooleans looked good this afternoon is a good indicator that they are, but this also means that the onboot issue may have a different fix than moving the hooks...
No such issue on RHEVH 6.6 20150128 GA build (ovirt-node-3.2.1-6.el6.noarch, vdsm-4.16.8.1-6.el6ev.x86_64).
And no such issue on RHEVH 6.6 20150312 build (ovirt-node-3.2.1-10.el6.noarch, vdsm-4.16.12-2.el6ev.x86_64).
I believe that we are seeing bug 1174611 in comment 16 again: the vdsm in the image from comment 15 is vdsm-4.16.8.1-16, which does not contain the fix for bug 1174611; the fix was only introduced with vdsm-4.16.11. I'll rebuild a new scratch iso with the latest vdsm (from vt14.1).
See comment 22; this bug has already been cloned to 3.5.z as bug #1206645, so removing the 3.5.1 tracker bug 1193058 from this bug.
I can not verify this bug because it is blocked by two bugs:
Bug 1271273 - Failed to add RHEV-H 3.5.z-7.1 host to RHEV-M 3.6 with 3.5 cluster due to missing ovirtmgmt network
Bug 1275956 - Broken upgrade / In Boot process show IOError info after upgrading RHEVH via TUI/AUTO/RHEVM, the original password is not available to login to new RHEV-H
Still blocked by bug 1271273; I will verify this bug after 1271273 is fixed.
Test version:
rhev-hypervisor7-7.2-20151129.1 + ovirt-node-3.2.3-29.el7.noarch
rhev-hypervisor7-7.2-20160120.0 + ovirt-node-3.6.1-3.0.el7ev.noarch
RHEV-M 3.6 (rhevm-3.6.2.6-0.1.el6)

Test steps:
1. Install RHEV-H rhev-hypervisor7-7.2-20151129.1 with a [bond + VLAN] network.
2. Add the host via RHEV-M 3.6 into a 3.5 compatibility cluster.
3. Create VMs on the host; they run successfully.
4. Shut down the VMs, then put the RHEV-H 7.2 host into maintenance.
5. Upgrade RHEV-H 7.2 via the RHEV-M portal to rhev-hypervisor7-7.2-20160120.0.

Test result:
1. RHEV-H 7.2 reboots automatically after the upgrade and starts successfully.
2. The RHEV-H 7.2 network status shows as connected to RHEV-M.
3. The RHEV-H host comes Up on the RHEV-M side.

So the bug is fixed; changing bug status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0378.html