Bug 1206139 - RHEVH Host is always on non-responsive status after upgrade
Summary: RHEVH Host is always on non-responsive status after upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Ryan Barry
QA Contact: cshao
URL:
Whiteboard:
Depends On: 1270177 1271273 1275956
Blocks: 1206645
 
Reported: 2015-03-26 12:29 UTC by Ying Cui
Modified: 2016-03-09 14:20 UTC
CC List: 14 users

Fixed In Version: ovirt-node-3.3.0-0.4.20150906git14a6024.el7ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1206645
Environment:
Last Closed: 2016-03-09 14:20:04 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
sosreport from node side (6.36 MB, application/x-xz)
2015-03-26 12:31 UTC, Ying Cui
varlog.tar.gz (308.56 KB, application/x-gzip)
2015-03-26 12:31 UTC, Ying Cui
nonresponse.png (34.00 KB, image/png)
2015-03-26 12:32 UTC, Ying Cui
engine.log (7.46 MB, text/plain)
2015-03-26 12:33 UTC, Ying Cui
1206139.tar.gz (7.18 MB, application/x-gzip)
2015-03-27 03:52 UTC, cshao


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0378 0 normal SHIPPED_LIVE ovirt-node bug fix and enhancement update for RHEV 3.6 2016-03-09 19:06:36 UTC
oVirt gerrit 39212 0 None MERGED logger: replace .debug statement for .warning Never
oVirt gerrit 39233 0 None ABANDONED spec: Enable all required sebooleans Never
oVirt gerrit 39248 0 master ABANDONED init: Move on-boot hooks to a reachable place Never
oVirt gerrit 39264 0 ovirt-3.5 MERGED init: Move on-boot hooks to a reachable place Never

Description Ying Cui 2015-03-26 12:29:46 UTC
Description:
After rhevh 6.5 20150115 is upgraded to rhevh 6.6 20150325.0.el6ev, the rhevh 6.6 host stays in non-responsive status in the rhevm portal.

Test version:
# rpm -q ovirt-node vdsm libvirt vdsm-reg kernel 
ovirt-node-3.2.1-12.el6.noarch
vdsm-4.16.8.1-6.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
vdsm-reg-4.16.8.1-6.el6ev.noarch
kernel-2.6.32-504.8.1.el6.x86_64
# cat /etc/system-release
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150325.0.el6ev)
rhevm 3.5.1-0.2.el6ev

Test steps:
1. Installed RHEVH 6.5 20150115 with a [bond + vlan] network, then added the rhevh host via the rhevm portal into a 3.4 compatibility version cluster on rhevm 3.5.1-0.2.el6ev.
2. Created 2 VMs on it successfully.
3. Shut down the 2 VMs, then put the rhevh 6.5 host into maintenance.
4. Upgraded rhevh 6.5 via the rhevm portal to rhev-hypervisor6-6.6-20150325.0.iso, which is from bug 1194068#c36.
5. RHEV-H 6.6 rebooted automatically after the upgrade and started successfully.
6. The RHEV-H 6.6 network is up and can ping rhevm successfully.
7. BUT rhevh 6.6 is in non-responsive status in the RHEVM portal.
8. Checked engine.log on rhevm; it shows the errors below.
9. Tried to put the rhevh 6.6 host into maintenance and then activate it again, but the host cannot be activated.

Actual result:
The rhevh 6.6 host is always in non-responsive status in the rhevm portal after the upgrade.

Expected result:
The rhevh 6.6 host comes UP after the upgrade from rhevh 6.5.

<snip>
2015-03-26 12:37:03,288 ERROR [org.ovirt.engine.core.bll.SshSoftFencingCommand] (org.ovirt.thread.pool-7-thread-32) [41eeb795] SSH Soft Fencing command failed on host 192.168.22.162: Command returned failure code 1 during SSH session 'root.22.162'
Stdout: 
Stderr: Error:  ServiceOperationError: _serviceRestart failed
Shutting down vdsm daemon: 
[FAILED]
vdsm watchdog stop[  OK  ]
vdsm: not running[FAILED]
vdsm: Running run_final_hooks
vdsm stop[  OK  ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
Modules sebool are not configured
 vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start[FAILED]

Error:  

One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [--module module-name]'.

If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)
Stacktrace: java.io.IOException: Command returned failure code 1 during SSH session 'root.22.162': java.io.IOException: Command returned failure code 1 during SSH session 'root.22.162'
    at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:527) [uutils.jar:]
    at org.ovirt.engine.core.bll.SshSoftFencingCommand.executeSshSoftFencingCommand(SshSoftFencingCommand.java:91) [bll.jar:]
    at org.ovirt.engine.core.bll.SshSoftFencingCommand.executeCommand(SshSoftFencingCommand.java:53) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1193) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1332) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1957) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1356) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:353) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:435) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:416) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:622) [bll.jar:]
    at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) [:1.7.0_45]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
</snip>

Comment 1 Ying Cui 2015-03-26 12:31:26 UTC
Created attachment 1006781 [details]
sosreport from node side

Comment 2 Ying Cui 2015-03-26 12:31:56 UTC
Created attachment 1006782 [details]
varlog.tar.gz

Comment 3 Ying Cui 2015-03-26 12:32:18 UTC
Created attachment 1006783 [details]
nonresponse.png

Comment 4 Ying Cui 2015-03-26 12:33:17 UTC
Created attachment 1006784 [details]
engine.log

Comment 7 Fabian Deutsch 2015-03-26 13:09:12 UTC
Is this issue fixed when you use permissive mode, Ying?

Comment 8 Ying Cui 2015-03-26 13:35:24 UTC
(In reply to Fabian Deutsch from comment #7)
> Is this issue fixed when you use permissive mode, Ying?

No, it isn't.
Booted rhevh 6.6 with enforcing=0; after rhevh 6.6 started, put the host into maintenance in rhevm and then activated it again, but it is still non-responsive.

Comment 9 Fabian Deutsch 2015-03-26 14:11:21 UTC
Thanks Ying.

During the upgrade it looks like vdsmd is getting restarted, but the sebool module check fails when verifying whether the correct sebooleans are set:

2015-03-26 12:37:02,235 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] FINISH, SetVdsStatusVDSCommand, log id: 3f0cb07
2015-03-26 12:37:02,244 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Activate finished. Lock released. Monitoring can run now for host 192.168.22.162 from data-center vlan_1194068
2015-03-26 12:37:02,252 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Correlation ID: 1d149eb8, Job ID: 34d865ad-3c76-497c-a680-91f290d387df, Call Stack: null, Custom Event ID: -1, Message: Host 192.168.22.162 was activated by admin@internal.
2015-03-26 12:37:02,260 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-7-thread-13) [1d149eb8] Lock freed to object EngineLock [exclusiveLocks= key: e9be7942-0987-4364-80e0-8d811c8235c5 value: VDS
, sharedLocks= ]
2015-03-26 12:37:03,288 ERROR [org.ovirt.engine.core.bll.SshSoftFencingCommand] (org.ovirt.thread.pool-7-thread-32) [41eeb795] SSH Soft Fencing command failed on host 192.168.22.162: Command returned failure code 1 during SSH session 'root.22.162'
Stdout: 
Stderr: Error:  ServiceOperationError: _serviceRestart failed
Shutting down vdsm daemon: 
[FAILED]
vdsm watchdog stop[  OK  ]
vdsm: not running[FAILED]
vdsm: Running run_final_hooks
vdsm stop[  OK  ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
Modules sebool are not configured
 vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start[FAILED]

Error:  

One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [--module module-name]'.

If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)


The module error is right: the booleans are not set.

The problem is: even if we set the booleans in the upgraded image, they will only be set on the next boot, not at the time the check is run.

My question is: Why do we only see this now, and not earlier?

Comment 10 Fabian Deutsch 2015-03-26 14:14:14 UTC
Oved, do you know of any recent change in that area?

Dan/Oved, should Engine ignore this failure on the RHEV-H case after upgrade,
or should vdsm ignore this failure on RHEV-H in general?

The right approach IMO would be for Engine to recognize that this check will always fail on RHEV-H (if the sebools on the old image differ from the bools on the new image) and therefore not do this check.

Comment 11 Oved Ourfali 2015-03-26 14:24:38 UTC
Why would we ignore it?
Yaniv - can you take a look?

Comment 12 Fabian Deutsch 2015-03-26 14:36:57 UTC
My thinking was actually wrong.

When an image is deployed, some part restarts vdsmd, and that vdsmd service is from the "old" (current) image, so the problem I laid out in comment 10 is not valid.

Comment 13 Dan Kenigsberg 2015-03-26 14:56:35 UTC
Fabian, I understand that comment 10 is not valid, but I do not understand what you DO think about this bug.
If the value of the sebooleans is not persisted across upgrade/boot, then the node should run a conditional `is-configured && vdsm-tool --module sebool configure` after upgrade/boot.

Comment 14 Fabian Deutsch 2015-03-26 15:24:38 UTC
We do have such a hook:
https://gerrit.ovirt.org/gitweb?p=ovirt-node-plugin-vdsm.git;a=blob;f=hooks/on-boot/02-vdsm-sebool-config;h=60a203bc47d07453f0a5a550fd29680d2d534a0d;hb=HEAD

It is run on every boot.

But in the sosreport from node in comment 1 I see that many sebooleans are set incorrectly:

virt_use_fusefs                             off
virt_use_nfs                                off
virt_use_samba                              off
virt_use_sanlock                            off
sanlock_use_fusefs                          off
sanlock_use_nfs                             off
sanlock_use_samba                           off

These should - AFAIK - be set to "on"; at least, that is what the vdsm sebool configurator does.

Comment 17 cshao 2015-03-27 03:52:58 UTC
Created attachment 1007090 [details]
1206139.tar.gz

Comment 18 Ryan Barry 2015-03-27 04:46:30 UTC
I haven't looked at the logs yet, but is this a regression of the ONBOOT issue from a few days ago?

Fabian -

It's quite late here and I won't get to look at this until the morning, but I'll ensure the hooks are running. The fact that all the sebooleans looked good this afternoon is a good indicator that they are, but this also means that the onboot issue may have a different fix than moving the hooks...

Comment 19 Ying Cui 2015-03-27 09:34:13 UTC
No such issue on the RHEVH 6.6 20150128 GA build (ovirt-node-3.2.1-6.el6.noarch, vdsm-4.16.8.1-6.el6ev.x86_64).
And no such issue on the RHEVH 6.6 20150312 build (ovirt-node-3.2.1-10.el6.noarch, vdsm-4.16.12-2.el6ev.x86_64).

Comment 21 Fabian Deutsch 2015-03-27 10:10:28 UTC
I believe that we are seeing bug 1174611 in comment 16 again, because the vdsm in the image from comment 15 is vdsm-4.16.8.1-16, which does not contain the fix for bug 1174611; the fix was only introduced with vdsm-4.1.6.11.

I'll rebuild a new scratch iso with the latest vdsm (from vt14.1).

Comment 23 Ying Cui 2015-03-30 07:42:19 UTC
See comment 22: this bug was already cloned to 3.5.z as bug #1206645, so I am removing the 3.5.1 tracker bug 1193058 from this bug.

Comment 25 yileye 2015-11-19 07:45:57 UTC
I cannot verify this bug because it is blocked by two bugs:
Bug 1271273 - Failed to add RHEV-H 3.5.z-7.1 host to RHEV-M 3.6 with 3.5 cluster due to missing ovirtmgmt network
Bug 1275956 - Broken upgrade / In Boot process show IOError info after upgrading RHEVH via TUI/AUTO/RHEVM, the original password is not available to login to new RHEV-H

Comment 27 cshao 2016-01-12 06:40:21 UTC
Still blocked by bug 1271273; I will verify this bug after 1271273 is fixed.

Comment 28 cshao 2016-01-26 11:21:56 UTC
Test version:
rhev-hypervisor7-7.2-20151129.1 +  ovirt-node-3.2.3-29.el7.noarch
rhev-hypervisor7-7.2-20160120.0 +  ovirt-node-3.6.1-3.0.el7ev.noarch
RHEV-M 3.6 (rhevm-3.6.2.6-0.1.el6)

Test steps:
1. Installed RHEV-H rhev-hypervisor7-7.2-20151129.1 with a [bond + vlan] network.
2. Added the rhevh host via rhevm 3.6 with 3.5 compatibility version.
3. Created 2 VMs on it successfully.
4. Shut down the 2 VMs, then put the rhevh 7.2 host into maintenance.
5. Upgraded rhev-h 7.2 via the rhevm portal to rhev-hypervisor7-7.2-20160120.0.

Test result:
1. RHEV-H 7.2 rebooted automatically after the upgrade and started successfully.
2. The RHEV-H 7.2 network status shows as: connected to RHEVM.
3. RHEV-H comes up on the RHEV-M side.

So the bug is fixed; changing bug status to VERIFIED.

Comment 30 errata-xmlrpc 2016-03-09 14:20:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0378.html

