Bug 1467185

Summary: clicking on upgrade leaves the ovirt-node in install failed state

Product: [oVirt] ovirt-node
Component: Installation & Update
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: RamaKasturi <knarra>
Assignee: Ryan Barry <rbarry>
QA Contact: Huijuan Zhao <huzhao>
CC: bugs, cshao, dguo, huzhao, jiawu, knarra, qiyuan, rbarry, sbonazzo, yaniwang, ycui, yzhao
Flags: rbarry: needinfo-
Target Milestone: ---
Target Release: ---
oVirt Team: Node
Type: Bug
Last Closed: 2017-08-01 09:47:08 UTC

Attachments:
  - Attaching screenshot for the event messages in UI

Description RamaKasturi 2017-07-03 06:48:43 UTC
Description of problem:
Installed the latest released version of the oVirt node, 4.1.2. An upgrade button is shown next to the host in the UI, and clicking it leaves the host in the 'Install Failed' state.

The attached screenshot shows the event messages seen during the upgrade process.

Version-Release number of selected component (if applicable):
RHVM - 4.1.2.3-0.1.el7
RHVH - 4.1-0.20170522.0+1

How reproducible:
Always

Steps to Reproduce:
1. Install HC using the latest released bits of RHV-H (4.1.2).
2. Observe that an upgrade button is shown next to the host in the UI.
3. Click on upgrade.

Actual results:
Clicking on upgrade leaves the host in the 'Install Failed' state.

Expected results:
1) An upgrade icon should not be shown next to the host, since no upgrade is available for it.
2) Clicking the upgrade icon should either install the updated versions, if any, or report that there is nothing to upgrade, after which the icon should disappear.
3) Clicking on upgrade should not leave the host in the 'Install Failed' state.

Additional info:

Comment 1 RamaKasturi 2017-07-03 06:51:09 UTC
Created attachment 1293752 [details]
Attaching screenshot for the event messages in UI

Comment 2 RamaKasturi 2017-07-03 07:45:12 UTC
Copied /tmp/imgbased.log and sosreports from the machine where the issue occurred, along with engine.log, to the location below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/
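
For reference, a minimal way to regenerate an equivalent log bundle on an affected host (a sketch; the exact invocation used above is not recorded, and plugin selection may differ):

    # Collect a standard sosreport non-interactively on the host
    sosreport --batch
    # Grab the imgbased log separately, in case sos does not pick it up
    cp /tmp/imgbased.log /var/tmp/imgbased-$(hostname -s).log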

Comment 3 RamaKasturi 2017-07-03 10:10:41 UTC
The following exception is seen in the engine logs:
==================================================
2017-07-03 05:53:59,401-04 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [cfe76218-65da-4f0b-bc2c-343d3735f646] Error during deploy dialog: java.io.IOException: Unexpected connection termination
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:376) [otopi.jar:]
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:393) [otopi.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.threadMain(VdsDeployBase.java:304) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.lambda$new$0(VdsDeployBase.java:383) [bll.jar:]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]

2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] SSH error running command root.eng.blr.redhat.com:'umask 0077; MYTMP="$(TMPDIR="${OVIRT_TMPDIR}" mktemp -d -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; tar --warning=no-timestamp -C "${MYTMP}" -x &&  "${MYTMP}"/ovirt-host-mgmt DIALOG/dialect=str:machine DIALOG/customization=bool:True': SSH session hard timeout host 'root.eng.blr.redhat.com'
2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] Exception: javax.naming.TimeLimitExceededException: SSH session hard timeout host 'root.eng.blr.redhat.com'
        at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:475) [uutils.jar:]
        at org.ovirt.engine.core.uutils.ssh.SSHDialog.executeCommand(SSHDialog.java:317) [uutils.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.execute(VdsDeployBase.java:563) [bll.jar:]
        at org.ovirt.engine.core.bll.host.HostUpgradeManager.update(HostUpgradeManager.java:99) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand.executeCommand(UpgradeHostInternalCommand.java:72) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:511) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:756) [bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_131]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doMethodInterception(Jsr299BindingsInterceptor.java:80) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:93) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:63)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
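
The "SSH session hard timeout" above is an engine-side limit on how long the deploy/upgrade session may run. A minimal sketch for inspecting (and, as a diagnostic workaround, raising) it on the RHV-M machine, assuming this engine version exposes the usual key name (verify with engine-config --list):

    # Show the current hard timeout (seconds) for engine SSH sessions
    engine-config -g SSHInactivityHardTimeoutSeconds
    # Hypothetical workaround: raise it, then restart the engine to apply
    engine-config -s SSHInactivityHardTimeoutSeconds=3600
    systemctl restart ovirt-engine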

Comment 4 Ryan Barry 2017-07-03 10:25:30 UTC
There was an upgrade available, and it was successfully pulled and installed.

The host was upgraded to 4.1.2 async (for the stackguard CVE).

Since the 4.1.3 upgrade performs additional tasks and takes longer, we also requested https://bugzilla.redhat.com/show_bug.cgi?id=1455667

There are a couple of possible reasons for this:

* The disk was too slow
* The upgrade RPM took too long to retrieve
* RHV-M is not 4.1.3

Using a RHV-H 4.1.3 repo will also add timestamps to imgbased.log, which will give a better idea of what is taking longer, but for now, this is either a DUPLICATE or NOTABUG.
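
Once timestamps are present, a rough sketch for spotting the slow step, assuming log lines begin with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp (the real field layout should be checked first):

    # Print the gap (in seconds) before each log line, largest first;
    # ignores day rollovers, so treat the output as a rough guide only
    awk '{
        split($2, t, /[:,]/)
        now = t[1]*3600 + t[2]*60 + t[3]
        if (prev) printf "%6d  %s\n", now - prev, $0
        prev = now
    }' /tmp/imgbased.log | sort -rn | head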

Please upgrade RHV-M to 4.1.3 and retest.

Comment 5 RamaKasturi 2017-07-03 10:29:18 UTC
Thanks, Ryan, for the update. Let me try the same with RHV-M 4.1.3 and retest.

For RHHI customers who are on RHV-H 4.1.2, do we recommend upgrading RHV-M to 4.1.3 first and then upgrading the nodes, given the timeout issue being fixed in 4.1.3?

Comment 6 Ryan Barry 2017-07-03 10:31:05 UTC
RHV-H 4.1.3 also threads all of the upgrade operations instead of running them sequentially, so it will upgrade much faster in general.

Upgrading RHV-M to 4.1.3 first is probably a good idea, but not strictly necessary.

Comment 7 RamaKasturi 2017-07-04 09:48:45 UTC
I have updated the engine to the latest Red Hat Virtualization Manager version, 4.1.3.5-0.1.el7.

I had three nodes. The upgrade succeeded on the first node; on the second, the node's HA status went into 'Local Maintenance'; and upgrading the third node from the UI still hits the issue, failing with "processing stopped due to timeout."

I have copied /tmp/imgbased.log and the vdsm and engine logs to the location below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

Comment 8 Ryan Barry 2017-07-04 14:30:13 UTC
(In reply to RamaKasturi from comment #7)
> I had three nodes. The upgrade succeeded on the first node; on the second,
> the node's HA status went into 'Local Maintenance'; and upgrading the third
> node from the UI still hits the issue, failing with "processing stopped due
> to timeout."
> 
> I have copied /tmp/imgbased.log and the vdsm and engine logs to the location
> below.
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

I'm not sure whether going into "Local Maintenance" for HA is normal or not, since this is handled by ovirt-hosted-engine-agent/broker.

Unless you are also upgrading to RHVH 4.1.3 (which adds timestamps to the logs), it's not possible to say what's happening here.

It's likely that the disks on the other host were simply too slow. The last reports from virt QE show that mkfs is taking ~4 minutes on systems with slow disks (or a large number of disks in the VG), which is not something under our control.
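
A quick way to sanity-check raw write throughput on a suspect host (a sketch; the target path is arbitrary, and oflag=direct bypasses the page cache so the number reflects the disk rather than RAM):

    # Write 1 GiB with direct I/O and note the reported MB/s
    dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1024 oflag=direct
    rm -f /var/tmp/ddtest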

Comment 9 Ryan Barry 2017-07-18 09:47:09 UTC
Is this still reproducible?

Comment 10 RamaKasturi 2017-07-18 12:17:07 UTC
Hi Ryan,

I have not yet had time to retry and confirm whether this is still reproducible. However, I think a note should be added to the guide asking the user to upgrade the engine to 4.1.3 before proceeding with the upgrade of RHV-H.

Thanks,
kasturi