1467185 – clicking on upgrade leaves the ovirt-node in install failed state

Summary: clicking on upgrade leaves the ovirt-node in install failed state

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-node
Classification:	oVirt
Component:	Installation & Update
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ryan Barry
QA Contact:	Huijuan Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-07-03 06:48 UTC by RamaKasturi
Modified:	2017-08-01 09:47 UTC
CC List:	12 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-08-01 09:47:08 UTC
oVirt Team:	Node
Embargoed:
Dependent Products:
Flags:	rbarry: needinfo-

Attachments
Attaching screenshot for the event messages in UI (163.07 KB, image/png) 2017-07-03 06:51 UTC, RamaKasturi	no flags

Description RamaKasturi 2017-07-03 06:48:43 UTC

Description of problem:
I installed the latest released version of oVirt Node (4.1.2). An upgrade button is shown next to the host in the UI. Clicking it leaves the host in the 'Install Failed' state.

The attached screenshot shows the event messages seen during the upgrade process.

Version-Release number of selected component (if applicable):
RHVM - 4.1.2.3-0.1.el7
RHVH - 4.1-0.20170522.0+1

How reproducible:
Always

Steps to Reproduce:
1. Install HC using the latest released bits of RHV-H, which is 4.1.2.
2. An upgrade button is shown next to the host in the UI
3. Click on upgrade.

Actual results:
Clicking on upgrade leaves the host in the 'Install Failed' state.

Expected results:
1) An upgrade icon should not be shown next to the host when there is no upgrade available for it.
2) Clicking on the upgrade icon should either install the updated versions, if any, or report that there is nothing to upgrade, after which the icon should disappear.
3) Clicking on upgrade should not leave the host in the 'Install Failed' state.

Additional info:

Comment 1 RamaKasturi 2017-07-03 06:51:09 UTC

Created attachment 1293752
Attaching screenshot for the event messages in UI

Comment 2 RamaKasturi 2017-07-03 07:45:12 UTC

Copied /tmp/imgbased.log, sosreports from the machine where the issue occurred, and engine.log to the location below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

Comment 3 RamaKasturi 2017-07-03 10:10:41 UTC

The following exceptions are seen in the engine logs:
==================================================
2017-07-03 05:53:59,401-04 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [cfe76218-65da-4f0b-bc2c-343d3735f646] Error during deploy dialog: java.io.IOException: Unexpected connection termination
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:376) [otopi.jar:]
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:393) [otopi.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.threadMain(VdsDeployBase.java:304) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.lambda$new$0(VdsDeployBase.java:383) [bll.jar:]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]

2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] SSH error running command root.eng.blr.redhat.com:'umask 0077; MYTMP="$(TMPDIR="${OVIRT_TMPDIR}" mktemp -d -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; tar --warning=no-timestamp -C "${MYTMP}" -x &&  "${MYTMP}"/ovirt-host-mgmt DIALOG/dialect=str:machine DIALOG/customization=bool:True': SSH session hard timeout host 'root.eng.blr.redhat.com'
2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] Exception: javax.naming.TimeLimitExceededException: SSH session hard timeout host 'root.eng.blr.redhat.com'
        at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:475) [uutils.jar:]
        at org.ovirt.engine.core.uutils.ssh.SSHDialog.executeCommand(SSHDialog.java:317) [uutils.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.execute(VdsDeployBase.java:563) [bll.jar:]
        at org.ovirt.engine.core.bll.host.HostUpgradeManager.update(HostUpgradeManager.java:99) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand.executeCommand(UpgradeHostInternalCommand.java:72) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:511) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:756) [bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_131]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doMethodInterception(Jsr299BindingsInterceptor.java:80) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:93) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:63)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
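
Both ERROR entries above carry the same correlation ID (cfe76218-65da-4f0b-bc2c-343d3735f646), so collecting every ERROR entry per correlation ID is a quick way to see the whole failure for one upgrade attempt. The following is a minimal Python sketch, assuming engine.log entries follow the layout quoted above (timestamp, level, logger, thread, correlation ID, message). The regular expression and the function name errors_by_correlation are illustrative only and are not part of oVirt.

```python
import re
from datetime import datetime

# Layout assumed from the entries quoted in this comment:
# 2017-07-03 05:53:59,401-04 ERROR [logger] (thread) [correlation-id] message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})(?P<tz>[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\]\s+\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+(?P<msg>.*)$"
)


def errors_by_correlation(lines):
    """Group ERROR entries by correlation ID.

    Lines that do not match the assumed layout (for example the
    indented "at ..." stack-trace lines) are skipped.
    """
    grouped = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("level") != "ERROR":
            continue
        # The UTC offset is captured separately and not applied here.
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        entry = (timestamp, match.group("logger"), match.group("msg"))
        grouped.setdefault(match.group("corr"), []).append(entry)
    return grouped
```

Calling errors_by_correlation(open("engine.log")) returns a dictionary keyed by correlation ID, where each value is a list of (timestamp, logger, message) tuples in file order.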

Comment 4 Ryan Barry 2017-07-03 10:25:30 UTC

There was an upgrade available, and it was successfully pulled and installed.

The host was upgraded to 4.1.2 async (for the stackguard CVE)

Since the 4.1.3 upgrade performs additional tasks and takes longer, we also requested https://bugzilla.redhat.com/show_bug.cgi?id=1455667

There are a couple of possible reasons for the timeout:

* The disk was too slow
* The upgrade RPM took too long to retrieve
* RHV-M is not 4.1.3

Using a RHV-H 4.1.3 repo will also add timestamps to imgbased.log, which will give a better idea of what is taking longer, but for now, this is either a DUPLICATE or NOTABUG.

Please upgrade RHV-M to 4.1.3 and retest.
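
Once imgbased.log carries timestamps, the largest gaps between consecutive entries should point at the step that is taking the time. The following is a minimal Python sketch under a stated assumption: each timestamped line starts with "YYYY-MM-DD HH:MM:SS,mmm". The real imgbased.log layout is not shown in this bug, so TS_RE may need adjusting, and the function name slowest_steps is illustrative only.

```python
import re
from datetime import datetime

# ASSUMPTION: timestamped lines start with "YYYY-MM-DD HH:MM:SS,mmm".
# The actual imgbased.log format is not shown in this bug report.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def slowest_steps(lines, top=3):
    """Return the `top` largest gaps as (seconds, line_before_the_gap)."""
    stamped = []
    for line in lines:
        match = TS_RE.match(line)
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            stamped.append((timestamp, line.rstrip()))
    # Pair each entry with the next one and measure the time between them.
    gaps = [
        ((next_ts - ts).total_seconds(), text)
        for (ts, text), (next_ts, _) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]
```

Each returned tuple names the log line that preceded a long pause, which is usually the operation that was running during it. A gap of about four minutes after an mkfs entry, for instance, would suggest slow disks rather than a slow RPM download.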

Comment 5 RamaKasturi 2017-07-03 10:29:18 UTC

Thanks Ryan for the update. Let me try the same with RHV-M 4.1.3 and retest this.

So for RHHI customers who are on RHV-H 4.1.2, do we recommend that they upgrade RHV-M to 4.1.3 first and then upgrade the nodes, because of the timeout issue that is being fixed in 4.1.3?

Comment 6 Ryan Barry 2017-07-03 10:31:05 UTC

RHV-H 4.1.3 also threads all of the upgrade operations instead of running them sequentially, so it will upgrade much faster in general.

Upgrading RHV-M to 4.1.3 first is probably a good idea, but not strictly necessary.

Comment 7 RamaKasturi 2017-07-04 09:48:45 UTC

I have updated the engine to the latest Red Hat Virtualization Manager version: 4.1.3.5-0.1.el7.

I had three nodes. The upgrade succeeded on the first node. On the second, I saw that the node's HA status was 'Local Maintenance'. Upgrading the third node from the UI still fails with the error "processing stopped due to timeout."

I have copied /tmp/imgbased.log and the vdsm and engine logs to the location below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

Comment 8 Ryan Barry 2017-07-04 14:30:13 UTC

(In reply to RamaKasturi from comment #7)
> I had three nodes. The upgrade succeeded on the first node. On the second, I
> saw that the node's HA status was 'Local Maintenance'. Upgrading the third
> node from the UI still fails with the error "processing stopped due to
> timeout."
> 
> I have copied /tmp/imgbased.log and the vdsm and engine logs to the location
> below.
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

I'm not sure whether going into "Local Maintenance" for HA is normal or not, since this is handled by ovirt-hosted-engine-agent/broker.

Unless you are also upgrading to RHVH 4.1.3 (which adds timestamps to the logs), it's not possible to say what's happening here.

It's likely that the disks on the other host were simply too slow. The last reports from virt QE show that mkfs is taking ~4 minutes on systems with slow disks (or a large number of disks in the VG), which is not something under our control.

Comment 9 Ryan Barry 2017-07-18 09:47:09 UTC

Is this still reproducible?

Comment 10 RamaKasturi 2017-07-18 12:17:07 UTC

Hi Ryan,
   
   I have not had time to check whether this is still reproducible. But I think a note should be added to the guide asking the user to upgrade the engine to 4.1.3 before upgrading RHV-H.

Thanks
kasturi
