Bug 1467185
Summary:          clicking on upgrade leaves the ovirt-node in install failed state
Product:          [oVirt] ovirt-node
Component:        Installation & Update
Version:          4.1
Status:           CLOSED CURRENTRELEASE
Severity:         medium
Priority:         unspecified
Reporter:         RamaKasturi <knarra>
Assignee:         Ryan Barry <rbarry>
QA Contact:       Huijuan Zhao <huzhao>
CC:               bugs, cshao, dguo, huzhao, jiawu, knarra, qiyuan, rbarry, sbonazzo, yaniwang, ycui, yzhao
Flags:            rbarry: needinfo-
Target Milestone: ---
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         If docs needed, set a value
Story Points:     ---
Last Closed:      2017-08-01 09:47:08 UTC
Type:             Bug
Regression:       ---
Mount Type:       ---
Documentation:    ---
Category:         ---
oVirt Team:       Node
Cloudforms Team:  ---
Description  RamaKasturi  2017-07-03 06:48:43 UTC
Created attachment 1293752 [details]
Attaching screenshot for the event messages in UI
I copied /tmp/imgbased.log, sosreports from the machine where the issue occurred, and engine.log to the location below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

The following exception is seen in the engine logs:

2017-07-03 05:53:59,401-04 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [cfe76218-65da-4f0b-bc2c-343d3735f646] Error during deploy dialog: java.io.IOException: Unexpected connection termination
    at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:376) [otopi.jar:]
    at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:393) [otopi.jar:]
    at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.threadMain(VdsDeployBase.java:304) [bll.jar:]
    at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.lambda$new$0(VdsDeployBase.java:383) [bll.jar:]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]

2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] SSH error running command root.eng.blr.redhat.com:'umask 0077; MYTMP="$(TMPDIR="${OVIRT_TMPDIR}" mktemp -d -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; tar --warning=no-timestamp -C "${MYTMP}" -x && "${MYTMP}"/ovirt-host-mgmt DIALOG/dialect=str:machine DIALOG/customization=bool:True': SSH session hard timeout host 'root.eng.blr.redhat.com'

2017-07-03 05:53:59,404-04 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (pool-5-thread-1) [cfe76218-65da-4f0b-bc2c-343d3735f646] Exception: javax.naming.TimeLimitExceededException: SSH session hard timeout host 'root.eng.blr.redhat.com'
    at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:475) [uutils.jar:]
    at org.ovirt.engine.core.uutils.ssh.SSHDialog.executeCommand(SSHDialog.java:317) [uutils.jar:]
    at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.execute(VdsDeployBase.java:563) [bll.jar:]
    at org.ovirt.engine.core.bll.host.HostUpgradeManager.update(HostUpgradeManager.java:99) [bll.jar:]
    at org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand.executeCommand(UpgradeHostInternalCommand.java:72) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
    at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:511) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:756) [bll.jar:]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_131]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_131]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
    at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doMethodInterception(Jsr299BindingsInterceptor.java:80) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:93) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:63)
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)

Ryan Barry:

There was an upgrade available, and it was successfully pulled and installed. The host was upgraded to 4.1.2 async (for the stackguard CVE). Since the 4.1.3 upgrade performs additional tasks and takes longer, we also requested https://bugzilla.redhat.com/show_bug.cgi?id=1455667

There are a couple of possible reasons for this:

* The disk was too slow
* The upgrade RPM took too long to retrieve
* RHV-M is not 4.1.3

Using a RHV-H 4.1.3 repo will also add timestamps to imgbased.log, which will give a better idea of what is taking longer, but for now, this is either a DUPLICATE or NOTABUG. Please upgrade RHV-M to 4.1.3 and retest.

RamaKasturi:

Thanks Ryan for the update. Let me try the same with RHV-M 4.1.3 and retest this.

For RHHI customers who are on RHV-H 4.1.2, do we recommend that they upgrade RHV-M to 4.1.3 first and upgrade the nodes next, due to the timeout issue being fixed with 4.1.3?

Ryan Barry:

RHV-H 4.1.3 also threads all of the upgrade operations instead of running them sequentially, so it will upgrade much faster in general. Upgrading RHV-M to 4.1.3 first is probably a good idea, but not strictly necessary.

RamaKasturi:

I have updated the engine to the latest Red Hat Virtualization Manager, version 4.1.3.5-0.1.el7.
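Ryan's point about threading the upgrade operations can be illustrated with a small shell sketch. The `step` function below is a made-up stand-in for one independent upgrade operation (the real imgbased steps such as RPM installation and filesystem creation differ): running the steps in the background and waiting on them bounds the total time by the slowest step, rather than by the sum of all steps.

```shell
#!/bin/sh
# Hypothetical stand-in for one independent upgrade operation.
step() {
    sleep 1
}

# Sequential: total wall-clock time is the sum of all steps (~3s here).
seq_start=$(date +%s)
step; step; step
seq_elapsed=$(( $(date +%s) - seq_start ))

# Threaded: total wall-clock time is roughly the slowest step (~1s here).
par_start=$(date +%s)
step & step & step &
wait
par_elapsed=$(( $(date +%s) - par_start ))

echo "sequential=${seq_elapsed}s parallel=${par_elapsed}s"
```

With a fixed SSH session timeout on the engine side, cutting total upgrade time this way directly reduces the chance of tripping the hard timeout seen in the logs above.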
RamaKasturi (comment #7):

I had three nodes. On the first node the upgrade succeeded; on the second I saw that the node's HA status was in "Local Maintenance"; and upgrading the third node from the UI still shows the issue, giving the error "processing stopped due to timeout."

I have copied /tmp/imgbased.log and the vdsm and engine logs to the location below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

Ryan Barry:

(In reply to RamaKasturi from comment #7)
> I had three nodes, first node it succeeded, second one i saw that node HA
> status was in 'Local Maintenance" and upgrading third node gives me the
> error "Now i tried to upgrade the RHV-H node from UI i still see the issue
> where "processing stopped due to timeout."
>
> i have copied the /tmp/imgbased.log, vdsm and engine logs in the location
> below.
>
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1467185/

I'm not sure whether going into "Local Maintenance" for HA is normal or not, since this is handled by ovirt-hosted-engine-agent/broker.

Unless you are also upgrading to RHVH 4.1.3 (which adds timestamps to the logs), it's not possible to say what's happening here. It's likely that the disks on the other host were simply too slow. The last reports from virt QE show that mkfs is taking ~4 minutes on systems with slow disks (or a large number of disks in the VG), which is not something under our control.

Is this still reproducible?

RamaKasturi:

Hi Ryan, I could not find time to try again and check whether this is still reproducible. But I think a note should be added to the guide asking the user to upgrade the engine to 4.1.3 before proceeding with the upgrade of RHV-H.

Thanks,
kasturi
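The "SSH session hard timeout" in the engine logs is a wall-clock limit: the engine terminates the deploy session after a fixed time regardless of whether the host is still making progress, which is why a slow mkfs or a slow RPM download can fail an otherwise healthy upgrade. The same pattern can be sketched with coreutils `timeout` (the 2-second limit and the 10-second "upgrade step" are invented for illustration; they are not the engine's actual values):

```shell
#!/bin/sh
# Simulate a long-running upgrade step (e.g. a slow mkfs) killed by a
# hard wall-clock limit imposed by the caller, no matter its progress.
rc=0
timeout 2 sh -c 'sleep 10 && echo "upgrade finished"' || rc=$?

# coreutils `timeout` exits with status 124 when the time limit was
# hit, analogous to the engine raising TimeLimitExceededException.
echo "exit status: $rc"
```

Because the limit is on the caller's side, the only fixes from the host's side are making the upgrade faster (as RHV-H 4.1.3 does by threading the operations) or using faster disks.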