Bug 1093882 - timeout, install failed: Sending iso from ovirt-engine to ovirt-node during upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Alon Bar-Lev
QA Contact: Jiri Belka
URL:
Whiteboard: infra
Duplicates: 1111086 1154862 1202810
Depends On:
Blocks: 1095889
Reported: 2014-05-03 01:30 UTC by Douglas Schilling Landgraf
Modified: 2016-02-10 19:33 UTC (History)

Fixed In Version: ovirt-3.5.0-alpha1
Clone Of:
Clones: 1095889
Environment:
Last Closed: 2014-10-17 12:20:33 UTC
oVirt Team: Infra
Embargoed:


Attachments
engine.log (12.76 KB, application/x-gzip)
2014-05-03 01:36 UTC, Douglas Schilling Landgraf


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 26777 | 0 | master | MERGED | host-deploy: upgrade to apache-sshd 0.11.0 | Never

Description Douglas Schilling Landgraf 2014-05-03 01:30:14 UTC
Description of problem:
Upgrading oVirt Node to a different version via ovirt-engine may fail with "Install Failed". When this happens (currently often), engine.log reports "timeout".

Version-Release number of selected component (if applicable):
- ovirt-engine: 3.5 and 3.4
- sshd-core-0.7.0.jar

How reproducible:
- On the oVirt Engine machine, install the oVirt Node rpm
- Deploy oVirt Node
- Put the node in maintenance mode
- Click the Upgrade button, select the iso, and click OK

Additional info:
The following command, run from the engine against the node, works:
# cat iso | gzip | ssh host "gunzip -q > /data/updates/iso | md5sum -b /data/updates/iso | cut -d ' ' -f 1 >&2"

2014-05-02 19:30:22,662 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (org.ovirt.thread.pool-6-thread-21) [252a8ecd] Timeout during node 192.168.100.128 upgrade: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.100.128'
    at org.ovirt.engine.core.utils.ssh.SSHClient.executeCommand(SSHClient.java:499) [utils.jar:]
    at org.ovirt.engine.core.utils.ssh.SSHClient.sendFile(SSHClient.java:633) [utils.jar:]
    at org.ovirt.engine.core.utils.ssh.SSHDialog.sendFile(SSHDialog.java:374) [utils.jar:]
    at org.ovirt.engine.core.bll.OVirtNodeUpgrade.execute(OVirtNodeUpgrade.java:200) [bll.jar:]
    at org.ovirt.engine.core.bll.UpgradeOvirtNodeInternalCommand.upgradeNode(UpgradeOvirtNodeInternalCommand.java:149) [bll.jar:]
    at org.ovirt.engine.core.bll.UpgradeOvirtNodeInternalCommand.executeCommand(UpgradeOvirtNodeInternalCommand.java:131) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1127) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1212) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1888) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1232) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:351) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:189) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:156) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:165) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:97) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_51]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_51]
    at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_51]
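
Editorial sketch: the manual transfer command under "Additional info" can be exercised outside the engine and its checksum compared with the source file. The helper name send_and_verify and its arguments are hypothetical and are not part of ovirt-engine. Unlike the command quoted in the description, this sketch chains gunzip and md5sum with && so the checksum is computed only after the file has been fully written, and it prints the checksum on stdout rather than stderr; both are choices of the sketch. It also assumes the destination path contains no single quotes.

```shell
#!/bin/sh
# Hypothetical helper: send_and_verify SRC DEST [RUNNER...]
# RUNNER is the command that executes the receiving side. It defaults to
# "sh -c" so the pipeline can be tried locally; pass "ssh host" to run
# the receiving side on a remote machine.
send_and_verify() {
    src=$1
    dest=$2
    shift 2
    [ $# -gt 0 ] || set -- sh -c

    # Checksum of the source file on the sending side.
    local_sum=$(md5sum -b "$src" | cut -d ' ' -f 1)

    # Compress, hand the stream to the receiver, and read back the
    # checksum of what the receiver wrote to DEST.
    remote_sum=$(gzip -c "$src" |
        "$@" "gunzip -q > '$dest' && md5sum -b '$dest' | cut -d ' ' -f 1")

    # Succeed only if both checksums match.
    [ "$local_sum" = "$remote_sum" ]
}
```

Possible usage, assuming a reachable host named "host": send_and_verify iso /data/updates/iso ssh host && echo "transfer ok".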

Comment 1 Douglas Schilling Landgraf 2014-05-03 01:36:44 UTC
Created attachment 892068 [details]
engine.log

Comment 2 Alon Bar-Lev 2014-05-03 22:07:57 UTC
Hi,

Please use the test program I sent you before opening this bug.
Please focus only on test program, do not add the noise factor of the engine.

Please remove the mina jar from class path.
Replace the sshd-core jar with [1].
Try again to run the test in a loop.

In the few times I managed to reproduce this, it appeared to be an internal bug in sshd-core-0.7.0: it simply does not return from the wait function when the following is received:

INFO: Received SSH_MSG_DISCONNECT (reason=2, msg=Received ieof for nonexistent channel 0.)

However, this version of sshd-core has been in use for a long time, since 3.0, without significant field reports, so I am not that alarmed.

I cannot reproduce this at all with [1]. There is already a pending patch for the migration [2]; I am waiting for the upstream release.

Please try it out.

[1] http://apache.mivzakim.net/mina/sshd/0.11.0/dist/sshd-core-0.11.0.jar
[2] http://gerrit.ovirt.org/#/c/26777/
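
Editorial sketch: running a test "in a loop", as requested above, could look like the following. The test program itself is not shown in this bug, so the function takes an arbitrary command; run_loop is a hypothetical name. Because the failure reported here is a hang rather than an error exit, wrapping the command in coreutils timeout is probably useful.

```shell
#!/bin/sh
# Hypothetical helper: run_loop COUNT CMD...
# Runs CMD COUNT times, reports each failing iteration on stderr,
# prints the total number of failures on stdout, and returns
# non-zero if any iteration failed.
run_loop() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@"; then
            failures=$((failures + 1))
            echo "iteration $i failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$failures"
    [ "$failures" -eq 0 ]
}
```

Possible usage, where ./ssh-test stands in for the actual test program: run_loop 50 timeout 300 ./ssh-test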

Comment 3 Douglas Schilling Landgraf 2014-05-04 22:32:30 UTC
Hi,

(In reply to Alon Bar-Lev from comment #2)
> Hi,
> 
> Please use the test program I sent you before opening this bug.

I tested it before creating this bug, but still had no success with sshd-core 0.7.0. I created the bug only to track it.

> Please focus only on test program, do not add the noise factor of the engine.
> 
> Please remove the mina jar from class path.
> Replace the sshd-core jar with[1].
> Try again to run the test in a loop.
> 
> From the few times I managed to reproduce this, it seems that it is an
> internal bug of sshd-core-0.7.0, it simply does not return from the wait
> function when:
> 
> INFO: Received SSH_MSG_DISCONNECT (reason=2, msg=Received ieof for
> nonexistent channel 0.)

Agreed, every time I have seen this issue I recall seeing this error, even in the ssh logs. I have updated the Makefile and it works nicely now.

> 
> However, this version of sshd-core is used long time without significant
> field reports, since 3.0, so I am not that alarmed.
I noticed this problem in 3.4/3.5 only.

> I cannot reproduce this at all with[1], there is already pending patch for
> migration[2], I am waiting for upstream release.
> 

I have tested your patch against ovirt-engine as well, and now I don't see any errors; all good.

> Please try it out.
> 
> [1] http://apache.mivzakim.net/mina/sshd/0.11.0/dist/sshd-core-0.11.0.jar
> [2] http://gerrit.ovirt.org/#/c/26777/

Comment 4 Alon Bar-Lev 2014-05-05 06:08:08 UTC
Thanks for testing!

Setting this for 3.5, as it is too late for an infra change in 3.4.

Comment 5 Alon Bar-Lev 2014-07-07 06:56:02 UTC
*** Bug 1111086 has been marked as a duplicate of this bug. ***

Comment 6 Alon Bar-Lev 2014-10-07 08:47:36 UTC
The workaround until 3.5 is to put the apache-sshd-0.11.0 jar [1] in the engine module path.

1. Copy the /usr/share/ovirt-engine/modules/org/apache/sshd directory to /usr/share/ovirt-engine-workaround/modules/org/apache/sshd

2. Replace /usr/share/ovirt-engine-workaround/modules/org/apache/sshd/sshd-core.jar with [1].

3. Add /etc/ovirt-engine/engine.conf.d/80-sshd-core-workaround.conf
---
ENGINE_JAVA_MODULEPATH="/usr/share/ovirt-engine-workaround/modules:${ENGINE_JAVA_MODULEPATH}"
---

4. Restart the engine.

[1] http://search.maven.org/remotecontent?filepath=org/apache/sshd/sshd-core/0.11.0/sshd-core-0.11.0.jar
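
Editorial sketch: steps 1 to 3 of the workaround above could be scripted roughly as follows. The function name and the ENGINE_MODULES, WORKAROUND_MODULES and ENGINE_CONF_D variables are hypothetical knobs added so the steps can be tried against a scratch directory; their defaults are the paths given in the comment. The sketch assumes the replacement jar has already been downloaded, and it replaces whichever sshd-core.jar it finds under the copied directory, since the exact location inside the module directory may differ between installs. Step 4 (restarting the engine) is left to the operator.

```shell
#!/bin/sh
# Hypothetical helper: apply_sshd_workaround /path/to/sshd-core-0.11.0.jar
apply_sshd_workaround() {
    jar=$1
    src=${ENGINE_MODULES:-/usr/share/ovirt-engine/modules}
    dst=${WORKAROUND_MODULES:-/usr/share/ovirt-engine-workaround/modules}
    confd=${ENGINE_CONF_D:-/etc/ovirt-engine/engine.conf.d}

    # Refuse to run twice, so the copy below never nests directories.
    if [ -e "$dst/org/apache/sshd" ]; then
        echo "workaround directory already exists: $dst" >&2
        return 1
    fi

    # Step 1: copy the sshd module directory into the workaround tree.
    mkdir -p "$dst/org/apache" "$confd" || return 1
    cp -a "$src/org/apache/sshd" "$dst/org/apache/" || return 1

    # Step 2: replace the copied sshd-core.jar with the new jar.
    target=$(find "$dst/org/apache/sshd" -name sshd-core.jar | head -n 1)
    if [ -z "$target" ]; then
        echo "sshd-core.jar not found under $dst/org/apache/sshd" >&2
        return 1
    fi
    cp "$jar" "$target" || return 1

    # Step 3: prepend the workaround tree to the engine module path.
    printf 'ENGINE_JAVA_MODULEPATH="%s:${ENGINE_JAVA_MODULEPATH}"\n' "$dst" \
        > "$confd/80-sshd-core-workaround.conf"
}
```

The original module directory is only read, never modified, so removing the workaround directory and the conf file should restore the previous behaviour.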

Comment 7 Sandro Bonazzola 2014-10-17 12:20:33 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 8 Fabian Deutsch 2014-10-21 08:44:23 UTC
*** Bug 1154862 has been marked as a duplicate of this bug. ***

Comment 9 Alon Bar-Lev 2014-10-21 08:47:29 UTC
Addition to comment #6: this issue can usually be resolved by retrying manually, without any change of software.

Comment 10 Alon Bar-Lev 2015-04-07 08:44:55 UTC
*** Bug 1202810 has been marked as a duplicate of this bug. ***
