Description of problem:
During the upgrade of oVirt Node to a different version via ovirt-engine, the upgrade might fail with "Install Failed". When this error happens (currently often), engine.log reports "timeout".

Version-Release number of selected component (if applicable):
- ovirt-engine: 3.5 and 3.4
- sshd-core-0.7.0.jar

How reproducible:
- On the oVirt Engine machine, install the oVirt Node rpm
- Deploy oVirt Node
- Put it in maintenance mode
- Click the upgrade button and select the iso -> OK

Additional info:
The below command works from engine to node:
# cat iso | gzip | ssh host "gunzip -q > /data/updates/iso | md5sum -b /data/updates/iso | cut -d ' ' -f 1 >&2"

2014-05-02 19:30:22,662 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (org.ovirt.thread.pool-6-thread-21) [252a8ecd] Timeout during node 192.168.100.128 upgrade: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.100.128'
    at org.ovirt.engine.core.utils.ssh.SSHClient.executeCommand(SSHClient.java:499) [utils.jar:]
    at org.ovirt.engine.core.utils.ssh.SSHClient.sendFile(SSHClient.java:633) [utils.jar:]
    at org.ovirt.engine.core.utils.ssh.SSHDialog.sendFile(SSHDialog.java:374) [utils.jar:]
    at org.ovirt.engine.core.bll.OVirtNodeUpgrade.execute(OVirtNodeUpgrade.java:200) [bll.jar:]
    at org.ovirt.engine.core.bll.UpgradeOvirtNodeInternalCommand.upgradeNode(UpgradeOvirtNodeInternalCommand.java:149) [bll.jar:]
    at org.ovirt.engine.core.bll.UpgradeOvirtNodeInternalCommand.executeCommand(UpgradeOvirtNodeInternalCommand.java:131) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1127) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1212) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1888) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1232) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:351) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:189) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:156) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:165) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:97) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_51]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_51]
    at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_51]
Created attachment 892068 [details] engine.log
Hi,

Please use the test program I sent you before opening this bug.
Please focus only on test program, do not add the noise factor of the engine.

Please remove the mina jar from class path.
Replace the sshd-core jar with[1].
Try again to run the test in a loop.

From the few times I managed to reproduce this, it seems that it is an internal bug of sshd-core-0.7.0, it simply does not return from the wait function when:

INFO: Received SSH_MSG_DISCONNECT (reason=2, msg=Received ieof for nonexistent channel 0.)

However, this version of sshd-core is used long time without significant field reports, since 3.0, so I am not that alarmed.
I cannot reproduce this at all with[1], there is already pending patch for migration[2], I am waiting for upstream release.

Please try it out.

[1] http://apache.mivzakim.net/mina/sshd/0.11.0/dist/sshd-core-0.11.0.jar
[2] http://gerrit.ovirt.org/#/c/26777/
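For running the test in a loop as suggested above, a small shell helper along these lines may be useful. It is only a sketch: the test program itself is not shown in this thread, so the command passed in is whatever launches it on your machine, and the `run_in_loop` name and its arguments are made up for illustration. It assumes `timeout` from GNU coreutils is available, so that a run which never returns (the symptom described above) is counted as a failure instead of blocking the loop.

```shell
# run_in_loop COUNT LIMIT_SECS CMD [ARGS...]
# Runs CMD COUNT times, killing any run that exceeds LIMIT_SECS, and prints
# the number of failed or timed-out runs on stdout.
# Hypothetical helper, not part of ovirt-engine or sshd-core.
run_in_loop() {
    count="$1"
    limit="$2"
    shift 2
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # timeout(1) returns non-zero if CMD fails or had to be killed.
        if ! timeout "$limit" "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "run $i: failed or timed out" >&2
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `run_in_loop 50 120 ./run-test.sh` would run a placeholder `./run-test.sh` wrapper 50 times with a 120 second limit per run and print how many runs failed.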
Hi,

(In reply to Alon Bar-Lev from comment #2)
> Hi,
>
> Please use the test program I sent you before opening this bug.

I tested it before creating this bug, but still had no success with sshd-core 0.7.0. I have created the bug only to track it.

> Please focus only on test program, do not add the noise factor of the engine.
>
> Please remove the mina jar from class path.
> Replace the sshd-core jar with[1].
> Try again to run the test in a loop.
>
> From the few times I managed to reproduce this, it seems that it is an
> internal bug of sshd-core-0.7.0, it simply does not return from the wait
> function when:
>
> INFO: Received SSH_MSG_DISCONNECT (reason=2, msg=Received ieof for
> nonexistent channel 0.)

Agreed, every time I have seen this issue I recall seeing this error, even in the ssh logs. I have updated the Makefile and it works nicely now.

>
> However, this version of sshd-core is used long time without significant
> field reports, since 3.0, so I am not that alarmed.

I noticed this problem in 3.4/3.5 only.

> I cannot reproduce this at all with[1], there is already pending patch for
> migration[2], I am waiting for upstream release.
>

I have tested your patch against ovirt-engine as well and now I don't see any error, all good.

> Please try it out.
>
> [1] http://apache.mivzakim.net/mina/sshd/0.11.0/dist/sshd-core-0.11.0.jar
> [2] http://gerrit.ovirt.org/#/c/26777/
Thanks for testing! Setting this for 3.5 as it is too late for infra change for 3.4.
*** Bug 1111086 has been marked as a duplicate of this bug. ***
Workaround until 3.5 is to put apache-sshd-0.11.0 jar[1] in engine module path.

1. copy /usr/share/ovirt-engine/modules/org/apache/sshd directory to /usr/share/ovirt-engine-workaround/modules/org/apache/sshd
2. replace /usr/share/ovirt-engine-workaround/modules/org/apache/sshd/sshd-core.jar with [1].
3. add /etc/ovirt-engine/engine.conf.d/80-sshd-core-workaround.conf
---
ENGINE_JAVA_MODULEPATH="/usr/share/ovirt-engine-workaround/modules:${ENGINE_JAVA_MODULEPATH}"
---
4. restart engine.

[1] http://search.maven.org/remotecontent?filepath=org/apache/sshd/sshd-core/0.11.0/sshd-core-0.11.0.jar
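The workaround steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name `apply_sshd_workaround` and the `SRC_MODULES`, `DST_MODULES` and `CONF_DIR` override variables are made up here so the steps can first be tried against a scratch directory. The jar location is taken exactly as written in step 2, and the sshd-core-0.11.0 jar is assumed to be already downloaded. The restart in step 4 is left to the operator.

```shell
# apply_sshd_workaround NEW_JAR
# Hypothetical helper; the defaults mirror the paths given in the workaround steps.
apply_sshd_workaround() {
    new_jar="$1"
    src_modules="${SRC_MODULES:-/usr/share/ovirt-engine/modules}"
    dst_modules="${DST_MODULES:-/usr/share/ovirt-engine-workaround/modules}"
    conf_dir="${CONF_DIR:-/etc/ovirt-engine/engine.conf.d}"

    # Step 1: copy the stock sshd module directory into the workaround tree.
    mkdir -p "$dst_modules/org/apache" || return 1
    cp -a "$src_modules/org/apache/sshd" "$dst_modules/org/apache/" || return 1

    # Step 2: replace the copied jar with the newer one (path as given in step 2).
    cp "$new_jar" "$dst_modules/org/apache/sshd/sshd-core.jar" || return 1

    # Step 3: prepend the workaround tree to the engine module path.
    # \$ keeps ${ENGINE_JAVA_MODULEPATH} literal in the generated file.
    cat > "$conf_dir/80-sshd-core-workaround.conf" <<EOF || return 1
ENGINE_JAVA_MODULEPATH="$dst_modules:\${ENGINE_JAVA_MODULEPATH}"
EOF

    # Step 4 is not automated here.
    echo "Workaround files written; restart the engine to apply." >&2
}
```

Run as root with the downloaded jar as the only argument, for example `apply_sshd_workaround /tmp/sshd-core-0.11.0.jar`, then restart the engine. The original module directory is left untouched, so removing the workaround tree and the 80-sshd-core-workaround.conf file should revert the change.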
oVirt 3.5 has been released and should include the fix for this issue.
*** Bug 1154862 has been marked as a duplicate of this bug. ***
Addition to comment#6: this issue can usually be resolved using manual retry without any change of software.
*** Bug 1202810 has been marked as a duplicate of this bug. ***