Hide Forgot
Created attachment 1163225 [details] engine.log Description of problem: running ovirt-system tests on Lago, I'm getting (from time to time!) failure to install a host. Engine log: 2016-05-31 08:03:16,884 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [GetVdsByHostName] compiled 2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] Compiled stored procedure. Call string is [{call getstorage_poolsbyv dsgroupid(?)}] 2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [Getstorage_poolsByVdsGroupId] compiled 2016-05-31 08:03:16,906 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connecting 'root.201.3' 2016-05-31 08:03:16,987 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connect error: java.io.IOException: SSH session closed during connection 'root.201.3' at org.ovirt.engine.core.uutils.ssh.SSHClient.connect(SSHClient.java:309) [uutils.jar:] at org.ovirt.engine.core.bll.utils.EngineSSHClient.connect(EngineSSHClient.java:55) [bll.jar:] at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canConnect(AddVdsCommand.java:451) [bll.jar:] at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canDoAction(AddVdsCommand.java:347) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.internalCanDoAction(CommandBase.java:791) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:372) [bll.jar:] at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:484) [bll.jar:] at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:466) [bll.jar:] at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:419) [bll.jar:] ... 2016-05-31 08:03:16,994 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] Failed to establish session with host 'lago_basic_suite_3_6_host1': SSH session closed during connection 'root.201.3' 2016-05-31 08:03:16,995 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] CanDoAction of action 'AddVds' failed for user admin@internal. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server 192.168.201.3,VDS_CANNOT_CONNECT_TO_SERVER 2016-05-31 08:03:17,014 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-10) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.] Server log: 2016-05-31 08:03:16,948 INFO [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Client session created 2016-05-31 08:03:16,955 INFO [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Server version string: SSH-2.0-OpenSSH_6.6.1 2016-05-31 08:03:16,955 INFO [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: server->client aes128-ctr hmac-sha2-256 none 2016-05-31 08:03:16,955 INFO [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: client->server aes128-ctr hmac-sha2-256 none 2016-05-31 08:03:16,983 WARN [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-2) Exception caught: org.apache.sshd.common.SshException: KeyExchange signature verification failed at org.apache.sshd.client.kex.AbstractDHGClient.next(AbstractDHGClient.java:132) at org.apache.sshd.common.session.AbstractSession.doHandleMessage(AbstractSession.java:393) at org.apache.sshd.common.session.AbstractSession.handleMessage(AbstractSession.java:295) at org.apache.sshd.client.session.ClientSessionImpl.handleMessage(ClientSessionImpl.java:266) at org.apache.sshd.common.session.AbstractSession.decode(AbstractSession.java:720) at org.apache.sshd.common.session.AbstractSession.messageReceived(AbstractSession.java:277) at org.apache.sshd.common.AbstractSessionIoHandler.messageReceived(AbstractSessionIoHandler.java:54) at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:188) at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:174) at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126) [rt.jar:1.8.0_91] at sun.nio.ch.Invoker$2.run(Invoker.java:218) [rt.jar:1.8.0_91] at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112) [rt.jar:1.8.0_91] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91] And indeed on the host: sshd[9772]: error: Received disconnect from 192.168.201.2: 3: KeyExchange signature verification failed [preauth] Version-Release number of selected component (if applicable): ovirt-engine-3.6.7-1.el6.noarch ovirt-host-deploy-java-1.4.1-1.el6.noarch How reproducible: Sometimes. Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: ovirt-host-deploy-1.4.1-1.el6.noarch ovirt-image-uploader-3.6.0-1.el6.noarch ovirt-engine-setup-plugin-websocket-proxy-3.6.7-1.el6.noarch ovirt-engine-tools-3.6.7-1.el6.noarch ovirt-engine-setup-plugin-ovirt-engine-3.6.7-1.el6.noarch ovirt-engine-backend-3.6.7-1.el6.noarch ovirt-engine-extension-aaa-jdbc-1.0.7-1.el6.noarch ovirt-setup-lib-1.0.1-1.el6.noarch ebay-cors-filter-1.0.1-0.1.ovirt.el6.noarch ovirt-host-deploy-java-1.4.1-1.el6.noarch ovirt-engine-cli-3.6.2.0-1.el6.noarch ovirt-iso-uploader-3.6.0-1.el6.noarch ovirt-engine-wildfly-overlay-8.0.5-1.el6.noarch ovirt-vmconsole-proxy-1.0.2-1.el6.noarch ovirt-engine-wildfly-8.2.1-1.el6.x86_64 ovirt-engine-tools-backup-3.6.7-1.el6.noarch ovirt-engine-websocket-proxy-3.6.7-1.el6.noarch ovirt-engine-webadmin-portal-3.6.7-1.el6.noarch ovirt-engine-vmconsole-proxy-helper-3.6.7-1.el6.noarch ovirt-engine-setup-3.6.7-1.el6.noarch ovirt-engine-restapi-3.6.7-1.el6.noarch ovirt-engine-dbscripts-3.6.7-1.el6.noarch ovirt-engine-3.6.7-1.el6.noarch ovirt-engine-extensions-api-impl-3.6.7-1.el6.noarch ovirt-engine-lib-3.6.7-1.el6.noarch ovirt-engine-sdk-python-3.6.5.0-1.el6.noarch ovirt-vmconsole-1.0.2-1.el6.noarch ovirt-engine-setup-base-3.6.7-1.el6.noarch ovirt-engine-setup-plugin-ovirt-engine-common-3.6.7-1.el6.noarch ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.7-1.el6.noarch ovirt-engine-userportal-3.6.7-1.el6.noarch
Created attachment 1163226 [details] server.log
Oved, looks here: https://issues.jboss.org/browse/ENTESB-2101 They were facing a similar issue in JBoss Fuse 6.2. It seams somehow related to an older version of org.apache.sshd.sshd-core. Paolo Antinori said that requiring version v.0.12.0 of org.apache.sshd.sshd-core seams enough to prevent it. In 3.6 we still have: <dependency> <groupId>org.apache.sshd</groupId> <artifactId>sshd-core</artifactId> <version>${sshd-core.version}</version> </dependency> ... <sshd-core.version>0.11.0</sshd-core.version>
Perhaps it's https://issues.apache.org/jira/browse/SSHD-330 . I'm wondering why not to upgrade to 0.14. While I agree much newer versions (which I believe we upgraded to in 4.0?) probably a more scary to upgrade to in a 3.6.x, we need to look at the fixes in 0.13, 0.14 and understand if they might affect us: 0.13: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12327342 0.14: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12329012 BTW, I'm pretty sure it was not 1 out of 256 (my failures), but other than two failures today, I could not reproduce it yet... I'll see if it does.
We are building with 0.11.0 also on 4.0 RC
I just successfully installed several hosts with 0.12.0 version of apache-sshd and no code changes were required for this upgrade, so we should use apache-sshd 0.12.0 for 3.6/4.0. Independently on this bug we have already started to create patches to use latest stable version of apache-sshd 1.2.0. But there were significant API changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1
(In reply to Martin Perina from comment #5) > I just successfully installed several hosts with 0.12.0 version of > apache-sshd and no code changes were required for this upgrade, so we should > use apache-sshd 0.12.0 for 3.6/4.0. I installed ~20-50 hosts. Yet 2 in the same day failed. So we need to run it A LOT. > Independently on this bug we have already started to create patches to use > latest stable version of apache-sshd 1.2.0. But there were significant API > changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1 +1.
I've installed 100 hosts (4 ones for 25 times) on ovirt-engine 4.0.2.6-0.1.el7ev using python sdk. None of those had any SSH problems.