Bug 1341204 - Host deploy fails (from time to time!) on KeyExchange signature verification failed
Summary: Host deploy fails (from time to time!) on KeyExchange signature verification ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Host-Deploy
Version: 3.6.7
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ovirt-4.0.0-rc2
: 4.0.0
Assignee: Martin Perina
QA Contact: Aleksei Slaikovskii
URL:
Whiteboard:
Depends On: 1367107
Blocks: 1341652 1347337
TreeView+ depends on / blocked
 
Reported: 2016-05-31 13:06 UTC by Yaniv Kaul
Modified: 2016-08-17 14:36 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1347337 (view as bug list)
Environment:
Last Closed: 2016-08-17 14:36:17 UTC
oVirt Team: Infra
rule-engine: ovirt-4.0.0+
mgoldboi: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+


Attachments (Terms of Use)
engine.log (15.41 KB, application/x-gzip)
2016-05-31 13:06 UTC, Yaniv Kaul
no flags Details
server.log (5.19 KB, application/x-gzip)
2016-05-31 13:07 UTC, Yaniv Kaul
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 58438 0 master MERGED core: Upgrade apache-sshd to 0.12.0 2016-06-04 12:19:58 UTC
oVirt gerrit 58619 0 ovirt-engine-4.0 MERGED core: Upgrade apache-sshd to 0.12.0 2016-06-04 17:05:59 UTC

Description Yaniv Kaul 2016-05-31 13:06:44 UTC
Created attachment 1163225 [details]
engine.log

Description of problem:
running ovirt-system tests on Lago, I'm getting (from time to time!) failure to install a host. 
Engine log:
2016-05-31 08:03:16,884 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [GetVdsByHostName] compiled
2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] Compiled stored procedure. Call string is [{call getstorage_poolsbyv
dsgroupid(?)}]
2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [Getstorage_poolsByVdsGroupId] compiled
2016-05-31 08:03:16,906 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connecting 'root.201.3'
2016-05-31 08:03:16,987 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connect error: java.io.IOException: SSH session closed during connection 'root.201.3'
        at org.ovirt.engine.core.uutils.ssh.SSHClient.connect(SSHClient.java:309) [uutils.jar:]
        at org.ovirt.engine.core.bll.utils.EngineSSHClient.connect(EngineSSHClient.java:55) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canConnect(AddVdsCommand.java:451) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canDoAction(AddVdsCommand.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalCanDoAction(CommandBase.java:791) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:372) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:484) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:466) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:419) [bll.jar:]

...
2016-05-31 08:03:16,994 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] Failed to establish session with host 'lago_basic_suite_3_6_host1': SSH session closed during connection 'root.201.3'
2016-05-31 08:03:16,995 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] CanDoAction of action 'AddVds' failed for user admin@internal. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server 192.168.201.3,VDS_CANNOT_CONNECT_TO_SERVER
2016-05-31 08:03:17,014 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-10) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]




Server log:
2016-05-31 08:03:16,948 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Client session created
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Server version string: SSH-2.0-OpenSSH_6.6.1
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: server->client aes128-ctr hmac-sha2-256 none
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: client->server aes128-ctr hmac-sha2-256 none
2016-05-31 08:03:16,983 WARN  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-2) Exception caught: org.apache.sshd.common.SshException: KeyExchange signature verification failed
        at org.apache.sshd.client.kex.AbstractDHGClient.next(AbstractDHGClient.java:132)
        at org.apache.sshd.common.session.AbstractSession.doHandleMessage(AbstractSession.java:393)
        at org.apache.sshd.common.session.AbstractSession.handleMessage(AbstractSession.java:295)
        at org.apache.sshd.client.session.ClientSessionImpl.handleMessage(ClientSessionImpl.java:266)
        at org.apache.sshd.common.session.AbstractSession.decode(AbstractSession.java:720)
        at org.apache.sshd.common.session.AbstractSession.messageReceived(AbstractSession.java:277)
        at org.apache.sshd.common.AbstractSessionIoHandler.messageReceived(AbstractSessionIoHandler.java:54)
        at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:188)
        at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:174)
        at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126) [rt.jar:1.8.0_91]
        at sun.nio.ch.Invoker$2.run(Invoker.java:218) [rt.jar:1.8.0_91]
        at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]


And indeed on the host:
sshd[9772]: error: Received disconnect from 192.168.201.2: 3: KeyExchange signature verification failed [preauth]

Version-Release number of selected component (if applicable):
ovirt-engine-3.6.7-1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch

How reproducible:
Sometimes.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
ovirt-host-deploy-1.4.1-1.el6.noarch
ovirt-image-uploader-3.6.0-1.el6.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.6.7-1.el6.noarch
ovirt-engine-tools-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.6.7-1.el6.noarch
ovirt-engine-backend-3.6.7-1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.7-1.el6.noarch
ovirt-setup-lib-1.0.1-1.el6.noarch
ebay-cors-filter-1.0.1-0.1.ovirt.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch
ovirt-engine-cli-3.6.2.0-1.el6.noarch
ovirt-iso-uploader-3.6.0-1.el6.noarch
ovirt-engine-wildfly-overlay-8.0.5-1.el6.noarch
ovirt-vmconsole-proxy-1.0.2-1.el6.noarch
ovirt-engine-wildfly-8.2.1-1.el6.x86_64
ovirt-engine-tools-backup-3.6.7-1.el6.noarch
ovirt-engine-websocket-proxy-3.6.7-1.el6.noarch
ovirt-engine-webadmin-portal-3.6.7-1.el6.noarch
ovirt-engine-vmconsole-proxy-helper-3.6.7-1.el6.noarch
ovirt-engine-setup-3.6.7-1.el6.noarch
ovirt-engine-restapi-3.6.7-1.el6.noarch
ovirt-engine-dbscripts-3.6.7-1.el6.noarch
ovirt-engine-3.6.7-1.el6.noarch
ovirt-engine-extensions-api-impl-3.6.7-1.el6.noarch
ovirt-engine-lib-3.6.7-1.el6.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el6.noarch
ovirt-vmconsole-1.0.2-1.el6.noarch
ovirt-engine-setup-base-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.7-1.el6.noarch
ovirt-engine-userportal-3.6.7-1.el6.noarch

Comment 1 Yaniv Kaul 2016-05-31 13:07:05 UTC
Created attachment 1163226 [details]
server.log

Comment 2 Simone Tiraboschi 2016-05-31 14:14:18 UTC
Oved, looks here:
https://issues.jboss.org/browse/ENTESB-2101
They were facing a similar issue in JBoss Fuse 6.2.
It seams somehow related to an older version of org.apache.sshd.sshd-core.

Paolo Antinori said that requiring version v.0.12.0 of org.apache.sshd.sshd-core seams enough to prevent it.

In 3.6 we still have:

      <dependency>
        <groupId>org.apache.sshd</groupId>
        <artifactId>sshd-core</artifactId>
        <version>${sshd-core.version}</version>
      </dependency>
...
<sshd-core.version>0.11.0</sshd-core.version>

Comment 3 Yaniv Kaul 2016-05-31 15:24:27 UTC
Perhaps it's https://issues.apache.org/jira/browse/SSHD-330 .

I'm wondering why not to upgrade to 0.14. While I agree much newer versions (which I believe we upgraded to in 4.0?) probably a more scary to upgrade to in a 3.6.x, we need to look at the fixes in 0.13, 0.14 and understand if they might affect us:
0.13: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12327342
0.14: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12329012

BTW, I'm pretty sure it was not 1 out of 256 (my failures), but other than two failures today, I could not reproduce it yet... I'll see if it does.

Comment 4 Simone Tiraboschi 2016-05-31 16:01:58 UTC
We are building with 0.11.0 also on 4.0 RC

Comment 5 Martin Perina 2016-06-01 11:45:45 UTC
I just successfully installed several hosts with 0.12.0 version of apache-sshd and no code changes were required for this upgrade, so we should use apache-sshd 0.12.0 for 3.6/4.0.
Independently on this bug we have already started to create patches to use latest stable version of apache-sshd 1.2.0. But there were significant API changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1

Comment 6 Yaniv Kaul 2016-06-01 13:28:54 UTC
(In reply to Martin Perina from comment #5)
> I just successfully installed several hosts with 0.12.0 version of
> apache-sshd and no code changes were required for this upgrade, so we should
> use apache-sshd 0.12.0 for 3.6/4.0.

I installed ~20-50 hosts. Yet 2 in the same day failed. So we need to run it A LOT.

> Independently on this bug we have already started to create patches to use
> latest stable version of apache-sshd 1.2.0. But there were significant API
> changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1

+1.

Comment 9 Aleksei Slaikovskii 2016-08-16 12:03:09 UTC
I've installed 100 hosts (4 ones for 25 times) on ovirt-engine 4.0.2.6-0.1.el7ev using python sdk.
None of those had any SSH problems.


Note You need to log in before you can comment on or make changes to this bug.